Re: dfs health Dashboard

2009-11-02 Thread V Satish Kumar
Hi Allen, Once a node goes down, the dfs health dashboard takes 40 to 45 minutes to refresh its status from live to dead. But I see the 'Last Contact' field on the live 'dfsnodelist' page being updated every 30 seconds. Allen Wittenauer wrote: On 11/1/09 10:11 PM, "V Satish Kumar"

Re: XML input to map function

2009-11-02 Thread Amandeep Khurana
On Mon, Nov 2, 2009 at 4:39 PM, Vipul Sharma wrote: > Okay, I think I was not clear in my first post about the question. Let me try again. I have an application that gets a large number of xml files every minute which are copied over to hdfs. Each file is around 1MB and contains several

RE: Multiple Input Paths

2009-11-02 Thread Vipul Sharma
Mark, were you able to concatenate the two xml files? What did you do to keep the resulting xml well formed? Regards, Vipul Sharma, Cell: 281-217-0761

Re: XML input to map function

2009-11-02 Thread Vipul Sharma
Okay, I think I was not clear in my first post about the question. Let me try again. I have an application that gets a large number of xml files every minute which are copied over to hdfs. Each file is around 1MB and contains several records. The files are well formed xml with a starting tag

Re: hadoop eclipse plugin

2009-11-02 Thread Martin Hall
On the other hand, a NetBeans plug-in has been and is being super-maintained, so take a look at http://www.hadoopstudio.org if you're not completely wedded to Eclipse. :) Martin On Nov 2, 2009, at 1:12 PM, Philip Zeyliger wrote: Hi Le, Unfortunately as of late the Eclipse plug-in has been

Re: XML input to map function

2009-11-02 Thread Amandeep Khurana
Are the xmls in flat files or stored in HBase? 1. If they are in flat files, you can use the StreamXmlRecordReader if that works for you. 2. Or you can read the xml into a single string and process it however you want. (This can be done if it's in a flat file or stored in an HBase table). I have
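
For option 1, a minimal sketch of wiring up StreamXmlRecordReader in the old mapred API (the <record> tag and class name here are placeholders for whatever the real files use; the stream.recordreader.* properties come from the hadoop-streaming jar, which must be on the job classpath):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.streaming.StreamInputFormat;

    public class XmlInputSetup {
        // Hands each <record>...</record> block to the mapper as one value.
        public static JobConf configure(JobConf conf) {
            conf.set("stream.recordreader.class",
                     "org.apache.hadoop.streaming.StreamXmlRecordReader");
            conf.set("stream.recordreader.begin", "<record>"); // placeholder tag
            conf.set("stream.recordreader.end", "</record>");  // placeholder tag
            conf.setInputFormat(StreamInputFormat.class);
            return conf;
        }
    }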

XML input to map function

2009-11-02 Thread Vipul Sharma
I am working on a mapreduce application that will take input from lots of small xml files rather than one big xml file. Each xml file has some records that I want to parse and load into an HBase table. How should I go about parsing the xml files and feeding them into my map functions? Should I have one mapper p

Server types

2009-11-02 Thread John Martyniak
I am getting ready to set up a hadoop cluster, starting small but planning to add nodes pretty quickly. I am planning on running the following on the cluster: hadoop, hdfs, hbase, nutch, and mahout. So far I have two Dell SC1425 dual-processor (2.8 GHz) machines, 4 GB RAM, two 1.5 TB SATA drives, on a gigabit

Re: Linux Flavor

2009-11-02 Thread Praveen Yarlagadda
Thanks, guys! That's helpful. On Mon, Nov 2, 2009 at 1:22 PM, Todd Lipcon wrote: > We generally recommend sticking with whatever Linux is already common inside your organization. Hadoop itself should run equally well on CentOS 5, RHEL 5, or any reasonably recent Ubuntu/Debian. It will probably

A way to input xml files in mapreduce

2009-11-02 Thread VIPUL SHARMA
I am new to hadoop and still learning most of the details. I am working on an application that will take input from lots of small xml files. Each xml file has some records that I want to parse and load into an HBase table. How should I go about parsing the xml files and feeding them into my map functions? Should

Re: Linux Flavor

2009-11-02 Thread Todd Lipcon
We generally recommend sticking with whatever Linux is already common inside your organization. Hadoop itself should run equally well on CentOS 5, RHEL 5, or any reasonably recent Ubuntu/Debian. It will probably be OK on any other variety of Linux as well (e.g. SLES), though they are less commonly used

Re: hadoop eclipse plugin

2009-11-02 Thread Philip Zeyliger
Hi Le, Unfortunately as of late the Eclipse plug-in has been undermaintained. I've anecdotally heard that applying the fix in http://issues.apache.org/jira/browse/HADOOP-3744 and building the plugin ("ant -Declipse.home=... binary" should build it) will make it work. Do comment on that JIRA

Re: dfs health Dashboard

2009-11-02 Thread Koji Noguchi
Hi Satish, This doesn't solve your current problem, but from 0.20 (after HADOOP-4029), "4. List of live/dead nodes is moved to separate page." Koji On 11/1/09 11:11 PM, "V Satish Kumar" wrote: > Hi, I have noticed that the dfs health dashboard (running on port 50070) takes a long time

Re: Linux Flavor

2009-11-02 Thread Tom Wheeler
Based on what I've seen on the list, larger installations tend to use RedHat Enterprise Linux or one of its clones like CentOS. On Mon, Nov 2, 2009 at 2:16 PM, Praveen Yarlagadda wrote: > Hi all, I have been running Hadoop on Ubuntu for a while now in distributed mode (4 node cluster). Just

Linux Flavor

2009-11-02 Thread Praveen Yarlagadda
Hi all, I have been running Hadoop on Ubuntu for a while now in distributed mode (4 node cluster), just playing around with it. Going forward, I am planning to add more nodes to the cluster. Which Linux flavor is the best to run Hadoop on? Please let me know. Regards, Praveen

hadoop eclipse plugin

2009-11-02 Thread Le Zhao
Hi All (& sorry for possible double posting), Does anybody know whether the hadoop eclipse plugin is still supported? I've tried using the 0.18.0 and 0.18.3 plugins to talk to the hadoop 0.18.0 virtual machine, or an installed hadoop 0.18.3. All trials have been unsuccessful. Btw, I closely

Join me as a member

2009-11-02 Thread Mohan Agarwal

RE: Multiple Input Paths

2009-11-02 Thread Mark Vigeant
Ok, thank you very much Amogh, I will redesign my program. -Original Message- From: Amogh Vasekar [mailto:am...@yahoo-inc.com] Sent: Monday, November 02, 2009 11:45 AM To: common-user@hadoop.apache.org Subject: Re: Multiple Input Paths Mark, Set-up for a mapred job consumes a considerable

Re: Multiple Input Paths

2009-11-02 Thread Amogh Vasekar
Mark, Set-up for a mapred job consumes a considerable amount of time and resources, so a single job is preferred if possible. You can add multiple paths to your job, and if you need different processing logic depending upon the input being consumed, you can use the parameter map.input.file in your
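
A rough sketch of the approach Amogh describes, in the old mapred API (the "formatA" marker is made up for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class CombinedMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        private boolean isFormatA;

        public void configure(JobConf job) {
            // map.input.file holds the path of the file this task is reading,
            // so one mapper can branch on which input it is consuming.
            isFormatA = job.get("map.input.file", "").contains("formatA");
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            if (isFormatA) {
                // parse format A records here
            } else {
                // parse format B records here
            }
        }
    }

In the driver, both directories go to the same job via two FileInputFormat.addInputPath(conf, new Path(...)) calls.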

Re: dfs health Dashboard

2009-11-02 Thread Allen Wittenauer
On 11/1/09 10:11 PM, "V Satish Kumar" wrote: > I have noticed that the dfs health dashboard (running on port 50070) takes a long time to refresh the number of live nodes and dead nodes. Is there a config parameter in hadoop that can be changed to make the dashboard show these changes more
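
Allen's reply is cut off above. For what it's worth, the knob usually pointed at for this in 0.18/0.20-era HDFS is the namenode's heartbeat recheck interval; treat the property name and its exact effect on the dead-node timeout as an assumption to verify against your version:

    <!-- hdfs-site.xml (namenode): how often the namenode rechecks for
         expired heartbeats, in milliseconds; the dead-node timeout is
         derived from this plus the heartbeat interval. Verify against
         your Hadoop version before relying on it. -->
    <property>
      <name>heartbeat.recheck.interval</name>
      <value>300000</value>
    </property>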

RE: Multiple Input Paths

2009-11-02 Thread Mark Vigeant
Yes, the structure is similar. They're both XML log files documenting the same set of data, just in different ways. That's a really cool idea though, to combine them. How exactly would I go about doing that? -Original Message- From: L [mailto:archit...@galatea.com] Sent: Monday, November

Re: Multiple Input Paths

2009-11-02 Thread L
Mark, Is the structure of both files the same? It makes even more sense to combine the files if you can, as I have seen a considerable speedup when I've done that (at least when I've had small files to deal with). Lajos Mark Vigeant wrote: Hey, quick question: I'm writing a program that
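
Lajos's message is truncated above; as one answer to Mark's "how exactly" question, here is a naive sketch of combining two XML logs while keeping the result well formed, by dropping each file's prolog and root tags and wrapping everything in a new root (the file and tag names are invented):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class XmlConcat {
        public static void main(String[] args) throws IOException {
            PrintWriter out = new PrintWriter(new FileWriter("combined.xml"));
            out.println("<combined>");
            for (String name : new String[] {"a.xml", "b.xml"}) {
                BufferedReader in = new BufferedReader(new FileReader(name));
                String line;
                while ((line = in.readLine()) != null) {
                    String t = line.trim();
                    // drop each file's XML prolog and per-file root element
                    if (t.startsWith("<?xml") || t.equals("<log>") || t.equals("</log>")) {
                        continue;
                    }
                    out.println(line);
                }
                in.close();
            }
            out.println("</combined>");
            out.close();
        }
    }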

Multiple Input Paths

2009-11-02 Thread Mark Vigeant
Hey, quick question: I'm writing a program that parses data from 2 different files and puts the data into a table. Currently I have 2 different map functions, so I submit 2 separate jobs to the job client. Would it be more efficient to add both paths to the same mapper and only submit one job

Re: too many 100% mapper does not complete / finish / commit

2009-11-02 Thread Jason Venner
Nominally, when the map is done, close() is fired, all framework-opened output files are flushed, the task waits for the acks from the block-hosting datanodes, and then the output committer stages files into the task output directory. It sounds like there may be an issue with the close
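
Jason's point suggests checking anything the task opens itself. As a sketch in the old mapred API (the side file and its path are illustrative, not taken from the thread), any stream a mapper opens must be closed in close(), or the task can sit at 100% waiting on unflushed blocks:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class FetchMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        private FSDataOutputStream side;

        public void configure(JobConf job) {
            try {
                // hypothetical side-effect file opened by the task itself
                side = FileSystem.get(job)
                        .create(new Path("/tmp/side-" + job.get("mapred.task.id")));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            side.writeBytes(value.toString() + "\n");
            out.collect(new Text("fetched"), value);
        }

        public void close() throws IOException {
            // Close everything the task opened; otherwise the last block of
            // the side file may never be flushed or acknowledged.
            side.close();
        }
    }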

Re: too many 100% mapper does not complete / finish / commit

2009-11-02 Thread Amandeep Khurana
Replies inline. On Mon, Nov 2, 2009 at 3:15 AM, Zhang Bingjun (Eddy) wrote: > Dear Khurana, We didn't use MapRunnable. Instead, we directly used org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper and passed our normal Mapper class to it using its setMapperClass() interface. We se

Re: too many 100% mapper does not complete / finish / commit

2009-11-02 Thread Zhang Bingjun (Eddy)
Dear Khurana, We didn't use MapRunnable. Instead, we directly used org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper and passed our normal Mapper class to it using its setMapperClass() interface. We set the number of threads using its setNumberOfThreads(). Is this the correct way
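
Eddy's question is truncated above. A minimal sketch of how that wiring usually looks in the 0.20 mapreduce API (CrawlMapper is a stand-in for the real crawl logic; note the mapper class is handed over via the setter, setMapperClass):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

    public class ThreadedCrawlSetup {
        // Stub standing in for the real single-threaded crawl logic.
        public static class CrawlMapper
                extends Mapper<LongWritable, Text, Text, Text> { }

        public static Job build(Configuration conf) throws Exception {
            Job job = new Job(conf, "crawl");
            // MultithreadedMapper is the mapper the framework runs; it fans
            // records out to a pool of threads, each running CrawlMapper.
            job.setMapperClass(MultithreadedMapper.class);
            MultithreadedMapper.setMapperClass(job, CrawlMapper.class);
            MultithreadedMapper.setNumberOfThreads(job, 10);
            job.setNumReduceTasks(0); // map-only crawl, as in the original post
            return job;
        }
    }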

Re: too many 100% mapper does not complete / finish / commit

2009-11-02 Thread Zhang Bingjun (Eddy)
Hi all, An important observation: the 100% mappers that never complete all have temporary files of exactly 64MB, which means the output of each such mapper is cut off at the block boundary. However, we do have some successfully completed mappers with output files larger than 64MB, and we also have less than

Re: too many 100% mapper does not complete / finish / commit

2009-11-02 Thread Amandeep Khurana
On Mon, Nov 2, 2009 at 2:40 AM, Zhang Bingjun (Eddy) wrote: > Hi Pallavi, Khurana, and Vasekar, thanks a lot for your reply. To follow up, the mapper we are using is the multithreaded mapper. How are you doing this? Did you write your own MapRunnable? > To answer your questions: > Pallavi,

Re: too many 100% mapper does not complete / finish / commit

2009-11-02 Thread Zhang Bingjun (Eddy)
Hi Pallavi, Khurana, and Vasekar, Thanks a lot for your reply. To follow up, the mapper we are using is the multithreaded mapper. To answer your questions: Pallavi, Khurana: I have checked the logs. The key it got stuck on is the last key it read in. Since the progress is 100%, I suppose the key i

Re: too many 100% mapper does not complete / finish / commit

2009-11-02 Thread Amogh Vasekar
Hi, Quick questions: Are you creating too many small files? Are there any task side files being created? Does the NN heap have enough space to hold the metadata? Any details on its general health will probably be helpful to people on the list. Amogh On 11/2/09 2:02 PM, "Zhang Bingjun (Eddy)

RE: too many 100% mapper does not complete / finish / commit

2009-11-02 Thread Palleti, Pallavi
Hi Eddy, I faced a similar issue when I used a pig script to fetch webpages for certain urls. I could see the map phase showing 100% while it was still running. As I was logging the page currently being fetched, I could see the process hadn't finished yet. It might be the same issue. So, you can

Re: too many 100% mapper does not complete / finish / commit

2009-11-02 Thread Amandeep Khurana
Did you try adding any logging to see what keys they are getting stuck on, or what the last key processed was? Do the same number of mappers get stuck every time? Not having reducers is not a problem; it's pretty normal to do that. On Mon, Nov 2, 2009 at 12:32 AM, Zhang Bingjun (Eddy) wrote: > D
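
A tiny sketch of the kind of logging Amandeep means, in the 0.20 mapreduce API (the mapper name and message are made up):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LoggingCrawlMapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable key, Text url, Context context)
                throws IOException, InterruptedException {
            // Surface the key currently being worked on: the status line shows
            // up in the web UI, and stderr lands in the task's userlogs, so a
            // wedged task reveals what it was last chewing on.
            context.setStatus("fetching " + url);
            System.err.println("fetching " + url);
            // ... fetch and emit ...
        }
    }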

too many 100% mapper does not complete / finish / commit

2009-11-02 Thread Zhang Bingjun (Eddy)
Dear hadoop fellows, We have been using Hadoop-0.20.1 MapReduce to crawl some web data. In this case, we only have mappers, which crawl data and save it into HDFS in a distributed way. No reducer is specified in the job conf. The problem is that for every job about one third of the mappers get stuck