Re: suggest Best way to upload xml files to HDFS

2012-07-12 Thread Harsh J
If you're looking at automated file/record/event collection, take a look at Apache Flume: http://incubator.apache.org/flume/. It handles distributed collection well and is very configurable. Otherwise, write a scheduled script to do the uploads every X period (your choice). Consider usin
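For the scheduled-script route, here is a minimal sketch of an uploader that could be run from cron every few minutes. It is only an illustration of the suggestion above: the local drop directory, the HDFS target directory, and the use of the hadoop fs -put CLI are assumptions, not details from the original thread.

#!/usr/bin/env python
# Sketch: push new local XML files into HDFS, then move them aside locally.
# Assumes the 'hadoop' command is on PATH and that all directories exist.
import os
import subprocess

LOCAL_DIR = "/data/incoming/xml"       # hypothetical local drop directory
HDFS_DIR = "/user/hadoop/xml"          # hypothetical HDFS target directory
DONE_DIR = "/data/incoming/xml/done"   # uploaded files are parked here

for name in os.listdir(LOCAL_DIR):
    local_path = os.path.join(LOCAL_DIR, name)
    if not (name.endswith(".xml") and os.path.isfile(local_path)):
        continue
    # 'hadoop fs -put' copies the local file into the HDFS directory.
    rc = subprocess.call(["hadoop", "fs", "-put", local_path, HDFS_DIR])
    if rc == 0:
        os.rename(local_path, os.path.join(DONE_DIR, name))

A crontab entry along the lines of "*/15 * * * * /path/to/uploader.py" would give the "every X period" behaviour.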

Re: Jobs randomly not starting

2012-07-12 Thread Harsh J
Hey Robert, Any chance you can pastebin the JT logs, grepped for the bad job ID, and send the link across? They shouldn't hang the way you describe. On Fri, Jul 13, 2012 at 9:33 AM, Robert Dyer wrote: > I'm using Hadoop 1.0.3 on a small cluster (1 namenode, 1 jobtracker, 2 > compute nodes). My

Re: suggest Best way to upload xml files to HDFS

2012-07-12 Thread Manoj Babu
Hi, Could you kindly provide the pros and cons of the MultiFile, CombineFile and SequenceFile input formats? Thanks in advance. Cheers! Manoj. On Fri, Jul 13, 2012 at 10:15 AM, Bejoy KS wrote: > Hi Manoj > > If you are looking at a scheduler and a work flow manager to carry out > this task you

Re: suggest Best way to upload xml files to HDFS

2012-07-12 Thread Bejoy KS
Hi Manoj, If you are looking at a scheduler and a workflow manager to carry out this task you can have a look at Oozie. If your XML files are small (smaller than the HDFS block size) then it is definitely better practice to combine them to form larger files. Combining into Sequence Files should
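As a rough illustration of the "combine small files into larger files" idea, here is a sketch that packs a batch of small local XML files into one larger file before uploading it. This is not the Sequence File variant Bejoy mentions (that is usually written with Hadoop's Java SequenceFile.Writer API); all paths below are assumptions.

#!/usr/bin/env python
# Sketch: concatenate many small XML files into one larger file so HDFS
# does not end up holding thousands of tiny files. Paths are hypothetical.
import glob
import subprocess

SMALL_FILES = "/data/incoming/xml/*.xml"    # hypothetical small input files
COMBINED = "/tmp/xml_batch_001.combined"    # hypothetical combined output
HDFS_DIR = "/user/hadoop/xml-combined"      # hypothetical HDFS target

with open(COMBINED, "w") as out:
    for path in sorted(glob.glob(SMALL_FILES)):
        with open(path) as f:
            data = f.read()
        out.write(data)
        if not data.endswith("\n"):
            out.write("\n")   # keep a boundary between documents

subprocess.call(["hadoop", "fs", "-put", COMBINED, HDFS_DIR])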

Re: Jobs randomly not starting

2012-07-12 Thread Bejoy KS
Hi Robert, It could be because there are no free slots available in your cluster at job submission time to launch those tasks. Some other tasks may have already occupied the map/reduce slots. When you experience this random issue, please verify whether there are free task slots available.

Jobs randomly not starting

2012-07-12 Thread Robert Dyer
I'm using Hadoop 1.0.3 on a small cluster (1 namenode, 1 jobtracker, 2 compute nodes). My input is a sequence file of around 280 MB. Generally, my jobs run just fine and all finish in 2-5 minutes. However, quite randomly the jobs refuse to run. They submit and appear when running 'hadoop jo

Re: StreamXMLRecordReader

2012-07-12 Thread Harsh J
Hi Siv, Moving this to mapreduce-user@; please use developer lists only for project development questions. The class StreamXmlRecordReader is still present in 2.0.0 and will continue to be present. Why do you say it has been removed? It now resides in the hadoop-tools jar (which is what carries s

RE: Extra output files from mapper ?

2012-07-12 Thread Connell, Chuck
This works. Thank you. -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: Thursday, July 12, 2012 10:58 AM To: mapreduce-user@hadoop.apache.org Subject: Re: Extra output files from mapper ? Chuck, Note that the regular file opens from within an MR program (be it strea

hbase map reduce is taking a lot of time

2012-07-12 Thread syed kather
Team, I have written a MapReduce program; the scenario is that it emits one record per (user, seqid) combination. Total no. of users: 825. Total no. of seqids: 6,583,100. So the number of records the map phase will emit is 825 * 6,583,100. I have an HBase table called ObjectSequence which consists of 6,583,100 rows. I used TableMapper and

Re: Extra output files from mapper ?

2012-07-12 Thread Harsh J
You can ship the module along with a symlink and have Python auto-import it since "." is always on PATH? I can imagine that helping you get Pydoop on a cluster without Pydoop on all nodes (or other libs). On Thu, Jul 12, 2012 at 11:08 PM, Connell, Chuck wrote: > Thanks yet again. Since my goal is
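A sketch of that idea: a hypothetical helper module mymodule.py is shipped to every task with streaming's -file option, and because the mapper script and the shipped module both land in the task's working directory, the plain import resolves without installing anything on the nodes.

#!/usr/bin/env python
# mapper.py -- streaming mapper that uses a helper module shipped with the job.
# Launched with something like (paths are hypothetical):
#   hadoop jar hadoop-streaming.jar \
#     -input indir -output outdir \
#     -mapper mapper.py \
#     -file mapper.py -file mymodule.py
import sys
import mymodule   # hypothetical module shipped alongside the mapper

for line in sys.stdin:
    key, value = mymodule.parse(line)   # hypothetical helper function
    print "%s\t%s" % (key, value)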

RE: Extra output files from mapper ?

2012-07-12 Thread Connell, Chuck
Thanks yet again. Since my goal is to run an existing Python program, as is, under MR, it looks like I need the os.system(copy-local-to-hdfs) technique. Chuck -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: Thursday, July 12, 2012 1:15 PM To: mapreduce-user@hadoop.

Re: Extra output files from mapper ?

2012-07-12 Thread Harsh J
Unfortunately Python does not recognize hdfs:// URIs. It isn't a standard like HTTP is, so to say, at least not yet :) You can instead use Pydoop's HDFS APIs, though: http://pydoop.sourceforge.net/docs/api_docs/hdfs_api.html#hdfs-api. Pydoop authors are pretty active and do releases from time to tim
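For concreteness, a minimal sketch of writing a file into HDFS with Pydoop's hdfs module rather than the builtin open(); the path is hypothetical and it assumes Pydoop is installed on the node where the code runs.

#!/usr/bin/env python
# Sketch: create a file in HDFS using Pydoop's HDFS API instead of open().
import pydoop.hdfs as hdfs

f = hdfs.open("/tmp/out1.txt", "w")   # hypothetical HDFS path
f.write("hello from pydoop\n")
f.close()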

Re: How to use CombineFileInputFormat in Hadoop?

2012-07-12 Thread Manoj Babu
Ya Harsh, it's been posted 2 months back with no response; if you google for CombineFileInputFormat you will surely see it. MapReduce user group rocks! Thanks for the response. On 12 Jul 2012 21:25, "Harsh J" wrote:

RE: Extra output files from mapper ?

2012-07-12 Thread Connell, Chuck
Thank you. I will try that. A related question... Shouldn't I just be able to create HDFS files directly from a Python open statement, when running within MR, like this? It does not seem to work as intended. outfile1 = open("hdfs://localhost/tmp/out1.txt", 'w') Chuck -Original Message-

Re: How to use CombineFileInputFormat in Hadoop?

2012-07-12 Thread Harsh J
Hey Manoj, I find the asker name here quite strange, although it is the same question, ha: http://stackoverflow.com/questions/10380200/how-to-use-combinefileinputformat-in-hadoop Anyhow, here's one example: http://blog.yetitrails.com/2011/04/dealing-with-lots-of-small-files-in.html On Thu, Jul

How to use CombineFileInputFormat in Hadoop?

2012-07-12 Thread Manoj Babu
Gentles, I want to use the CombineFileInputFormat of Hadoop 0.20.0 / 0.20.2 such that it processes 1 file per record and also doesn't compromise on data locality (which it normally takes care of). It is mentioned in Tom White's Hadoop: The Definitive Guide, but he has not shown how to do it. Instead,

Re: Extra output files from mapper ?

2012-07-12 Thread Harsh J
Chuck, Note that regular file opens from within an MR program (be it streaming or Java) will create files on the local file system of the node the task executed on. Hence, at the end of your script, move them to HDFS after closing them. Something like: os.system("hadoop fs -put outfi
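A fuller sketch of that pattern, matching the shape of Chuck's test case: write the outputs as ordinary local files, close them, then push them into HDFS in one go. The file names and the /tmp HDFS target are assumptions.

#!/usr/bin/env python
# Sketch: map-only streaming task that writes two local files, closes them,
# and then copies both into HDFS. Names and paths are hypothetical.
import os

outfile1 = open("out1.txt", "w")   # created on the task node's local disk
outfile2 = open("out2.txt", "w")
outfile1.write("first output\n")
outfile2.write("second output\n")
outfile1.close()
outfile2.close()

# Only after closing, move the finished files into HDFS.
os.system("hadoop fs -put out1.txt out2.txt /tmp/")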

RE: Extra output files from mapper ?

2012-07-12 Thread Connell, Chuck
Here is a test case... The Python code (file_io.py) that I want to run as a map-only job is below. It takes one input file (not stdin) and creates two output files (not stdout).

#!/usr/bin/env python
import sys
infile = open(sys.argv[1], 'r')
outfile1 = open(sys.argv[2], 'w')
outfile2 = open(

Difference between Nutch crawl giving depth='N' and crawling in loop N times with depth='1'

2012-07-12 Thread ashish vyas
Background of my problem: I am running Nutch 1.4 on Hadoop 0.20.203. There is a series of MapReduce jobs that I am performing on Nutch segments to get the final output. But waiting for the whole crawl to finish before running MapReduce causes the solution to run for a longer time. I am now triggering MapReduce jobs