Re: hadoop -put command

2012-12-26 Thread Ted Dunning
The colon is a reserved character in a URI according to RFC 3986 [1]. You should be able to percent-encode those colons as %3A.
[1] http://tools.ietf.org/html/rfc3986
On Wed, Dec 26, 2012 at 1:00 PM, Mohit Anchlia wrote:
> It looks like hadoop fs -put command doesn't like ":" in the file names.
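As a rough illustration of the percent-encoding suggested above, the following Python sketch encodes the colons in the file name from the original question. It only shows what the encoded name looks like; whether hadoop fs -put accepts the encoded form is not confirmed in this thread. The helper name encode_colons is hypothetical.

```python
from urllib.parse import quote


def encode_colons(file_name: str) -> str:
    """Percent-encode reserved characters (including ':') in one path segment."""
    # safe="" leaves no reserved character unescaped, so ':' becomes %3A.
    # Letters, digits, '.', '-', '_' and '~' are never escaped by quote().
    return quote(file_name, safe="")


if __name__ == "__main__":
    # File name taken from the original question in this thread.
    print(encode_colons("hjob.2012:12:26:11.0.dat"))
```

Only the file name is encoded here, not the whole path, so the "/" separators of the directory part are left alone.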

hadoop -put command

2012-12-26 Thread Mohit Anchlia
It looks like the hadoop fs -put command doesn't like ":" in file names. Is there a way I can escape it?
hadoop fs -put /home/mapr/p/hjob.2012:12:26:11.0.dat /user/apuser/temp-qdc/scratch/merge_jobs
put: java.net.URISyntaxException: Relative path in absolute URI: hjob.2012:12:26:11.0.dat

Re: Setting number of mappers in Teragen

2012-12-26 Thread anil gupta
Hi Harsh,
Fixed it. I was putting -Dmapred.map.tasks=20 after specifying the input directory. I completely forgot about this trick of Hadoop's GenericOptionsParser. Thanks a lot. :)
On Wed, Dec 26, 2012 at 10:33 AM, Harsh J wrote:
> The MR1 teragen's mappers # depends on the total number of

Re: Setting number of mappers in Teragen

2012-12-26 Thread Harsh J
The number of mappers in MR1 teragen depends on the total number of rows and the requested number of maps. How exactly are you passing -Dmapred.map.tasks=20 (no spaces)? All generic options must come before any other options, so it should appear right after the word "teragen" in your command.
On Wed, Dec 26, 20
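The ordering rule described above can be sketched as a small Python helper that assembles the argument list with the -D generic options placed directly after "teragen". This is only an illustration: the helper name build_teragen_command, the jar name hadoop-examples.jar, the row count, and the output path are all hypothetical and not taken from the thread.

```python
from typing import Dict, List


def build_teragen_command(jar: str, generic_opts: Dict[str, str],
                          rows: int, out_dir: str) -> List[str]:
    """Return an argv list with -D generic options right after 'teragen'."""
    # Each option is written as -Dkey=value with no spaces, as the reply notes.
    d_flags = [f"-D{key}={value}" for key, value in generic_opts.items()]
    # Generic options come first; teragen's own arguments follow them.
    return ["hadoop", "jar", jar, "teragen", *d_flags, str(rows), out_dir]


if __name__ == "__main__":
    # Jar name, row count and output directory are placeholders (assumptions).
    cmd = build_teragen_command(
        "hadoop-examples.jar", {"mapred.map.tasks": "20"}, 1000, "/tmp/teragen-out"
    )
    print(" ".join(cmd))
```

Building the command as a list keeps the position of each argument explicit, which is the point of the advice: the option is not wrong, only misplaced.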

Re: good way to debug map reduce code

2012-12-26 Thread SUJIT PAL
Hi Jamal,
A missing semicolon should get flagged by the Java compiler, but one way to keep your debug cycles short is to (1) use local mode and (2) use small data sets that you can run through in under a minute. Once you are happy that your code works, move to distributed mode and your target data sets. HTH

Re: why not hadoop backup name node data to local disk daily or hourly?

2012-12-26 Thread Robert Dyer
I actually have this exact same error. After running my namenode for a while (with an SNN), it gets to a point where the SNN starts crashing, and if I try to restart the NN I get this problem. I typically wind up having to go with a much older copy of the image and edits files in order to get i

Re: Map Shuffle Bytes

2012-12-26 Thread Eduard Skaley
For this I need to know where an inputsplit is located and where a join is computed. How can I do this programmatically?
> This isn't called 'shuffle' (but rather a plain remote read) so your original question was confusing, thanks for clarifying! In that case, you could count the bytes coming i

Re: How to estimate hadoop.tmp.dir disk space

2012-12-26 Thread centerqi hu
How much disk space is needed for hadoop.tmp.dir? Thanks.
2012/12/26 Harsh J
> The hadoop.tmp.dir is a local directory, usually defaulting to under
> /tmp/ and is thereby limited by that mount's space, not the HDFS
> space.
>
> On Wed, Dec 26, 2012 at 1:25 PM, centerqi hu wrote:
> > hi all

Re: distributed cache

2012-12-26 Thread Harsh J
Hi,
Sorry for having been ambiguous. For (1) I meant a large block (if the block size is large). For (2) I meant multiple, concurrent threads.
On Wed, Dec 26, 2012 at 5:36 PM, Lin Ma wrote:
> Thanks Harsh,
>
> For long read, you mean read a large continuous part of a file, other than a
> small c

Re: distributed cache

2012-12-26 Thread Lin Ma
Thanks Harsh,
1. By "long read", do you mean reading a large contiguous part of a file, rather than a small chunk of a file?
2. "gradually decreasing performance for long reads" -- you mean parallel multiple threads long read degrade performance? Or single thread exclusive long read degrade

Re: Map Shuffle Bytes

2012-12-26 Thread Harsh J
This isn't called 'shuffle' (but rather a plain remote read) so your original question was confusing, thanks for clarifying! In that case, you could count the bytes coming in from the required record reader - for example a TextRecordReader uses a Long key that denotes current offset in file, which

Re: distributed cache

2012-12-26 Thread Harsh J
Hi Lin, It is comparable (and is also logically similar) to reading a file multiple times in parallel in a local filesystem - not too much of a performance hit for small reads (by virtue of OS caches, and quick completion per read, as is usually the case for distributed cache files), and gradually
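The local-filesystem analogy in this reply can be tried out with a short Python sketch: several threads open and read the same file at once, and none of them needs exclusive access. This says nothing about HDFS internals or actual throughput; the helper names read_all and concurrent_read_sizes and the temporary file are assumptions made for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import List


def read_all(path: str) -> int:
    """Open the file independently and return the number of bytes read."""
    with open(path, "rb") as handle:
        return len(handle.read())


def concurrent_read_sizes(path: str, readers: int = 4) -> List[int]:
    """Read the same file from several threads at once; return each byte count."""
    with ThreadPoolExecutor(max_workers=readers) as pool:
        return list(pool.map(read_all, [path] * readers))


if __name__ == "__main__":
    # A small temporary file stands in for a localized cache file.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(b"x" * 4096)
        print(concurrent_read_sizes(path))
    finally:
        os.remove(path)
```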

Re: Map Shuffle Bytes

2012-12-26 Thread Eduard Skaley
Hi,
I mean TO the mappers. I'm using the CompositeInputFormat for my application to compute map-side joins. I want to join two datasets, A and B; one is stored on node 1 and the other on node 2. For example if the join will be computed on node 2 then the inputsplit of the dataset which is st

Re: distributed cache

2012-12-26 Thread Lin Ma
Thanks Harsh,
Are multiple concurrent reads generally faster?
regards, Lin
On Wed, Dec 26, 2012 at 6:21 PM, Harsh J wrote:
> There is no limitation in HDFS that limits reads of a block to a
> single client at a time (no reason to do so) - so downloads can be as
> concurrent as possible.
>
> On

Re: distributed cache

2012-12-26 Thread Harsh J
There is no limitation in HDFS that limits reads of a block to a single client at a time (no reason to do so) - so downloads can be as concurrent as possible.
On Wed, Dec 26, 2012 at 3:41 PM, Lin Ma wrote:
> Thanks Harsh,
>
> Supposing DistributedCache is uploaded by client, for each replica, in

Re: distributed cache

2012-12-26 Thread Lin Ma
Thanks Harsh, Supposing DistributedCache is uploaded by client, for each replica, in Hadoop design, it could only serve one download session (download from a mapper or a reducer which requires the DistributedCache) at a time until DistributedCache file download is completed, or it could serve mult

Re: distributed cache

2012-12-26 Thread Harsh J
Hi Lin,
DistributedCache files are first stored on HDFS by the client. The TaskTrackers then download and localize them. Therefore, as with any other file on HDFS, "downloads" can be efficiently parallel with higher replicas. The point of having higher replication for these files is also tied to t

Re: Map Shuffle Bytes

2012-12-26 Thread Harsh J
Hi, What do you mean by "shuffled bytes [to] the mappers"? If you mean "from", it is "Reduce shuffle bytes" you look for; otherwise, you may be looking for the per-map counter of "Map output bytes". Per-partition counters can be constructed on the user side if needed, by pre-computing the partiti

Re: How to estimate hadoop.tmp.dir disk space

2012-12-26 Thread Harsh J
The hadoop.tmp.dir is a local directory, usually defaulting to under /tmp/ and is thereby limited by that mount's space, not the HDFS space.
On Wed, Dec 26, 2012 at 1:25 PM, centerqi hu wrote:
> hi all
> I encountered trouble
>
> Message: org.apache.hadoop.ipc.RemoteException: java.io.IOExceptio
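Since the limit comes from the local mount rather than from HDFS, one way to see how much room is left is to query the filesystem that holds the directory. The Python sketch below does that with the standard library; the helper name free_bytes is hypothetical, and the system temp directory is used only as a stand-in for wherever hadoop.tmp.dir actually points on a given node.

```python
import shutil
import tempfile


def free_bytes(path: str) -> int:
    """Return the free bytes on the local filesystem (mount) that holds `path`."""
    # disk_usage reports total, used and free space for the containing mount.
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    # Stand-in path (assumption): replace with the node's real hadoop.tmp.dir.
    tmp_dir = tempfile.gettempdir()
    print(f"{tmp_dir}: {free_bytes(tmp_dir) / 1024 ** 3:.1f} GiB free")
```

This reports the same per-mount figure that df -h shows, so it has to be run on each node separately.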

Re: good way to debug map reduce code

2012-12-26 Thread Harsh J
For Java MR jobs, there is Apache MRUnit, which provides a good way of writing test cases. See http://mrunit.apache.org
On Wed, Dec 26, 2012 at 7:26 AM, jamal sasha wrote:
> Hi,
> I have been using python hadoop streaming framework to write the code and
> now I am slowly moving towards the core j

Re: How to estimate hadoop.tmp.dir disk space

2012-12-26 Thread Nitin Pawar
Do you have mounted drives, as in a JBOD setup, where you have allocated a few drives to HDFS? Check df -h on all the nodes; the mount that holds the logs or any other data outside DFS may be full.
On Wed, Dec 26, 2012 at 1:25 PM, centerqi hu wrote:
> hi all
>