Re: Limitation of key-value pairs for a particular key.

2013-01-10 Thread Vinod Kumar Vavilapalli
There isn't any limit like that. Can you reproduce this consistently? If so, please file a ticket. It will definitely help if you can provide a test case which can reproduce this issue. Thanks, +Vinod On Thu, Jan 10, 2013 at 12:41 AM, Utkarsh Gupta utkarsh_gu...@infosys.com wrote: Hi,

Re: JobCache directory cleanup

2013-01-10 Thread Ivan Tretyakov
Thanks for the replies! Hemanth, I could see the following exception in the TaskTracker log: https://issues.apache.org/jira/browse/MAPREDUCE-5 But I'm not sure if it is related to this issue. Now, when a job completes, the directories under the jobCache must get automatically cleaned up. However it doesn't

Re: JobCache directory cleanup

2013-01-10 Thread Hemanth Yamijala
Hi, On Thu, Jan 10, 2013 at 5:17 PM, Ivan Tretyakov itretya...@griddynamics.com wrote: Thanks for the replies! Hemanth, I could see the following exception in the TaskTracker log: https://issues.apache.org/jira/browse/MAPREDUCE-5 But I'm not sure if it is related to this issue. Now, when a job

Re: Not committing output in map reduce

2013-01-10 Thread Hemanth Yamijala
Is this the same as: http://stackoverflow.com/questions/6137139/how-to-save-only-non-empty-reducers-output-in-hdfs? i.e. LazyOutputFormat, etc. ? On Thu, Jan 10, 2013 at 4:51 PM, Pratyush Chandra chandra.praty...@gmail.com wrote: Hi, I am using s3n as file system. I do not wish to create
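For reference, a minimal sketch (not taken from the thread; the job name and input/output paths are placeholders) of the LazyOutputFormat approach linked above, which defers creating output files until the first record is written, so empty reducers leave no part files:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LazyOutputJob {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "lazy-output-example");
        job.setJarByClass(LazyOutputJob.class);
        // Identity mapper/reducer are used for brevity; the output-format
        // wiring is the point of this example.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        // Defer part-file creation until the first record is written, so
        // reducers that emit nothing do not create empty files on s3n/HDFS.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}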

Re: could only be replicated to 0 nodes instead of minReplication

2013-01-10 Thread Ivan Tretyakov
I also found the following exception in the datanode log; I suppose it might give some clue: 2013-01-10 11:37:55,397 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: node02.303net.pvt:50010:DataXceiver error processing READ_BLOCK operation src: /192.168.1.112:35991 dest: /192.168.1.112:50010

Re: JobCache directory cleanup

2013-01-10 Thread Artem Ervits
As soon as the job completes, your jobCache should be cleared. Check your mapred-site.xml for the mapred.local.dir setting and make sure the job cleanup step is successful in the web UI. Setting your job's intermediate output compression to true will keep the jobCache folder smaller. Artem Ervits Data Analyst

Re: Not committing output in map reduce

2013-01-10 Thread Pratyush Chandra
Yes, it worked. Thanks Pratyush On Thu, Jan 10, 2013 at 6:14 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Is this the same as: http://stackoverflow.com/questions/6137139/how-to-save-only-non-empty-reducers-output-in-hdfs? i.e. LazyOutputFormat, etc. ? On Thu, Jan 10, 2013 at

Re: could only be replicated to 0 nodes instead of minReplication

2013-01-10 Thread Robert Molina
Hi Ivan, Here are a couple more suggestions provided by the wiki: http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo Regards, Robert On Thu, Jan 10, 2013 at 5:33 AM, Ivan Tretyakov itretya...@griddynamics.com wrote: I also found the following exception in the datanode log; I suppose it might give

Re: JobCache directory cleanup

2013-01-10 Thread Vinod Kumar Vavilapalli
Can you check the job configuration for these ~100 jobs? Do they have keep.failed.task.files set to true? If so, these files won't be deleted. If not, it could be a bug. Sharing your configs for these jobs will definitely help. Thanks, +Vinod On Wed, Jan 9, 2013 at 6:41 AM, Ivan
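As an aside, a small sketch (the configuration file path is a placeholder) of how to check the properties mentioned in this thread from the client side:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class JobCacheConfigCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml")); // hypothetical path
        // If true, task files under the jobCache are deliberately retained
        // for debugging and will not be cleaned up after the job finishes.
        boolean keep = conf.getBoolean("keep.failed.task.files", false);
        System.out.println("keep.failed.task.files = " + keep);
        // The jobCache directories live under mapred.local.dir on each TaskTracker.
        System.out.println("mapred.local.dir = " + conf.get("mapred.local.dir"));
    }
}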

Re: Issue in Apache Hadoop Documentation

2013-01-10 Thread Vinod Kumar Vavilapalli
Oh, and user@ is the correct mailing list. +Vinod On Thu, Jan 10, 2013 at 9:32 AM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: Great catch, it's a shame it still exists in 1.* stable releases! Can you please file a ticket and fix it, thanks! +Vinod On Thu, Jan 10, 2013 at

Re: hortonworks install fail

2013-01-10 Thread Hitesh Shah
Hi, ambari-user@ is probably the better list for this. It seems like your puppet command is timing out. Could you reply with the contents of the /var/log/puppet_apply.log from the node in question? Also, it might be worth waiting a few days for the next release of Ambari, which should

Re: WEBHDFS API GETCONTENTSUMMARY issue

2013-01-10 Thread Arpit Gupta
Rodrigo, GETCONTENTSUMMARY will return the summary of everything under the path you specified, even the subdirectories. So I would suggest taking a look at the directories and seeing what content they have; then the numbers should add up. -- Arpit Gupta Hortonworks Inc. http://hortonworks.com/

Re: WEBHDFS API GETCONTENTSUMMARY issue

2013-01-10 Thread Arpit Gupta
Forgot to mention that the path you are using in the API call will also count towards the directory count. -- Arpit Gupta Hortonworks Inc. http://hortonworks.com/ On Jan 10, 2013, at 3:55 PM, Arpit Gupta ar...@hortonworks.com wrote: Rodrigo, GETCONTENTSUMMARY will return the summary of
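For anyone following along, a rough sketch (host, port and path are placeholders) of calling the GETCONTENTSUMMARY operation over the WebHDFS REST API; the JSON response carries directoryCount, fileCount and length, and the queried path itself is included in directoryCount as noted above:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ContentSummaryCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL(
            "http://namenode.example.com:50070/webhdfs/v1/user/data?op=GETCONTENTSUMMARY");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // Print the raw JSON ContentSummary returned by the NameNode.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}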

Sub-queues in capacity scheduler

2013-01-10 Thread Patai Sangbutsarakum
The Capacity Scheduler's hierarchical queues are exactly what we're looking for. I really wonder if it is possible (and how) to make them work in CDH3u4. -P

Re: JobCache directory cleanup

2013-01-10 Thread Hemanth Yamijala
Good point. Forgot that one :-) On Thu, Jan 10, 2013 at 10:53 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: Can you check the job configuration for these ~100 jobs? Do they have keep.failed.task.files set to true? If so, these files won't be deleted. If not, it could

HDFS disk space requirement

2013-01-10 Thread Panshul Whisper
Hello, I have a Hadoop cluster of 5 nodes with a total available HDFS space of 130 GB and replication set to 5. I have a file of 115 GB, which needs to be copied to HDFS and processed. Do I need to have any more HDFS space to perform all processing without running into any problems? Or is

Re: HDFS disk space requirement

2013-01-10 Thread பாலாஜி நாராயணன்
If the replication factor is 5, you will need at least 5x the size of the file. So this is not going to be enough. On Thursday, January 10, 2013, Panshul Whisper wrote: Hello, I have a Hadoop cluster of 5 nodes with a total available HDFS space of 130 GB and replication set to 5. I have a
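A quick back-of-the-envelope sketch of the space math in this thread (115 GB of data against 130 GB of raw HDFS capacity, numbers taken from the question):

public class HdfsSpaceEstimate {
    public static void main(String[] args) {
        double dataGb = 115.0;      // raw data size in GB
        double capacityGb = 130.0;  // total HDFS capacity in GB
        for (int replication : new int[] {5, 3, 1}) {
            double requiredGb = dataGb * replication;
            System.out.printf("replication=%d -> need %.0f GB of raw capacity, have %.0f GB, fits=%b%n",
                    replication, requiredGb, capacityGb, requiredGb <= capacityGb);
        }
    }
}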

queues in haddop

2013-01-10 Thread Panshul Whisper
Hello, I have a Hadoop cluster setup of 10 nodes and I am in need of implementing queues in the cluster for receiving high volumes of data. Please suggest what will be more efficient to use in the case of receiving 24 million JSON files (approx. 5 KB each) every 24 hours: 1. Using Capacity

Re: HDFS disk space requirement

2013-01-10 Thread Ravi Mutyala
If the file is a text file, you could get a good compression ratio. Change the replication to 3 and the file will fit. But I'm not sure what your use case is or what you want to achieve by putting this data there. Any transformation on this data would need more space to save the transformed data.

Re: HDFS disk space requirement

2013-01-10 Thread Panshul Whisper
Thank you for the response. Actually it is not a single file; I have JSON files that amount to 115 GB. These JSON files need to be processed and loaded into HBase tables on the same cluster for later processing. Not considering the disk space required for the HBase storage, if I reduce the

How to interpret the progress meter?

2013-01-10 Thread Roy Smith
I'm running a job that looks like it's going to take about 12 hours on 4 EC2 instances. I don't really understand the completion percentages reported by http://localhost:9100/jobtasks.jsp. They are extremely non-linear. For my reduce steps, they ramp up to 40-60% in just a few minutes, then

Re: Why the official Hadoop Documents are so messy?

2013-01-10 Thread Jason Lee
Thanks, you guys; your replies are deeply appreciated. I am just a newbie to Hadoop. I noticed this problem when I was looking for documentation on the official site. No offense; I just think the official documentation may be confusing for beginners like me. I'm very glad to post this to hadoop

Re: queues in haddop

2013-01-10 Thread Mohit Anchlia
Have you looked at Flume? Sent from my iPhone On Jan 10, 2013, at 7:12 PM, Panshul Whisper ouchwhis...@gmail.com wrote: Hello, I have a Hadoop cluster setup of 10 nodes and I am in need of implementing queues in the cluster for receiving high volumes of data. Please suggest what will be

Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

2013-01-10 Thread Mahesh Balija
Hi, 2 reducers completed successfully and 1498 have been killed. I assume that you have data issues (either the data is huge or there are some issues with the data you are trying to process). One possibility could be that you have many values associated with a single key, which can

Re: How to interpret the progress meter?

2013-01-10 Thread Mahesh Balija
Hi Smith, In my experience, the actual processing usually occurs in the first 40% to around 70%; the remainder is devoted to writing/flushing the data to the output files, which may take more time. Best, Mahesh Balija, Calsoft Labs. On Fri, Jan 11, 2013 at 9:32 AM, Roy Smith

Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

2013-01-10 Thread yaotian
Yes, you are right. The data is GPS traces keyed by the corresponding uid. The reduce is doing this: sort per user to get this kind of result: uid, gps1, gps2, gps3. Yes, the GPS data is big because this is 30G of data. How do I solve this? 2013/1/11 Mahesh Balija balijamahesh@gmail.com Hi,

Re: queues in haddop

2013-01-10 Thread shashwat shriparv
The attached screenshot shows how Flume works; you can also consider RabbitMQ, as it is persistent too. ∞ Shashwat Shriparv On Fri, Jan 11, 2013 at 10:24 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Have you looked at Flume? Sent from my iPhone On Jan 10, 2013, at 7:12 PM,

Re: HDFS disk space requirement

2013-01-10 Thread Alexander Pivovarov
Finish elementary school first (plus and minus operations at least). On Thu, Jan 10, 2013 at 7:23 PM, Panshul Whisper ouchwhis...@gmail.com wrote: Thank you for the response. Actually it is not a single file; I have JSON files that amount to 115 GB. These JSON files need to be processed and

Re: How to interpret the progress meter?

2013-01-10 Thread Harsh J
The map-side percentage tracks the map's record reader as it reports its progress. The reduce side is divided into 3 phases of ~33% each: shuffle (fetch data), sort, and finally user code (reduce). It is normal to see jumps between these values, depending on the work to be done, etc. On Fri, Jan 11,
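A tiny illustrative sketch (not Hadoop source code) of why the reduce percentage jumps: the displayed value is roughly the average of the three phase progresses, each contributing about a third:

public class ReduceProgressSketch {
    static double overall(double shuffle, double sort, double reduce) {
        return (shuffle + sort + reduce) / 3.0; // each phase progress in [0.0, 1.0]
    }
    public static void main(String[] args) {
        System.out.println(overall(1.0, 1.0, 0.0)); // shuffle and sort done, reduce not started -> ~0.67
        System.out.println(overall(1.0, 1.0, 0.5)); // halfway through the user code -> ~0.83
    }
}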

Re: queues in haddop

2013-01-10 Thread Harsh J
Your question is unclear: HDFS has no queues for ingesting data (it is a simple, distributed FileSystem). The Hadoop M/R and Hadoop YARN components have queues for data-processing purposes. On Fri, Jan 11, 2013 at 8:42 AM, Panshul Whisper ouchwhis...@gmail.com wrote: Hello, I have a Hadoop
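To illustrate the processing-side queues Harsh mentions, a minimal sketch (the queue name "ingest" is hypothetical and must exist in your scheduler configuration) of routing a job to a named queue with the classic MR1 property:

import org.apache.hadoop.conf.Configuration;

public class QueueSubmitSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Route the job to a scheduler queue; a job built from this conf
        // would then be submitted as usual with job.waitForCompletion(true).
        conf.set("mapred.job.queue.name", "ingest");
        System.out.println("Submitting to queue: " + conf.get("mapred.job.queue.name"));
    }
}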

Re: HDFS disk space requirement

2013-01-10 Thread shashwat shriparv
115 * 5 = 575 GB is the minimum you need; keep in mind that is the bare minimum, and you will have other disk space needs too... ∞ Shashwat Shriparv On Fri, Jan 11, 2013 at 11:19 AM, Alexander Pivovarov apivova...@gmail.com wrote: Finish elementary school first (plus and minus operations at least). On Thu, Jan

Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

2013-01-10 Thread Harsh J
If the per-record processing time is very high, you will need to periodically report status. Without a status-change report from the task to the tracker, it will be killed as a dead task after a default timeout of 10 minutes (600s). Also, beware of holding too much memory in a reduce JVM -

Re: Sub-queues in capacity scheduler

2013-01-10 Thread Harsh J
Hierarchical queues are a feature of YARN's CapacityScheduler, which isn't available in 1.x-based releases/distributions such as CDH3u4. On Fri, Jan 11, 2013 at 6:50 AM, Patai Sangbutsarakum silvianhad...@gmail.com wrote: The Capacity Scheduler's hierarchical queues are exactly what we're looking for.
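For what it's worth, an illustrative sketch (property names only; in a real cluster these belong in capacity-scheduler.xml on the ResourceManager, and the queue names here are made up) of how YARN's CapacityScheduler expresses a queue hierarchy:

import org.apache.hadoop.conf.Configuration;

public class HierarchicalQueueSketch {
    public static void main(String[] args) throws Exception {
        Configuration capSched = new Configuration(false);
        capSched.set("yarn.scheduler.capacity.root.queues", "prod,dev");       // children of root
        capSched.set("yarn.scheduler.capacity.root.prod.capacity", "70");
        capSched.set("yarn.scheduler.capacity.root.dev.capacity", "30");
        capSched.set("yarn.scheduler.capacity.root.dev.queues", "adhoc,test"); // dev has sub-queues
        capSched.set("yarn.scheduler.capacity.root.dev.adhoc.capacity", "60");
        capSched.set("yarn.scheduler.capacity.root.dev.test.capacity", "40");
        capSched.writeXml(System.out); // emits the equivalent XML snippet
    }
}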

Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

2013-01-10 Thread yaotian
See inline. 2013/1/11 Harsh J ha...@cloudera.com If the per-record processing time is very high, you will need to periodically report status. Without a status-change report from the task to the tracker, it will be killed as a dead task after a default timeout of 10 minutes (600s).

Re: I am running MapReduce on a 30G data on 1master/2 slave, but failed.

2013-01-10 Thread bejoy . hadoop
Hi, To add to Harsh's comments: you need not change the task timeout. In your map/reduce code, you can increment a counter or report status at intermediate intervals so that there is communication from the task, and hence it won't have a task timeout. Every map and reduce task run
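A minimal sketch (new-API reducer; the key/value types and counter names are illustrative) of what Bejoy and Harsh describe: stream over the values for a key without buffering them all in memory, and periodically report progress so the task is not timed out:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class GpsTraceReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text uid, Iterable<Text> points, Context context)
            throws IOException, InterruptedException {
        long seen = 0;
        for (Text point : points) {
            // Stream each value straight back out instead of buffering the
            // whole trace in memory (Harsh's warning about the reduce JVM heap).
            context.write(uid, point);
            if (++seen % 100000 == 0) {
                context.progress();                                    // heartbeat so the task is not declared dead
                context.getCounter("GPS", "POINTS").increment(100000); // counters also count as task activity
            }
        }
    }
}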