There isn't any limit like that. Can you reproduce this consistently? If
so, please file a ticket.
It will definitely help if you can provide a test case which can reproduce
this issue.
Thanks,
+Vinod
On Thu, Jan 10, 2013 at 12:41 AM, Utkarsh Gupta
utkarsh_gu...@infosys.com wrote:
Hi,
Thanks for replies!
Hemanth,
I could see the following exception in the TaskTracker log:
https://issues.apache.org/jira/browse/MAPREDUCE-5
But I'm not sure if it is related to this issue.
Now, when a job completes, the directories under the jobCache should get
cleaned up automatically. However, it doesn't
Hi,
On Thu, Jan 10, 2013 at 5:17 PM, Ivan Tretyakov itretya...@griddynamics.com
wrote:
Thanks for replies!
Hemanth,
I could see following exception in TaskTracker log:
https://issues.apache.org/jira/browse/MAPREDUCE-5
But I'm not sure if it is related to this issue.
Now, when a job
Is this the same as:
http://stackoverflow.com/questions/6137139/how-to-save-only-non-empty-reducers-output-in-hdfs?
i.e. LazyOutputFormat, etc. ?
On Thu, Jan 10, 2013 at 4:51 PM, Pratyush Chandra
chandra.praty...@gmail.com wrote:
Hi,
I am using s3n as file system. I do not wish to create
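As a rough sketch of the LazyOutputFormat route (untested here; the driver class,
job name and paths are just illustrative, and on older 1.x releases the equivalent
class lives in org.apache.hadoop.mapred.lib for the old JobConf API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LazyOutputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "lazy-output-example");
        job.setJarByClass(LazyOutputDriver.class);
        // Mapper/Reducer omitted: with the defaults this runs as an identity job.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Wrap the real output format so part files are only created when a
        // task actually writes at least one record; empty reducers then leave
        // no empty part-r-* files behind (the same applies on s3n paths).
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}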
I also found the following exception in the datanode log; I suppose it might give some
clue:
2013-01-10 11:37:55,397 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode:
node02.303net.pvt:50010:DataXceiver error processing READ_BLOCK operation
src: /192.168.1.112:35991 dest: /192.168.1.112:50010
As soon as a job completes, your jobCache should be cleared. Check your
mapred-site.xml for the mapred.local.dir setting and make sure the job cleanup
step is successful in the web UI. Setting your job's intermediate output
compression to true will keep the jobCache folder smaller.
Artem Ervits
Data Analyst
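For reference, a quick sketch of the settings being referred to, assuming the MR1
property names; they can equally be set in mapred-site.xml instead of in code:

import org.apache.hadoop.conf.Configuration;

public class JobCacheSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Compress intermediate (map) output so the spill files kept under
        // the jobCache/local directories stay smaller.
        conf.setBoolean("mapred.compress.map.output", true);
        // These are the local directories that fill up when job cleanup
        // does not run; check the same value in mapred-site.xml.
        System.out.println("mapred.local.dir = " + conf.get("mapred.local.dir"));
    }
}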
Yes, it worked. Thanks
Pratyush
On Thu, Jan 10, 2013 at 6:14 PM, Hemanth Yamijala yhema...@thoughtworks.com
wrote:
Is this the same as:
http://stackoverflow.com/questions/6137139/how-to-save-only-non-empty-reducers-output-in-hdfs?
i.e. LazyOutputFormat, etc. ?
On Thu, Jan 10, 2013 at
Hi Ivan,
Here are a couple of more suggestions provided by the wiki:
http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo
Regards,
Robert
On Thu, Jan 10, 2013 at 5:33 AM, Ivan Tretyakov itretya...@griddynamics.com
wrote:
I also found following exception in datanode, I suppose it might give
Can you check the job configuration for these ~100 jobs? Do they have
keep.failed.task.files set to true? If so, these files won't be deleted. If
not, it could be a bug.
Sharing your configs for these jobs will definitely help.
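As a minimal sketch (the job.xml path is just illustrative), one way to check that
flag for a given job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class CheckKeepFailedTaskFiles {
    public static void main(String[] args) {
        Configuration conf = new Configuration(false); // skip site defaults
        // args[0] = path to a job.xml saved from one of the ~100 jobs.
        conf.addResource(new Path(args[0]));
        // Defaults to false; if true, task files under the jobCache are
        // deliberately kept for debugging and are never cleaned up.
        boolean keep = conf.getBoolean("keep.failed.task.files", false);
        System.out.println("keep.failed.task.files = " + keep);
    }
}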
Thanks,
+Vinod
On Wed, Jan 9, 2013 at 6:41 AM, Ivan
Oh, and user@ is the correct mailing list.
+Vinod
On Thu, Jan 10, 2013 at 9:32 AM, Vinod Kumar Vavilapalli
vino...@hortonworks.com wrote:
Great catch, it's a shame it still exists in 1.* stable releases! Can you
please file a ticket and fix it, thanks!
+Vinod
On Thu, Jan 10, 2013 at
Hi
ambari-user@ is probably the better list for this.
It seems like your puppet command is timing out. Could you reply back with the
contents of the /var/log/puppet_apply.log from the node in question?
Also, it might be worth waiting a few days for the next release of ambari which
should
Rodrigo
GETCONTENTSUMMARY will return the summary of everything under the path you
specified, including the subdirectories. So I would suggest taking a look at the
directories to see what content they have, and then the numbers should add up.
--
Arpit Gupta
Hortonworks Inc.
http://hortonworks.com/
Forgot to mention that the path you are using in the API will also count
towards the directory count.
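For comparison, a rough sketch of pulling the same numbers through the Java
FileSystem API; the path argument is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ContentSummaryCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // args[0] = the directory you queried over WebHDFS.
        ContentSummary cs = fs.getContentSummary(new Path(args[0]));
        // The counts are recursive, and the queried path itself is
        // included in the directory count.
        System.out.println("directories = " + cs.getDirectoryCount());
        System.out.println("files       = " + cs.getFileCount());
        System.out.println("bytes       = " + cs.getLength());
    }
}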
--
Arpit Gupta
Hortonworks Inc.
http://hortonworks.com/
On Jan 10, 2013, at 3:55 PM, Arpit Gupta ar...@hortonworks.com wrote:
Rodrigo
GETCONTENTSUMMARY will return the summary of
Capacity scheduler's Hierarchical Queue is exactly what we're looking
for. I really wonder if it is possible (and how) to make it work in
cdh3u4.
-P
Good point. Forgot that one :-)
On Thu, Jan 10, 2013 at 10:53 PM, Vinod Kumar Vavilapalli
vino...@hortonworks.com wrote:
Can you check the job configuration for these ~100 jobs? Do they have
keep.failed.task.files set to true? If so, these files won't be deleted. If
it doesn't, it could
Hello,
I have a Hadoop cluster of 5 nodes with a total available HDFS space of 130
GB and replication set to 5.
I have a file of 115 GB, which needs to be copied to HDFS and processed.
Do I need to have any more HDFS space for performing all the processing without
running into any problems? Or is
If the replication factor is 5 you will need at least 5x the space of the
file. So this is not going to be enough.
On Thursday, January 10, 2013, Panshul Whisper wrote:
Hello,
I have a hadoop cluster of 5 nodes with a total of available HDFS space
130 GB with replication set to 5.
I have a
Hello,
I have a Hadoop cluster setup of 10 nodes and I am in need of implementing
queues in the cluster for receiving high volumes of data.
Please suggest what will be more efficient to use in the case of receiving
24 million JSON files (approx. 5 KB each) in every 24 hours:
1. Using Capacity
If the file is a text file, you could get a good compression ratio. Change
the replication to 3 and the file will fit. But I am not sure what your use case
is or what you want to achieve by putting this data there. Any transformation
on this data and you would need more space to save the transformed data.
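If you do go the lower-replication route, a minimal sketch of changing it for an
existing file (equivalent to running hadoop fs -setrep 3 on the path; the path
argument is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // args[0] = the HDFS file whose replication you want to lower.
        // New files can instead set dfs.replication at write time.
        boolean requested = fs.setReplication(new Path(args[0]), (short) 3);
        System.out.println("replication change requested: " + requested);
    }
}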
Thank you for the response.
Actually it is not a single file; I have JSON files that amount to 115 GB.
These JSON files need to be processed and loaded into HBase tables
on the same cluster for later processing. Not considering the disk space
required for the HBase storage, if I reduce the
I'm running a job that looks like it's going to take about 12 hours on 4 EC2
instances. I don't really understand the completion percentages reported by
http://localhost:9100/jobtasks.jsp. They are extremely non-linear. For my
reduce steps, they ramp up to 40-60% in just a few minutes, then
Thanks, you guys; your replies are deeply appreciated.
I am just a newbie to Hadoop. I noticed this problem when I was looking for
some documentation on the official site. No offense, I just think this
official documentation may be confusing to beginners like me.
I'm very glad to post this to hadoop
Have you looked at flume?
Sent from my iPhone
On Jan 10, 2013, at 7:12 PM, Panshul Whisper ouchwhis...@gmail.com wrote:
Hello,
I have a Hadoop cluster setup of 10 nodes and I am in need of implementing
queues in the cluster for receiving high volumes of data.
Please suggest what will be
Hi,
2 reducers have successfully completed and 1498 have been killed.
I assume that you have data issues (either the data is huge or there are some
issues with the data you are trying to process).
One possibility could be that you have many values associated with a
single key, which can
Hi Smith,
In my experience, usually from the first 40% to around 70% the actual
processing occurs; the remainder is devoted to writing/flushing the data
to the output files, which may take more time.
Best,
Mahesh Balija,
Calsoft Labs.
On Fri, Jan 11, 2013 at 9:32 AM, Roy Smith
Yes, you are right. The data is GPS traces related to the corresponding uid. The
reduce is doing this: sort by user to get this kind of result: uid, gps1,
gps2, gps3
Yes, the GPS data is big because this is 30 GB of data.
How to solve this?
2013/1/11 Mahesh Balija balijamahesh@gmail.com
Hi,
The attached screenshot shows how Flume works, and you can also
consider RabbitMQ, as it is persistent too.
∞
Shashwat Shriparv
On Fri, Jan 11, 2013 at 10:24 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
Have you looked at flume?
Sent from my iPhone
On Jan 10, 2013, at 7:12 PM,
finish elementary school first. (plus, minus operations at least)
On Thu, Jan 10, 2013 at 7:23 PM, Panshul Whisper ouchwhis...@gmail.com wrote:
Thank you for the response.
Actually it is not a single file, I have JSON files that amount to 115 GB,
these JSON files need to be processed and
The map-side percentage is what the map's record reader reports as its
progress. The reduce side is divided into 3 phases of ~33% each -
shuffle (fetch data), sort and finally user code (reduce). It is
normal to see jumps between these values, depending on the work to be
done, etc.
On Fri, Jan 11,
Your question is unclear: HDFS has no queues for ingesting data (it is
a simple, distributed FileSystem). The Hadoop M/R and Hadoop YARN
components have queues for processing data purposes.
On Fri, Jan 11, 2013 at 8:42 AM, Panshul Whisper ouchwhis...@gmail.com wrote:
Hello,
I have a hadoop
115 * 5 = 575 GB is the minimum you need, and keep in mind that you will
have other disk space needs too...
∞
Shashwat Shriparv
On Fri, Jan 11, 2013 at 11:19 AM, Alexander Pivovarov
apivova...@gmail.com wrote:
finish elementary school first. (plus, minus operations at least)
On Thu, Jan
If the per-record processing time is very high, you will need to
periodically report a status. Without a status change report from the task
to the tracker, it will be killed away as a dead task after a default
timeout of 10 minutes (600s).
Also, beware of holding too much memory in a reduce JVM -
Hierarchical queues are a feature of YARN's CapacityScheduler, which
isn't available in 1.x based releases/distributions such as CDH3u4.
On Fri, Jan 11, 2013 at 6:50 AM, Patai Sangbutsarakum
silvianhad...@gmail.com wrote:
Capacity scheduler's Hierarchical Queue is exactly what we're looking
for.
See inline.
2013/1/11 Harsh J ha...@cloudera.com
If the per-record processing time is very high, you will need to
periodically report a status. Without a status change report from the task
to the tracker, it will be killed away as a dead task after a default
timeout of 10 minutes (600s).
Hi
To add on to Harsh's comments.
You do not have to change the task timeout.
In your map/reduce code, you can increment a counter or report status
at intervals so that there is communication from the task and
hence it won't hit the task timeout.
Every map and reduce task run
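As a minimal sketch of that kind of in-task reporting with the new-API Reducer
(the counter group/name, the 10,000-record interval and the key/value types are
illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SlowKeyReducer extends Reducer<Text, Text, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long seen = 0;
        for (Text value : values) {
            // ... expensive per-record work goes here ...
            seen++;
            if (seen % 10000 == 0) {
                // Any of these tells the TaskTracker the task is still alive,
                // so it is not killed after mapred.task.timeout (600s default).
                context.progress();
                context.setStatus("processed " + seen + " values for " + key);
                context.getCounter("app", "values-processed").increment(10000);
            }
        }
        context.write(key, new LongWritable(seen));
    }
}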