Re: Hadoop Production Issue

2011-07-16 Thread jagaran das
Thanks Dmitriy.
1. So I can write a Pig job that would merge the files.
2. But the Pig merge job itself would run over the small files; would that not affect performance?
3. For the copy again: if we want to run the copy command, do we need Hadoop installed on that machine, or are you suggesting we use the Java API?

Re: Hadoop Production Issue

2011-07-16 Thread Dmitriy Ryaboy
1) Correct.
2) Copy to the cluster from any machine; just have the config on the classpath, or specify the full path in your copy command (hdfs://my-nn/path/to/destination).

On Sat, Jul 16, 2011 at 1:00 PM, jagaran das wrote:
> ok then
> 1. We have to write a pig job for merging or PIG itself
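Dmitriy's second point can be sketched as shell commands. This is an illustrative sketch, not from the thread: the NameNode host `my-nn`, the port, and the paths are all placeholders for your environment.

```shell
# Copy from a non-cluster machine by giving the full HDFS URI
# (scheme + NameNode host) directly in the command; no local
# cluster config is needed in that case.
hadoop fs -put /local/data/part-00000 hdfs://my-nn:8020/path/to/destination

# Alternatively, point the client at a copy of the cluster config
# and use a plain path:
export HADOOP_CONF_DIR=/etc/hadoop/conf
hadoop fs -put /local/data/part-00000 /path/to/destination
```

Either way the machine only needs the Hadoop client libraries and network access to the NameNode and DataNodes; it does not have to be part of the cluster.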

Re: Hadoop Production Issue

2011-07-16 Thread jagaran das
Our config: 72 GB RAM, 4 quad-core processors, 1.8 TB local storage, 10-node CDH3 cluster.

Re: Hadoop Production Issue

2011-07-16 Thread jagaran das
ok then
1. We have to write a Pig job for merging, or Pig itself merges, so that fewer mappers are invoked.
2. Can we copy to a cluster from a non-cluster machine, using the namespace URI of the NN? We could dedicate some good-config boxes to do our merging and copying and then copy it

Re: Hadoop Production Issue

2011-07-16 Thread Dmitriy Ryaboy
Merging doesn't actually speed things up all that much; it reduces load on the NameNode and speeds up job initialization somewhat. You don't have to do the merging on the namenode itself, nor do you have to do the copying on the NN. In fact, don't run anything but the NameNode on the namenode. Pig jobs can t
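One way Pig can reduce the mapper count without a separate merge job is split combination, which packs many small input files into fewer map splits. This is a sketch assuming Pig 0.8 (as shipped with CDH3); the 128 MB target size is just an example value, not from the thread:

```shell
# Ask Pig to combine small input files into larger map splits.
# pig.splitCombination is on by default in recent Pig versions;
# pig.maxCombinedSplitSize caps the combined split size in bytes
# (134217728 = 128 MB, an illustrative value).
pig -Dpig.splitCombination=true \
    -Dpig.maxCombinedSplitSize=134217728 \
    myscript.pig
```

This addresses the job-initialization cost of many small files, though the files themselves still exist, so the NameNode load Dmitriy mentions is unchanged.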

Re: Hadoop Production Issue

2011-07-16 Thread Jeremy Hanna
One thing that we use is filecrush to merge small files below a threshold. It works pretty well. http://www.jointhegrid.com/hadoop_filecrush/index.jsp

On Jul 16, 2011, at 1:17 AM, jagaran das wrote:
> Hi,
> Due to requirements in our current production CDH3 cluster we need to copy
> a
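The basic idea behind a "crush" pass (concatenate every file that falls below a size threshold into one larger file) can be sketched against a local directory. This is a local-filesystem illustration of the idea only, not the filecrush tool itself; the function name and arguments are invented for the example.

```shell
# Concatenate every regular file smaller than $3 bytes found in
# directory $1 into the single output file $2.
# Local-filesystem sketch of the "crush" idea, not the filecrush tool.
crush_small_files() {
  src="$1"; out="$2"; threshold="$3"
  : > "$out"                          # start with an empty merged file
  for f in "$src"/*; do
    [ -f "$f" ] || continue           # skip non-files
    [ "$f" = "$out" ] && continue     # never merge the output into itself
    size=$(wc -c < "$f")
    if [ "$size" -lt "$threshold" ]; then
      cat "$f" >> "$out"              # append the small file
    fi
  done
}
```

On HDFS the same pass would read and rewrite files through the Hadoop API rather than `cat`, which is essentially what filecrush does as a MapReduce job.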