Thanks Dimitry.
1. So I can write a Pig job that merges the files.
2. But that Pig job would still have to read all the small files; wouldn't
that affect performance?
3. For the copy: if we want to run the copy command, do we need Hadoop
installed on the machine, or are you suggesting we use Java?
1) Correct.
2) Copy to the cluster from any machine; just have the config on the
classpath, or specify the full path in your copy command
(hdfs://my-nn/path/to/destination).
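If it helps, here is a concrete sketch of option 2, assuming the Hadoop
client package is installed on the non-cluster box (the host and paths
below are placeholders):

    # Full NameNode URI spelled out -- no local cluster config needed:
    hadoop fs -put /local/data/*.log hdfs://my-nn/path/to/destination/

    # Or point at a copy of the cluster's config and use plain paths:
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    hadoop fs -put /local/data/*.log /path/to/destination/

Either way the machine only needs the client bits and network access to
the NN and datanodes; it does not have to be part of the cluster.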
On Sat, Jul 16, 2011 at 1:00 PM, jagaran das wrote:
> ok then
>
> 1. We have to write a pig job for merging or PIG itself
Our Config:
72 GB RAM, 4 quad-core processors, 1.8 TB local disk
10-node CDH3 cluster
From: jagaran das
To: "user@pig.apache.org"
Sent: Saturday, 16 July 2011 11:00 AM
Subject: Re: Hadoop Production Issue
ok then
1. Do we have to write a Pig job for the merging, or does Pig itself merge
the inputs so that fewer mappers are invoked?
2. Can we copy to a cluster from a non-cluster machine, using the namespace
URI of the NN? We could dedicate some well-provisioned boxes to do our
merging and copying, and then copy the result over.
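For what it's worth, a minimal sketch of the "let Pig merge" route
(paths are placeholders; assumes Pig 0.8+, as shipped with CDH3, where
split combination is available): a pass-through job that reads many small
files with combined input splits and writes them back out as
correspondingly fewer, larger files:

    pig <<'EOF'
    SET pig.splitCombination true;
    SET pig.maxCombinedSplitSize 134217728;  -- ~128 MB per combined split
    raw = LOAD '/input/small-files' USING PigStorage();
    STORE raw INTO '/output/merged' USING PigStorage();
    EOF

Split combination only reduces the mapper count for this job; storing the
result is what actually leaves merged files behind for later jobs.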
Merging doesn't actually speed things up all that much; it reduces load
on the NameNode and speeds up job initialization somewhat. You don't have
to do the merging on the namenode itself, nor the copying. In fact, don't
run anything but the NameNode on the namenode box.
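As a concrete example of doing the merge from any box with the Hadoop
client installed (not the NN; paths are placeholders), a crude round-trip
through local disk:

    # Concatenate an HDFS directory's files into one local file...
    hadoop fs -getmerge /input/small-files /tmp/merged.dat
    # ...then push the single large file back up to HDFS:
    hadoop fs -put /tmp/merged.dat /input/merged/merged.dat

This is only practical when the merged output fits comfortably on the
local disk of the box doing the work.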
One thing that we use is filecrush to merge small files below a threshold. It
works pretty well.
http://www.jointhegrid.com/hadoop_filecrush/index.jsp
On Jul 16, 2011, at 1:17 AM, jagaran das wrote:
>
>
> Hi,
>
> Due to requirements in our current production CDH3 cluster we need to copy
> a