Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread Shivram Mani
If you still want to use distcp: 1. Break the file into smaller files (only if you have the luxury of doing this). 2. Use the "-m" option to set the number of mappers. Each map task will aim at copying (total bytes across all files) / numSplits; distcp uses the UniformSizeInputFormat by default. 3. di
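For reference, item 2 can be sketched as a dry-run shell snippet. It only prints the per-map copy target and the distcp command line; it does not touch any cluster. The paths, the map count of 64, and the helper names per_map_mib and distcp_cmd are hypothetical placeholders, not values from this thread. The 2 TiB total matches the size mentioned elsewhere in the thread.

```shell
#!/bin/sh
# Dry run only: prints values, never contacts a cluster.
# All paths and the map count below are hypothetical placeholders.

# Per-map copy target in MiB: total size divided by number of maps.
# $1 = total size in MiB, $2 = number of maps
per_map_mib() {
  echo $(( $1 / $2 ))
}

# Build (but do not run) a distcp command line with an explicit -m.
# $1 = number of maps, $2 = source URI, $3 = destination URI
distcp_cmd() {
  echo "hadoop distcp -m $1 $2 $3"
}

TOTAL_MIB=$(( 2 * 1024 * 1024 ))   # 2 TiB expressed in MiB
MAPS=64

echo "per-map target: $(per_map_mib "$TOTAL_MIB" "$MAPS") MiB"
distcp_cmd "$MAPS" hdfs://from_cluster/data hdfs://to_cluster/data
```

As noted elsewhere in this thread, distcp copies whole files, so a larger -m value is only likely to help when there are enough files to spread across the maps.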

Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread Shivram Mani
Distcp is pretty restrictive w.r.t. parallelizing data copy. If all you are copying is one large file, distcp won't make it any faster. In distcp, files are the lowest level of granularity, so increasing the number of maps may not necessarily increase the overall throughput. The default number of

Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread ch huang
yes On Sat, Oct 18, 2014 at 3:53 AM, Jakub Stransky wrote: > Distcp? > On 17 Oct 2014 20:51, "Alexander Pivovarov" wrote: > >> try to run on dest cluster datanode >> $ hadoop fs -cp hdfs://from_cluster/ hdfs://to_cluster/ >> >> >> >> On Fri, Oct 17, 2014 at 11:26 AM, Shivram Mani wr

Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread ch huang
Some files; total size is 2T, and block size is 128M. On Sat, Oct 18, 2014 at 2:26 AM, Shivram Mani wrote: > What is your approx input size ? > Do you have multiple files or is this one large file ? > What is your block size (source and destination cluster) ? > > On Fri, Oct 17, 2014 at 4:19 AM

Re: Dynamically set map / reducer memory

2014-10-17 Thread Girish Lingappa
Peter, if you are using Oozie to launch the MR jobs, you can specify the memory requirements in the workflow action specific to each job, in the workflow XML you are using to launch the job. If you are writing your own driver program to launch the jobs, you can still set these parameters in the job c
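One way to set memory per job from the command line, as a hedged sketch: the snippet below only prints the command lines and does not submit anything. It assumes the Hadoop 2 (YARN) property names mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, and a driver that runs through ToolRunner so that -D generic options are honored. The jar names, driver classes, memory values, and the helper name job_cmd are hypothetical.

```shell
#!/bin/sh
# Dry run only: prints the command lines instead of submitting jobs.
# Jar names, driver classes and memory sizes are hypothetical.
# Assumes Hadoop 2 (YARN) property names and a ToolRunner-based driver.

# $1 = jar, $2 = driver class, $3 = map memory (MB), $4 = reduce memory (MB)
job_cmd() {
  echo "hadoop jar $1 $2 -D mapreduce.map.memory.mb=$3 -D mapreduce.reduce.memory.mb=$4"
}

# A light job followed by a memory-hungry one, each with its own settings.
job_cmd light-job.jar LightDriver 1024 2048
job_cmd heavy-job.jar HeavyDriver 4096 8192
```

The JVM heap settings (mapreduce.map.java.opts and mapreduce.reduce.java.opts) usually need to be raised along with the container sizes.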

Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread Jakub Stransky
Distcp? On 17 Oct 2014 20:51, "Alexander Pivovarov" wrote: > try to run on dest cluster datanode > $ hadoop fs -cp hdfs://from_cluster/ hdfs://to_cluster/ > > > > On Fri, Oct 17, 2014 at 11:26 AM, Shivram Mani wrote: > >> What is your approx input size ? >> Do you have multiple files

Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread Alexander Pivovarov
Try running this on a datanode in the destination cluster: $ hadoop fs -cp hdfs://from_cluster/ hdfs://to_cluster/ On Fri, Oct 17, 2014 at 11:26 AM, Shivram Mani wrote: > What is your approx input size ? > Do you have multiple files or is this one large file ? > What is your block size (source and destina

Re: Spark vs Tez

2014-10-17 Thread Gavin Yue
Spark and Tez both make MR faster; there is no doubt about that. They also provide new features like DAGs, which are quite important for interactive query processing. From this perspective, you could view them as wrappers around MR that try to handle the intermediate buffers (files) more efficiently. It is a b

Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread Shivram Mani
What is your approximate input size? Do you have multiple files, or is this one large file? What is your block size (source and destination cluster)? On Fri, Oct 17, 2014 at 4:19 AM, ch huang wrote: > no ,all default > > On Fri, Oct 17, 2014 at 5:46 PM, Azuryy Yu wrote: > >> Did you specified how

Re: Spark vs Tez

2014-10-17 Thread Adaryl "Bob" Wakefield, MBA
“The only problem with Spark adoption is the steep learning curve of Scala, and understanding the API properly.” This is why I’m looking for reasons to avoid Spark. In my mind, it’s one more thing to have to master and doesn’t really have anything to offer that can’t be done with other tools

Re: Spark vs Tez

2014-10-17 Thread Alexander Pivovarov
There is going to be a Spark engine for Hive (in addition to MR and Tez). The Spark API is available for Java and Python as well. The Tez engine is available now and it's quite stable. As for speed: for complex queries it shows a 10x-20x improvement in comparison to the MR engine, e.g. one of my queries runs 30 min

Dynamically set map / reducer memory

2014-10-17 Thread peter 2
Hi guys, I am trying to run a few MR jobs in succession; some of the jobs don't need that much memory and others do. I want to be able to tell Hadoop how much memory should be allocated for the mappers of each job. I know how to increase the memory for a mapper JVM through the mapred XML. I

Re: Spark vs Tez

2014-10-17 Thread Adaryl "Bob" Wakefield, MBA
It was my understanding that Spark is faster batch processing. Tez is the new execution engine that replaces MapReduce and is also supposed to speed up batch processing. Is that not correct? B. From: Shahab Yunus Sent: Friday, October 17, 2014 1:12 PM To: user@hadoop.apache.org Subject: Re:

Re: Spark vs Tez

2014-10-17 Thread kartik saxena
I did a performance benchmark during my summer internship. I am currently a grad student. I can't reveal much about the specific project, but Spark is still faster than around the 4th-5th iteration of Tez on the same query/dataset. By iteration I mean utilizing the "hot-container" property of Apache Tez.

Re: Spark vs Tez

2014-10-17 Thread Alexander Pivovarov
Spark's creator, AMPLab, did some benchmarks: https://amplab.cs.berkeley.edu/benchmark/ On Fri, Oct 17, 2014 at 11:06 AM, Adaryl "Bob" Wakefield, MBA < adaryl.wakefi...@hotmail.com> wrote: > Does anybody have any performance figures on how Spark stacks up > against Tez? If you don’t have figures, d

Re: Spark vs Tez

2014-10-17 Thread Shahab Yunus
What aspects of Tez and Spark are you comparing? They have different purposes and thus are not directly comparable, as far as I understand. Regards, Shahab On Fri, Oct 17, 2014 at 2:06 PM, Adaryl "Bob" Wakefield, MBA < adaryl.wakefi...@hotmail.com> wrote: > Does anybody have any performance figure

Spark vs Tez

2014-10-17 Thread Adaryl "Bob" Wakefield, MBA
Does anybody have any performance figures on how Spark stacks up against Tez? If you don’t have figures, does anybody have an opinion? Spark seems so popular but I’m not really seeing why. B.

Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread ch huang
No, all default. On Fri, Oct 17, 2014 at 5:46 PM, Azuryy Yu wrote: > Did you specified how many map tasks? > > > On Fri, Oct 17, 2014 at 4:58 PM, ch huang wrote: > >> hi,maillist: >> i now use distcp to migrate data from CDH4.4 to CDH5.1 , i >> find when copy small file,it very good

Re: hadoop 2.4 using Protobuf - How does downgrade back to 2.3 works ?

2014-10-17 Thread Azuryy Yu
Just stop your cluster, then start HDFS with '-rollback'. This works only if you have not finalized the HDFS upgrade from the command line. On Fri, Oct 17, 2014 at 8:15 AM, Manoj Samel wrote: > Hadoop 2.4.0 mentions that FSImage is stored using protobuf. So upgrade > from 2.3.0 to 2.4 would work since 2

Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread Azuryy Yu
Did you specify how many map tasks? On Fri, Oct 17, 2014 at 4:58 PM, ch huang wrote: > hi,maillist: > i now use distcp to migrate data from CDH4.4 to CDH5.1 , i > find when copy small file,it very good, but when transfer big data ,it very > slow ,any good method recommand? thanks

how to copy data between two hdfs cluster fastly?

2014-10-17 Thread ch huang
Hi, mailing list: I am using distcp to migrate data from CDH4.4 to CDH5.1. Copying small files works very well, but transferring big data is very slow. Can anyone recommend a good method? Thanks

How do i find volume failure using java code???

2014-10-17 Thread cho ju il
Hadoop 2.4.1, 2 namenodes (HA), 3 datanodes. I want to find failed volumes, but getVolumeFailures() always returns zero. How do I find volume failures using Java code? Configuration conf = getConf(configPath); FileSystem fs = null; try {