How do i find volume failure using java code???

2014-10-17 Thread cho ju il
Hadoop 2.4.1
2 namenodes(ha), 3 datanodes.
I want to find failed volumes, but getVolumeFailures() always returns zero.
How do I find volume failures using Java code?
 
Configuration conf = getConf(configPath);
FileSystem fs = null;
try {
    fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
        System.err.println("FileSystem is " + fs.getUri());
        return;
    }

    DistributedFileSystem dfs = (DistributedFileSystem) fs;
    DatanodeInfo[] nodes = dfs.getDataNodeStats(DatanodeReportType.ALL);
    for (DatanodeInfo node : nodes) {
        if (node instanceof DatanodeID) {
            DatanodeDescriptor desc = new DatanodeDescriptor(node);
            // getVolumeFailures() always returns zero.
            System.out.println(desc.getVolumeFailures());
        }
    }
} catch (IOException ioe) {
    System.err.println("FileSystem is inaccessible due to:\n"
            + StringUtils.stringifyException(ioe));
    return;
}


how to copy data between two hdfs cluster fastly?

2014-10-17 Thread ch huang
Hi, mailing list:
 I am using distcp to migrate data from CDH4.4 to CDH5.1. Copying small
files works well, but transferring large amounts of data is very slow. Can
anyone recommend a better method? Thanks.


Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread Azuryy Yu
Did you specify how many map tasks to use?


On Fri, Oct 17, 2014 at 4:58 PM, ch huang justlo...@gmail.com wrote:

 hi,maillist:
  i now use distcp to migrate data from CDH4.4 to CDH5.1 , i
 find when copy small file,it very good, but when transfer big data ,it very
 slow ,any good method recommand? thanks



Re: hadoop 2.4 using Protobuf - How does downgrade back to 2.3 works ?

2014-10-17 Thread Azuryy Yu
Just stop your cluster, then start HDFS with '-rollback'. This only works
if you have not already finalized the HDFS upgrade from the command line.

On Fri, Oct 17, 2014 at 8:15 AM, Manoj Samel manojsamelt...@gmail.com
wrote:

 Hadoop 2.4.0 mentions that FSImage is stored using protobuf. So upgrade
 from 2.3.0 to 2.4 would work since 2.4 can read old (2.3) binary format and
 write the new 2.4 protobuf format.

 After using 2.4, if there is a need to downgrade back to 2.3, how would
 that work ?

 Thanks,



Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread ch huang
No, everything is at the defaults.

On Fri, Oct 17, 2014 at 5:46 PM, Azuryy Yu azury...@gmail.com wrote:

 Did you specified how many map tasks?







Spark vs Tez

2014-10-17 Thread Adaryl Bob Wakefield, MBA
Does anybody have any performance figures on how Spark stacks up against Tez? 
If you don’t have figures, does anybody have an opinion? Spark seems so popular 
but I’m not really seeing why.
B.

Re: Spark vs Tez

2014-10-17 Thread Shahab Yunus
Which aspects of Tez and Spark are you comparing? They have different
purposes and thus are not directly comparable, as far as I understand.

Regards,
Shahab

On Fri, Oct 17, 2014 at 2:06 PM, Adaryl Bob Wakefield, MBA 
adaryl.wakefi...@hotmail.com wrote:

   Does anybody have any performance figures on how Spark stacks up
 against Tez? If you don’t have figures, does anybody have an opinion? Spark
 seems so popular but I’m not really seeing why.
 B.



Re: Spark vs Tez

2014-10-17 Thread Alexander Pivovarov
AMPLab, the creator of Spark, has published some benchmarks:
https://amplab.cs.berkeley.edu/benchmark/

On Fri, Oct 17, 2014 at 11:06 AM, Adaryl Bob Wakefield, MBA 
adaryl.wakefi...@hotmail.com wrote:

   Does anybody have any performance figures on how Spark stacks up
 against Tez? If you don’t have figures, does anybody have an opinion? Spark
 seems so popular but I’m not really seeing why.
 B.



Re: Spark vs Tez

2014-10-17 Thread kartik saxena
I ran a performance benchmark during my summer internship (I am currently
a grad student). I can't reveal much about the specific project, but Spark
was still faster than roughly the 4th or 5th iteration of Tez on the same
query and dataset. By iteration I mean making use of the hot-container
property of Apache Tez. See the latest Tez release and some of the
Hortonworks tutorials on their website.

The only problem with Spark adoption is the steep learning curve of Scala,
and understanding the API properly.

Thanks

On Fri, Oct 17, 2014 at 11:06 AM, Adaryl Bob Wakefield, MBA 
adaryl.wakefi...@hotmail.com wrote:

   Does anybody have any performance figures on how Spark stacks up
 against Tez? If you don’t have figures, does anybody have an opinion? Spark
 seems so popular but I’m not really seeing why.
 B.



Re: Spark vs Tez

2014-10-17 Thread Adaryl Bob Wakefield, MBA
It was my understanding that Spark offers faster batch processing. Tez is the new 
execution engine that replaces MapReduce and is also supposed to speed up batch 
processing. Is that not correct?
B.



From: Shahab Yunus 
Sent: Friday, October 17, 2014 1:12 PM
To: user@hadoop.apache.org 
Subject: Re: Spark vs Tez

What aspects of Tez and Spark are you comparing? They have different purposes 
and thus not directly comparable, as far as I understand. 

Regards,
Shahab



Dynamically set map / reducer memory

2014-10-17 Thread peter 2

Hi guys,
I am trying to run a few MR jobs in succession. Some of the jobs don't 
need much memory and others do. I want to be able to tell Hadoop 
how much memory should be allocated for the mappers of each job.

I know how to increase the memory for a mapper JVM through the mapred XML.
I tried manually setting mapreduce.reduce.java.opts=-Xmx<someNumber>m, 
but it wasn't picked up by the mapper JVM; the global setting was always 
picked up.


In summary:
Job 1 - mappers need only 250 MB of RAM
Job 2 - mapper
   reducer needs around 2 GB

I don't want to be able to set those restrictions prior to submitting 
the job to my hadoop cluster.


Re: Spark vs Tez

2014-10-17 Thread Alexander Pivovarov
There is going to be a Spark engine for Hive (in addition to MR and Tez).

The Spark API is available for Java and Python as well.

The Tez engine is available now and it's quite stable. As for speed, for
complex queries it shows a 10x-20x improvement over the MR engine.
E.g. one of my queries runs for 30 min using MR (about 100 MR jobs); if I
switch to Tez it is done in 100 sec.

I'm using HDP-2.1.5 (hive-0.13.1, tez 0.4.1)

On Fri, Oct 17, 2014 at 11:23 AM, Adaryl Bob Wakefield, MBA 
adaryl.wakefi...@hotmail.com wrote:

   It was my understanding that Spark is faster batch processing. Tez is
 the new execution engine that replaces MapReduce and is also supposed to
 speed up batch processing. Is that not correct?
 B.








Re: Spark vs Tez

2014-10-17 Thread Adaryl Bob Wakefield, MBA
“The only problem with Spark adoption is the steep learning curve of Scala, 
and understanding the API properly.” 

This is why I’m looking for reasons to avoid Spark. In my mind, it’s one more 
thing to have to master, and it doesn’t really offer anything that can’t be 
done with other tools that are already in my skill set. I spoke with some 
software engineers recently, and the discussion basically boiled down to this: 
if you need to master Java or Scala, go with Java. Three months into Java, I 
don’t want to stop and start learning Scala.

B.

From: kartik saxena 
Sent: Friday, October 17, 2014 1:12 PM
To: user@hadoop.apache.org 
Subject: Re: Spark vs Tez

I did a performance benchmark during my summer internship . I am currently a 
grad student. Can't reveal much about the specific project but Spark is still 
faster than around 4-5th iteration of Tez of the same query/dataset. By 
Iteration I mean utilizing the hot-container property of Apache Tez  . See 
latest release of Tez and some hortonworks tutorials on their website.  

The only problem with Spark adoption is the steep learning curve of Scala , and 
understanding the API properly. 


Thanks




Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread Shivram Mani
What is your approximate input size?
Do you have multiple files, or is this one large file?
What is your block size (on the source and destination clusters)?

On Fri, Oct 17, 2014 at 4:19 AM, ch huang justlo...@gmail.com wrote:

 no ,all default







-- 
Thanks
Shivram


Re: Spark vs Tez

2014-10-17 Thread Gavin Yue
Spark and Tez both speed up MR-style workloads; there is no doubt about that.

They also provide new features like DAGs, which are quite important for
interactive query processing. From this perspective, you could view them
as wrappers around MR that try to handle the intermediate buffers (files)
more efficiently, which is a big pain point in MR.

They also both try to use memory as the buffer instead of relying only on
the filesystem. Spark has the concept of an RDD, which is quite interesting
but also limited.



On Fri, Oct 17, 2014 at 11:23 AM, Adaryl Bob Wakefield, MBA 
adaryl.wakefi...@hotmail.com wrote:

   It was my understanding that Spark is faster batch processing. Tez is
 the new execution engine that replaces MapReduce and is also supposed to
 speed up batch processing. Is that not correct?
 B.








Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread Jakub Stransky
Distcp?
On 17 Oct 2014 20:51, Alexander Pivovarov apivova...@gmail.com wrote:

 Try running this on a datanode in the destination cluster:
 $ hadoop fs -cp hdfs://from_cluster/ hdfs://to_cluster/








Re: Dynamically set map / reducer memory

2014-10-17 Thread Girish Lingappa
Peter

If you are using Oozie to launch the MR jobs, you can specify the memory
requirements in the workflow action specific to each job, in the workflow
XML you are using to launch the job. If you are writing your own driver
program to launch the jobs, you can still set these parameters in the job
configuration you are using to launch the job.
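
As a rough sketch of the driver-program route, the per-job overrides can be prepared as plain key/value pairs. The property names below are the standard Hadoop 2 ones (note that the mapper JVM reads mapreduce.map.java.opts, not mapreduce.reduce.java.opts). The class name, the helper method, and the rule of giving the JVM heap 80% of the container size are assumptions made for illustration only; they are not something from this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds per-job memory overrides for one task type. */
final class JobMemorySettings {

    /**
     * Returns the container size and JVM heap settings for "map" or "reduce" tasks.
     * Assumption: the heap is set to 80% of the container to leave headroom
     * for non-heap memory. Adjust to whatever your cluster needs.
     */
    static Map<String, String> forTask(String taskType, int containerMb) {
        if (!taskType.equals("map") && !taskType.equals("reduce")) {
            throw new IllegalArgumentException("taskType must be \"map\" or \"reduce\"");
        }
        int heapMb = containerMb * 4 / 5; // integer arithmetic, rounds down
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapreduce." + taskType + ".memory.mb", Integer.toString(containerMb));
        props.put("mapreduce." + taskType + ".java.opts", "-Xmx" + heapMb + "m");
        return props;
    }

    public static void main(String[] args) {
        // Job 1: small mappers. Job 2: reducers of about 2 GB.
        Map<String, String> job1 = forTask("map", 250);
        Map<String, String> job2 = forTask("reduce", 2048);
        job1.forEach((k, v) -> System.out.println("job1: " + k + "=" + v));
        job2.forEach((k, v) -> System.out.println("job2: " + k + "=" + v));
    }
}
```

In a real driver, each pair would be applied with Configuration.set(key, value) on that job's own configuration before the job is submitted, so the values differ from job to job without touching the cluster-wide XML.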
 In the case where you modified mapred-site.xml to set your memory
requirements, did you change it on the client machine where you are
launching the job?
 Please share more details on the setup and the way you are launching the
jobs so we can better understand the problem you are facing.

Girish

On Fri, Oct 17, 2014 at 11:24 AM, peter 2 regest...@gmail.com wrote:

  HI Guys,
 I am trying to run a few MR jobs in a succession, some of the jobs don't
 need that much memory and others do. I want to be able to tell hadoop how
 much memory should be allocated  for the mappers of each job.
 I know how to increase the memory for a mapper JVM, through the mapred
 xml.
 I tried manually setting the  mapreduce.reduce.java.opts = -XmxsomeNumberm
 , but wasn't picked up by the mapper jvm, the global setting was always
 been picked up .

 In summation
 Job 1 - Mappers need only 250 Mg of Ram
 Job2 - Mapper
Reducer need around - 2Gb

 I don't want to be able to set those restrictions prior to submitting the
 job to my hadoop cluster.



Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread ch huang
Several files; the total size is 2 TB and the block size is 128 MB.

On Sat, Oct 18, 2014 at 2:26 AM, Shivram Mani sm...@pivotal.io wrote:

 What is your approx input size ?
 Do you have multiple files or is this one large file ?
 What is your block size (source and destination cluster) ?




Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread ch huang
yes

On Sat, Oct 18, 2014 at 3:53 AM, Jakub Stransky stransky...@gmail.com
wrote:

 Distcp?





Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread Shivram Mani
Distcp is pretty restrictive with respect to parallelizing a data copy. If all
you are copying is one large file, distcp won't make it any faster.

In distcp, files are the lowest level of granularity, so increasing the number
of maps may not necessarily increase the overall throughput.

The default number of mappers for distcp is 20, if I’m not wrong. If all you
are doing is copying one large file, only one map task is effectively used.
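
To put the granularity point in concrete terms, here is a deliberately simplified model: since a file is never split across map tasks, the number of maps that can do useful work is capped by the number of files. The class and method names are made up for this illustration, and the model ignores file sizes, so treat the result as an upper bound rather than what distcp will actually schedule.

```java
/** Hypothetical helper illustrating file-level granularity in distcp. */
final class DistcpParallelism {

    /**
     * Upper bound on map tasks that can copy data at the same time.
     * Files are copied whole, so extra maps beyond the file count sit idle.
     */
    static int effectiveMaps(int requestedMaps, long fileCount) {
        if (requestedMaps < 1 || fileCount < 0) {
            throw new IllegalArgumentException("requestedMaps must be >= 1 and fileCount >= 0");
        }
        return (int) Math.min((long) requestedMaps, fileCount);
    }

    public static void main(String[] args) {
        System.out.println(effectiveMaps(20, 1));    // one large file
        System.out.println(effectiveMaps(20, 5000)); // many files
    }
}
```

With the default of 20 maps, a single large file still ends up on one map, while a directory of thousands of files can keep all 20 busy.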

On Fri, Oct 17, 2014 at 8:18 PM, ch huang justlo...@gmail.com wrote:

 yes







-- 
Thanks
Shivram


Re: how to copy data between two hdfs cluster fastly?

2014-10-17 Thread Shivram Mani
If you still want to use distcp:

1. Break the file into smaller files (only if you have the luxury of doing
this).

2. Use the "-m" option to set the number of mappers.

(Each map task will aim to copy (total bytes across all files) /
numSplits. distcp uses the UniformSizeInputFormat by default.)

3. distcp by default uses a throttled input stream, which by default is set
to 100 MB/s per map. You can tune this based on your network bandwidth using
the -bandwidth option.
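
A back-of-the-envelope sketch of points 2 and 3, using the 2 TB total mentioned earlier in this thread. It computes the per-map byte target (total bytes / numSplits) and the shortest time a map could take under the bandwidth throttle. The class and method names are invented for this example, and the estimate assumes the data divides evenly across maps and that the throttle is the only limit, which will rarely hold exactly.

```java
/** Hypothetical estimator for distcp per-map load and throttle-bound run time. */
final class DistcpEstimate {
    static final long MB = 1024L * 1024L;

    /** Target bytes per map: total bytes across all files divided by the number of splits. */
    static long bytesPerMap(long totalBytes, int numSplits) {
        if (numSplits < 1) {
            throw new IllegalArgumentException("numSplits must be >= 1");
        }
        return totalBytes / numSplits;
    }

    /** Lower bound, in seconds, on one map's run time when throttled to bandwidthMbPerSec. */
    static long minSecondsPerMap(long bytesForMap, int bandwidthMbPerSec) {
        if (bandwidthMbPerSec < 1) {
            throw new IllegalArgumentException("bandwidthMbPerSec must be >= 1");
        }
        long bytesPerSec = bandwidthMbPerSec * MB;
        return (bytesForMap + bytesPerSec - 1) / bytesPerSec; // ceiling division
    }

    public static void main(String[] args) {
        long totalBytes = 2L * 1024 * 1024 * MB; // 2 TB
        long perMap = bytesPerMap(totalBytes, 20);
        System.out.println((perMap / MB) + " MB per map with 20 maps");
        System.out.println(minSecondsPerMap(perMap, 100) + " s minimum per map at 100 MB/s");
    }
}
```

If the network or disks cannot sustain the throttle value, the real time will be longer; raising -m only helps while there are enough files to spread across the extra maps.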

On Fri, Oct 17, 2014 at 10:24 PM, Shivram Mani sm...@pivotal.io wrote:

 Distcp is pretty restrictive w.r.t parallelizing data copy. If all that
 you are doing is one large file, distcp wouldn't make this any faster.

 In distcp, files are the lowest level of granularity. So increasing # of
 maps, may not necessarily increase the overall throughput.

 The default number of mappers if i’m not wrong is 20 for distcp. If all
 you were doing was to copy a large file, only one map task is effectively
 used





-- 
Thanks
Shivram