Re: HADOOP_MAPRED_HOME not found!

2014-03-28 Thread Azuryy Yu
It is defined in hadoop-config.sh.
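For example, to see where it is picked up (the libexec/ path below assumes a
2.x tarball layout with HADOOP_HOME pointing at the untarred release; 1.x keeps
the script under bin/):

  grep -n HADOOP_MAPRED_HOME $HADOOP_HOME/libexec/hadoop-config.sh $HADOOP_HOME/bin/hadoop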



On Fri, Mar 28, 2014 at 1:19 PM, divye sheth divs.sh...@gmail.com wrote:

 Which version of Hadoop are you using? AFAIK the Hadoop mapred home is the
 directory where Hadoop is installed, in other words where the tarball is untarred.

 Thanks
 Divye Sheth
 On Mar 28, 2014 10:43 AM, Avinash Kujur avin...@gmail.com wrote:

  Hi,

  When I am trying to execute this command:
  hadoop job -history ~/1
  it gives an error like:
  DEPRECATED: Use of this script to execute mapred command is deprecated.
  Instead use the mapred command for it.

  HADOOP_MAPRED_HOME not found!

  Where can I get HADOOP_MAPRED_HOME from?

 thanks.




Re: HADOOP_MAPRED_HOME not found!

2014-03-28 Thread Rahul Singh
Try adding the Hadoop bin directory to the system PATH.


-Rahul Singh


On Fri, Mar 28, 2014 at 11:32 AM, Azuryy Yu azury...@gmail.com wrote:

 It is defined in hadoop-config.sh.








Re: HADOOP_MAPRED_HOME not found!

2014-03-28 Thread Avinash Kujur
Can we execute the above command anywhere, or do I need to execute it in any
particular directory?

thanks


On Thu, Mar 27, 2014 at 11:41 PM, divye sheth divs.sh...@gmail.com wrote:

 I believe you are using Hadoop 2. In order to get mapred working you
 need to set the HADOOP_MAPRED_HOME path in either your /etc/profile or
 .bashrc file, or you can use the command given below to temporarily set the
 variable.

 export HADOOP_MAPRED_HOME=$HADOOP_INSTALL

 $HADOOP_INSTALL is the location where the Hadoop tarball is extracted.

 This should work for you.

 Thanks
 Divye Sheth










Re: HADOOP_MAPRED_HOME not found!

2014-03-28 Thread divye sheth
You can execute this command on any machine where you have set
HADOOP_MAPRED_HOME.

Thanks
Divye Sheth


On Fri, Mar 28, 2014 at 12:31 PM, Avinash Kujur avin...@gmail.com wrote:

 Can we execute the above command anywhere, or do I need to execute it in
 any particular directory?

 thanks










Re: HADOOP_MAPRED_HOME not found!

2014-03-28 Thread Avinash Kujur
I am not getting where to set HADOOP_MAPRED_HOME, or how to set it.

thanks


On Fri, Mar 28, 2014 at 12:06 AM, divye sheth divs.sh...@gmail.com wrote:

 You can execute this command on any machine where you have set
 HADOOP_MAPRED_HOME.

 Thanks
 Divye Sheth











Re: How to get locations of blocks programmatically?

2014-03-28 Thread Harsh J
Yes, use 
http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.Path,
long, long)
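A rough sketch of using that call (the path is only a placeholder; imports of
java.io.IOException, java.util.Arrays, org.apache.hadoop.conf.Configuration and
org.apache.hadoop.fs.* are assumed):

  // prints offset, length and replica hosts for every block of the file
  static void printBlockLocations(String pathStr) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(pathStr);                 // e.g. "/user/libo/data.seq" (hypothetical)
    FileStatus status = fs.getFileStatus(path);
    for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println(loc.getOffset() + " len=" + loc.getLength()
          + " hosts=" + Arrays.toString(loc.getHosts()));
    }
  }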

On Fri, Mar 28, 2014 at 7:33 AM, Libo Yu yu_l...@hotmail.com wrote:
 Hi all,

 hadoop fsck <path> -files -blocks -locations can list locations for all
 blocks under the path.
 Is it possible to list all blocks and the block locations for a given path
 programmatically?
 Thanks,

 Libo



-- 
Harsh J


Re: mapred job -list error

2014-03-28 Thread Harsh J
Please also indicate your exact Hadoop version in use.

On Fri, Mar 28, 2014 at 9:04 AM, haihong lu ung3...@gmail.com wrote:
 Dear all,

 I had a problem today: when I executed the command mapred job
 -list on a slave, an error came out. The message is shown below:

 14/03/28 11:18:47 INFO Configuration.deprecation: session.id is deprecated.
 Instead, use dfs.metrics.session-id
 14/03/28 11:18:47 INFO jvm.JvmMetrics: Initializing JVM Metrics with
 processName=JobTracker, sessionId=
 Exception in thread "main" java.lang.NullPointerException
 at org.apache.hadoop.mapreduce.tools.CLI.listJobs(CLI.java:504)
 at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:312)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
 at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1237)

 When I executed the same command yesterday, it was OK.
 Thanks for any help.



-- 
Harsh J


Re: HADOOP_MAPRED_HOME not found!

2014-03-28 Thread divye sheth
Hi Avinash,

The export command can be executed on any one machine in the cluster for
now. Once you have executed the export command, i.e. export
HADOOP_MAPRED_HOME=/path/to/your/hadoop/installation, you can then execute
the mapred job -list command from that very same machine.

Thanks
Divye Sheth
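For reference, a typical way to make this permanent (the install path below is
only an example; adjust it to wherever your tarball is extracted):

  # in ~/.bashrc or /etc/profile, then: source ~/.bashrc
  export HADOOP_INSTALL=/opt/hadoop-2.2.0      # example location of the untarred release
  export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
  export PATH=$PATH:$HADOOP_INSTALL/bin

After that, mapred job -list should work from any new shell on that machine.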


On Fri, Mar 28, 2014 at 12:57 PM, Avinash Kujur avin...@gmail.com wrote:

 I am not getting where to set HADOOP_MAPRED_HOME, or how to set it.

 thanks






Re: Maps stuck on Pending

2014-03-28 Thread Dieter De Witte
There is a big chance that your map output is being copied to your
reducer; this could take quite some time if you have a lot of data and
could be resolved by:

1) having more reducers
2) adjusting the slowstart parameter so that the copying can start while the
map tasks are still running (see the property example below)
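For 2), the property is mapreduce.job.reduce.slowstart.completedmaps in MR2
(mapred.reduce.slowstart.completed.maps in MR1): lower values let the copy
phase start earlier, higher values hold the reducers back until more maps have
finished. For example, in mapred-site.xml:

  <property>
    <name>mapreduce.job.reduce.slowstart.completedmaps</name>
    <value>0.50</value>
  </property>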

Regards, Dieter


2014-03-27 20:42 GMT+01:00 Clay McDonald stuart.mcdon...@bateswhite.com:

 Thanks Serge, looks like I need to add memory to my datanodes.

 Clay McDonald
 Cell: 202.560.4101
 Direct: 202.747.5962

 -Original Message-
 From: Serge Blazhievsky [mailto:hadoop...@gmail.com]
 Sent: Thursday, March 27, 2014 2:16 PM
 To: user@hadoop.apache.org
 Cc: user@hadoop.apache.org
 Subject: Re: Maps stuck on Pending

 Next step would be to look in the logs under userlog directory for that job

 Sent from my iPhone

  On Mar 27, 2014, at 11:08 AM, Clay McDonald 
 stuart.mcdon...@bateswhite.com wrote:
 
  Hi all, I have a job running with 1750 maps and 1 reduce and the status
 has been the same for the last two hours. Any thoughts?
 
  Thanks, Clay



when it's safe to read map-reduce result?

2014-03-28 Thread Li Li
I have a program that runs a map-reduce job and then reads the result
of the job.
I learned that HDFS is not strongly consistent. When is it safe to read the result?
As soon as output/_SUCCESS exists?


Re: when it's safe to read map-reduce result?

2014-03-28 Thread Dieter De Witte
_SUCCESS implies that the job has successfully terminated, so this seems like
a reasonable criterion.

Regards, Dieter
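A minimal sketch of that check on the client side (assumes the default
FileOutputCommitter, which writes the _SUCCESS marker, plus imports of
java.io.IOException and org.apache.hadoop.fs.FileSystem / Path):

  // true once the job has committed its output; part-* files are then complete
  static boolean outputReady(FileSystem fs, Path outputDir) throws IOException {
    return fs.exists(new Path(outputDir, "_SUCCESS"));
  }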


2014-03-28 9:33 GMT+01:00 Li Li fancye...@gmail.com:

 I have a program that do some map-reduce job and then read the result
 of the job.
 I learned that hdfs is not strong consistent. when it's safe to read the
 result?
 as long as output/_SUCCESS exist?



Re: when it's safe to read map-reduce result?

2014-03-28 Thread Li Li
Thanks. Is the following code safe?
int exitCode=ToolRunner.run()
if(exitCode==0){
   //safe to read result
}

On Fri, Mar 28, 2014 at 4:36 PM, Dieter De Witte drdwi...@gmail.com wrote:
 _SUCCESS implies that the job has successfully terminated, so this seems like
 a reasonable criterion.

 Regards, Dieter


 2014-03-28 9:33 GMT+01:00 Li Li fancye...@gmail.com:

 I have a program that do some map-reduce job and then read the result
 of the job.
 I learned that hdfs is not strong consistent. when it's safe to read the
 result?
 as long as output/_SUCCESS exist?




How to run data node block scanner on data node in a cluster from a remote machine?

2014-03-28 Thread reena upadhyay


How to run the data node block scanner on a data node in a cluster from a
remote machine?

By default the data node runs the block scanner every 504 hours. This is the
default value of dfs.datanode.scan.period.hours. If I want to run the data
node block scanner, one way is to configure this property in hdfs-site.xml,
but is there any other way? Is it possible to run the data node block scanner
on a data node either through a command or programmatically?


  

Does hadoop depends on ecc memory to generate checksum for data stored in HDFS

2014-03-28 Thread reena upadhyay
To ensure data I/O integrity, Hadoop uses a CRC-32 mechanism to generate
checksums for the data stored on HDFS. But suppose I have a data node machine
that does not have ECC (error-correcting code) memory: will HDFS still be able
to generate checksums for data blocks when reads and writes happen?

Or, in simple words, does Hadoop depend on ECC memory to generate checksums for
data stored in HDFS?


  

Re: How to run data node block scanner on data node in a cluster from a remote machine?

2014-03-28 Thread Harsh J
Hello Reena,

No there isn't a programmatic way to invoke the block scanner. Note
though that the property to control its period is DN-local, so you can
change it on DNs and do a DN rolling restart to make it take effect
without requiring a HDFS downtime.
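For reference, that property is dfs.datanode.scan.period.hours, set in
hdfs-site.xml on each DataNode (504 hours, i.e. three weeks, is the usual
default):

  <property>
    <name>dfs.datanode.scan.period.hours</name>
    <value>504</value>
  </property>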

On Fri, Mar 28, 2014 at 3:07 PM, reena upadhyay reena2...@outlook.com wrote:
 How to run data node block scanner on data node in a cluster from a remote
 machine?
 By default data node executes block scanner in 504 hours. This is the
 default value of dfs.datanode.scan.period . If I want to run the data node
 block scanner then one way is to configure the property of
 dfs.datanode.scan.period in hdfs-site.xml but is there any other other way.
 Is it possible to run data node block scanner on data node either through
 command or pragmatically.



-- 
Harsh J


Re: Does hadoop depends on ecc memory to generate checksum for data stored in HDFS

2014-03-28 Thread Harsh J
While the HDFS functionality of computing, storing and validating
checksums for block files does not specifically _require_ ECC, you do
_want_ ECC to avoid frequent checksum failures.

This is noted in Tom's book as well, in the chapter that discusses
setting up your own cluster:
"ECC memory is strongly recommended, as several Hadoop users have
reported seeing many checksum errors when using non-ECC memory on
Hadoop clusters."

On Fri, Mar 28, 2014 at 3:15 PM, reena upadhyay reena2...@outlook.com wrote:
 To ensure data I/O integrity,  hadoop uses CRC 32 mechanism  to generate
 checksum for the data stored on hdfs . But suppose I have a data node
 machine that does not have ecc(error correcting code) type of memory, So
 will hadoop hdfs will be able to generate checksum for data blocks when
 read/write will happen in hdfs?

 Or In simple words, Does hadoop depends on ecc memory to generate checksum
 for data stored in HDFS?





-- 
Harsh J


How check sum are generated for blocks in data node

2014-03-28 Thread reena upadhyay
I was going through this link:
http://stackoverflow.com/questions/9406477/data-integrity-in-hdfs-which-data-nodes-verifies-the-checksum
It's written there that in recent versions of Hadoop only the last data node
verifies the checksum, as the write happens in a pipeline fashion.
Now I have a question:
Assuming my cluster has two data nodes A and B, I have a file, half of the
file content is written on the first data node A and the other remaining half
is written on the second data node B to take advantage of parallelism. My
question is: will data node A not store the checksums for the blocks stored
on it?

As per the line "only the last data node verifies the checksum", it looks like
only the last data node, in my case data node B, will generate the checksum.
But if only data node B generates checksums, then it will generate checksums
only for the blocks stored on data node B. What about the checksums for the
data blocks on data node machine A?
  

how to be assignee ?

2014-03-28 Thread Avinash Kujur
hi,

how can I become the assignee for a particular issue?
I can't see any option for becoming the assignee on the page.

Thanks.


Re: YarnException: Unauthorized request to start container. This token is expired.

2014-03-28 Thread Leibnitz
no doubt

Sent from my iPhone 6

 On Mar 23, 2014, at 17:37, Fengyun RAO raofeng...@gmail.com wrote:
 
 What does this exception mean? I googled a lot, all the results tell me it's 
 because the time is not synchronized between datanode and namenode.
 However, I checked all the servers: the ntpd service is on, and the time
 differences are less than 1 second.
 What's more, the tasks are not always failing on certain datanodes. 
 It fails and then it restarts and succeeds. If it were the time problem, I 
 guess it would always fail.
 
 My hadoop version is CDH5 beta. Below is the detailed log:
 
 14/03/23 14:57:06 INFO mapreduce.Job: Running job: job_1394434496930_0032
 14/03/23 14:57:17 INFO mapreduce.Job: Job job_1394434496930_0032 running in 
 uber mode : false
 14/03/23 14:57:17 INFO mapreduce.Job:  map 0% reduce 0%
 14/03/23 15:08:01 INFO mapreduce.Job: Task Id : 
 attempt_1394434496930_0032_m_34_0, Status : FAILED
 Container launch failed for container_1394434496930_0032_01_41 : 
 org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to 
 start container.
 This token is expired. current time is 1395558481146 found 1395558443384
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
 org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
at 
 org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at 
 org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:155)
at 
 org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:370)
at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
 
 14/03/23 15:08:02 INFO mapreduce.Job:  map 1% reduce 0%
 14/03/23 15:09:36 INFO mapreduce.Job: Task Id : 
 attempt_1394434496930_0032_m_36_0, Status : FAILED
 Container launch failed for container_1394434496930_0032_01_38 : 
 org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to 
 start container.
 This token is expired. current time is 1395558575889 found 1395558443245
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
 org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
at 
 org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at 
 org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:155)
at 
 org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:370)
at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
 


Replication HDFS

2014-03-28 Thread Victor Belizário
Hey,
I looked into HDFS replication in a master x slave filesystem setup.
Is there any way to do master x master?
I just have 1 TB of files on a server and I want to replicate them to another
server, with real-time sync.
Thanks!

Hadoop documentation: control flow and FSM diagrams

2014-03-28 Thread Emilio Coppa
Hi All,

I have created a wiki on github:

https://github.com/ercoppa/HadoopDiagrams/wiki

This is an effort to provide an updated documentation of how the internals
of Hadoop work.  The main idea is to help the user understand the big
picture without removing too much internal details. You can find several
diagrams (e.g. Finite State Machine and control flow). They are based on
Hadoop 2.3.0.

Notice that:

- they are not specified in any formal language (e.g., UML) but they should be
easy to understand (Do you agree?)
- they cover only some aspects of Hadoop but I am improving them day after
day
- they are not always correct but I am trying to fix errors,
remove ambiguities, etc

I hope this can be helpful to somebody out there. Any feedback from you may
be valuable for me.

Emilio.


RE: R on hadoop

2014-03-28 Thread Martin, Nick
If you're spitballing options, you might also look at Pattern:
http://www.cascading.org/projects/pattern/

Has some nuances so be sure to spend the time to vet your specific use case 
(i.e. what you’re actually doing in R and what you want to accomplish 
leveraging data in Hadoop).

From: Sri [mailto:hadoop...@gmail.com]
Sent: Thursday, March 27, 2014 2:51 AM
To: user@hadoop.apache.org
Cc: user@hadoop.apache.org
Subject: Re: R on hadoop

Try open-source h2o.ai - a CRAN-style package that allows fast, scalable R on
Hadoop in-memory.
One can invoke single-threaded R from the h2o package, and the runtime on clusters
is Java (not R!) - so you get better memory management.

http://docs.0xdata.com/deployment/hadoop.html

http://docs.0xdata.com/Ruser/Rpackage.html


Sri

On Mar 26, 2014, at 6:53, Saravanan Nagarajan saravanan.nagarajan...@gmail.com
wrote:
HI Jay,

Below is my understanding of Hadoop+R environment.

1. R contains many data mining algorithms; to re-use these we have many tools like
RHIPE, RHadoop, etc.
2. These tools will convert the R algorithm and run it in Hadoop MapReduce using
RMR, but I am not sure whether it will work for all algorithms in R.

Please let me know if you have any other points.

Thanks,
Saravanan
linkedin.com/in/saravanan303



On Wed, Mar 26, 2014 at 5:35 PM, Jay Vyas jayunit...@gmail.com wrote:
Do you mean
(1) running mapreduce jobs from R ?

(2) Running R from a mapreduce job ?
Without much extra ceremony, for the latter, you could use either MapReduce 
streaming or pig to call a custom program, as long as R is installed on every 
node of the cluster itself


On Wed, Mar 26, 2014 at 6:39 AM, Saravanan Nagarajan
saravanan.nagarajan...@gmail.com wrote:
HI Siddharth,

You can try Big Data Analytics with R and Hadoop  Book, it gives many options 
and detailed steps to integrate Hadoop and R.

If you need this book then mail me at saravanan.nagarajan...@gmail.com.

Thanks,
Saravanan
linkedin.com/in/saravanan303






On Tue, Mar 25, 2014 at 2:04 AM, Jagat Singh jagatsi...@gmail.com wrote:
Hi,
Please see RHadoop and RMR

https://www.google.com.au/search?q=rhadoop+installation
Thanks,
Jagat Singh

On Tue, Mar 25, 2014 at 7:19 AM, Siddharth Tiwari siddharth.tiw...@live.com wrote:
Hi team, any documentation around installing R on Hadoop?

Sent from my iPhone




--
Jay Vyas
http://jayunit100.blogspot.com



Re: Replication HDFS

2014-03-28 Thread Serge Blazhievsky
Do you mean replication between two different Hadoop clusters, or do you just need
data to be replicated between two different nodes?

Sent from my iPhone

 On Mar 28, 2014, at 8:10 AM, Victor Belizário victor_beliza...@hotmail.com 
 wrote:
 
 Hey,
 
 I did look in HDFS for replication in filesystem master x slave.
 
 Have any way to do master x master?
 
 I just have 1 TB of files in a server and i want to replicate to another 
 server, in real time sync.
 
 Thanks !


Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?

2014-03-28 Thread Hardik Pandya
What is your compression format: gzip, LZO or Snappy?

for lzo final output

FileOutputFormat.setCompressOutput(conf, true);
FileOutputFormat.setOutputCompressorClass(conf, LzoCodec.class);

In addition, to make LZO splittable, you need to make a LZO index file.
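If you are not sure whether an input sequence file was written compressed, a
quick check along these lines should tell you (a rough fragment; it uses the
old SequenceFile.Reader constructor and assumes imports of java.io.IOException,
org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.* and
org.apache.hadoop.io.SequenceFile):

  static void describeSeqFile(Configuration conf, Path p) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, p, conf);
    try {
      // reports whether the file is record- or block-compressed
      System.out.println("compressed=" + reader.isCompressed()
          + " blockCompressed=" + reader.isBlockCompressed());
    } finally {
      reader.close();
    }
  }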


On Thu, Mar 27, 2014 at 8:57 PM, Kim Chew kchew...@gmail.com wrote:

 Thanks folks.

 I was not aware my input data file had been compressed.
 FileOutputFormat.setCompressOutput() was set to true when the file was
 written. 8-(

 Kim


 On Thu, Mar 27, 2014 at 5:46 PM, Mostafa Ead mostafa.g@gmail.comwrote:

 The following might answer you partially:

 Input key is not read from HDFS, it is auto generated as the offset of
 the input value in the input file. I think that is (partially) why read
 hdfs bytes is smaller than written hdfs bytes.
  On Mar 27, 2014 1:34 PM, Kim Chew kchew...@gmail.com wrote:

 I am also wondering if, say, I have two identical timestamps so they are
 going to be written to the same file. Does MultipleOutputs handle appending?

 Thanks.

 Kim


 On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen t...@bentzn.com wrote:

 Have you checked the content of the files you write?


 /th

 On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
  I have a simple M/R job using Mapper only thus no reducer. The mapper
  read a timestamp from the value, generate a path to the output file
  and writes the key and value to the output file.
 
 
  The input file is a sequence file, not compressed and stored in the
  HDFS, it has a size of 162.68 MB.
 
 
  Output also is written as a sequence file.
 
 
 
  However, after I ran my job, I have two output part files from the
  mapper. One has a size of 835.12 MB and the other has a size of 224.77
  MB. So why is the total outputs size is so much larger? Shouldn't it
  be more or less equal to the input's size of 162.68MB since I just
  write the key and value passed to mapper to the output?
 
 
  Here is the mapper code snippet,
 
  public void map(BytesWritable key, BytesWritable value, Context
  context) throws IOException, InterruptedException {
 
  long timestamp = bytesToInt(value.getBytes(),
  TIMESTAMP_INDEX);;
  String tsStr = sdf.format(new Date(timestamp * 1000L));
 
  mos.write(key, value, generateFileName(tsStr)); // mos is a
  MultipleOutputs object.
  }
 
   private String generateFileName(String key) {
   return outputDir + "/" + key + "/raw-vectors";
   }
 
 
  And here are the job outputs,
 
  14/03/27 11:00:56 INFO mapred.JobClient: Launched map tasks=2
  14/03/27 11:00:56 INFO mapred.JobClient: Data-local map tasks=2
  14/03/27 11:00:56 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
  14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format
  Counters
  14/03/27 11:00:56 INFO mapred.JobClient: Bytes Written=0
  14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
  14/03/27 11:00:56 INFO mapred.JobClient: HDFS_BYTES_READ=171086386
  14/03/27 11:00:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=54272
  14/03/27 11:00:56 INFO mapred.JobClient:
  HDFS_BYTES_WRITTEN=374798
  14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters
  14/03/27 11:00:56 INFO mapred.JobClient: Bytes Read=170782415
  14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
  14/03/27 11:00:56 INFO mapred.JobClient: Map input records=547
  14/03/27 11:00:56 INFO mapred.JobClient: Physical memory (bytes)
  snapshot=166428672
  14/03/27 11:00:56 INFO mapred.JobClient: Spilled Records=0
  14/03/27 11:00:56 INFO mapred.JobClient: Total committed heap
  usage (bytes)=38351872
  14/03/27 11:00:56 INFO mapred.JobClient: CPU time spent (ms)=20080
  14/03/27 11:00:56 INFO mapred.JobClient: Virtual memory (bytes)
  snapshot=1240104960
  14/03/27 11:00:56 INFO mapred.JobClient: SPLIT_RAW_BYTES=286
  14/03/27 11:00:56 INFO mapred.JobClient: Map output records=0
 
 
  TIA,
 
 
  Kim
 







Re: How to get locations of blocks programmatically?

2014-03-28 Thread Hardik Pandya
have you looked into the FileSystem API? This is Hadoop v2.2.0:

http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html

(it does not exist in
http://hadoop.apache.org/docs/r1.2.0/api/org/apache/hadoop/fs/FileSystem.html)

RemoteIterator<LocatedFileStatus> listFiles(Path f, boolean recursive)
  List the statuses and block locations of the files in the given path.

RemoteIterator<LocatedFileStatus> listLocatedStatus(Path f)
  List the statuses of the files/directories in the given path if the path is
  a directory.
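A short sketch of the listFiles() variant (the directory is whatever path you
pass in; imports of java.io.IOException, java.util.Arrays and
org.apache.hadoop.fs.* are assumed):

  static void listBlocks(FileSystem fs, Path dir) throws IOException {
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);  // recursive
    while (it.hasNext()) {
      LocatedFileStatus f = it.next();
      // block locations come back with the status, so no extra call per file
      for (BlockLocation loc : f.getBlockLocations()) {
        System.out.println(f.getPath() + " @" + loc.getOffset()
            + " on " + Arrays.toString(loc.getHosts()));
      }
    }
  }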


On Thu, Mar 27, 2014 at 10:03 PM, Libo Yu yu_l...@hotmail.com wrote:

 Hi all,

 hadoop fsck <path> -files -blocks -locations can list locations for all
 blocks under the path.
 Is it possible to list all blocks and the block locations for a given path
 programmatically?
 Thanks,

 Libo



Re: reducing HDFS FS connection timeouts

2014-03-28 Thread Hardik Pandya
How about adding ipc.client.connect.max.retries.on.timeouts = 2 (the default
is 45)? It indicates the number of retries a client will make on socket
timeout to establish a server connection.

Does that help?
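A rough sketch of setting those keys on the client Configuration before
FileSystem.get() (values are illustrative, the NameNode URI is a placeholder,
and whether they shorten the failure path depends on the Hadoop version;
imports of java.net.URI, org.apache.hadoop.conf.Configuration and
org.apache.hadoop.fs.FileSystem are assumed):

  Configuration conf = new Configuration();
  conf.setInt("ipc.client.connect.max.retries.on.timeouts", 2);  // default 45
  conf.setInt("ipc.client.connect.max.retries", 2);              // default 10
  conf.setInt("ipc.client.connect.timeout", 5000);               // ms per connect attempt
  FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);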


On Thu, Mar 27, 2014 at 4:23 PM, John Lilley john.lil...@redpoint.netwrote:

  It seems to take a very long time to timeout a connection to an invalid
 NN URI.  Our application is interactive so the defaults of taking many
 minutes don't work well.  I've tried setting:

 conf.set("ipc.client.connect.max.retries", "2");

 conf.set("ipc.client.connect.timeout", "7000");

 before calling FileSystem.get() but it doesn't seem to matter.

 What is the prescribed technique for lowering connection timeout to HDFS?

 Thanks

 john





Re: how to be assignee ?

2014-03-28 Thread Azuryy Yu
Hi Avinash,

You need to be added as a sub-project contributor; then you can be an
assignee. You can find how to become a contributor on the wiki.


On Fri, Mar 28, 2014 at 6:50 PM, Avinash Kujur avin...@gmail.com wrote:

 hi,

 how can I become the assignee for a particular issue?
 I can't see any option for becoming the assignee on the page.

 Thanks.



Re: Hadoop documentation: control flow and FSM diagrams

2014-03-28 Thread Hardik Pandya
Very helpful indeed Emilio, thanks!


On Fri, Mar 28, 2014 at 12:58 PM, Emilio Coppa erco...@gmail.com wrote:

 Hi All,

 I have created a wiki on github:

 https://github.com/ercoppa/HadoopDiagrams/wiki

 This is an effort to provide an updated documentation of how the internals
 of Hadoop work.  The main idea is to help the user understand the big
 picture without removing too much internal details. You can find several
 diagrams (e.g. Finite State Machine and control flow). They are based on
 Hadoop 2.3.0.

 Notice that:

 - they are not specified in any formal language (e.g., UML) but they
 should be easy to understand (Do you agree?)
 - they cover only some aspects of Hadoop but I am improving them day after
 day
 - they are not always correct but I am trying to fix errors,
 remove ambiguities, etc

 I hope this can be helpful to somebody out there. Any feedback from you
 may be valuable for me.

 Emilio.



Re: when it's safe to read map-reduce result?

2014-03-28 Thread Hardik Pandya
If the job completes without any failures, exitCode should be 0 and it is
safe to read the result.

public class MyApp extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // Configuration processed by ToolRunner
    Configuration conf = getConf();

    // Create a JobConf using the processed conf
    JobConf job = new JobConf(conf, MyApp.class);

    // Process custom command-line options
    Path in = new Path(args[1]);
    Path out = new Path(args[2]);

    // Specify various job-specific parameters
    job.setJobName("my-app");
    FileInputFormat.setInputPaths(job, in);
    FileOutputFormat.setOutputPath(job, out);
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);

    // Submit the job, then poll for progress until the job is complete
    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // Let ToolRunner handle generic command-line options
    int res = ToolRunner.run(new Configuration(), new MyApp(), args);

    System.exit(res);
  }
}



On Fri, Mar 28, 2014 at 4:41 AM, Li Li fancye...@gmail.com wrote:

 thanks. is the following codes safe?
 int exitCode=ToolRunner.run()
 if(exitCode==0){
//safe to read result
 }

 On Fri, Mar 28, 2014 at 4:36 PM, Dieter De Witte drdwi...@gmail.com
 wrote:
  _SUCCESS implies that the job has successfully terminated, so this seems
 like
  a reasonable criterion.
 
  Regards, Dieter
 
 
  2014-03-28 9:33 GMT+01:00 Li Li fancye...@gmail.com:
 
  I have a program that do some map-reduce job and then read the result
  of the job.
  I learned that hdfs is not strong consistent. when it's safe to read the
  result?
  as long as output/_SUCCESS exist?
 
 



Re: Replication HDFS

2014-03-28 Thread Wellington Chevreuil
Hi Victor,

if by replication you mean copying from one cluster to another, you can use the
distcp command.

Cheers.
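For example (the NameNode host names and paths below are placeholders; add
-update to re-run it incrementally):

  hadoop distcp hdfs://nn-a:8020/data hdfs://nn-b:8020/data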

On 28 Mar 2014, at 16:30, Serge Blazhievsky hadoop...@gmail.com wrote:

 You mean replication between two different hadoop cluster or you just need 
 data to be replicated between two different nodes? 
 
 Sent from my iPhone
 
 On Mar 28, 2014, at 8:10 AM, Victor Belizário victor_beliza...@hotmail.com 
 wrote:
 
 Hey,
 
 I did look in HDFS for replication in filesystem master x slave.
 
 Have any way to do master x master?
 
 I just have 1 TB of files in a server and i want to replicate to another 
 server, in real time sync.
 
 Thanks !



Re: How check sum are generated for blocks in data node

2014-03-28 Thread Wellington Chevreuil
Hi Reena,

the pipeline is per block. If you have half of your file in data node A only, 
that means the pipeline had only one node (node A, in this case, probably 
because replication factor is set to 1) and then, data node A has the checksums 
for its block. The same applies to data node B.  

All nodes will have checksums for the blocks they own. Checksums are passed
along with the block as it goes through the pipeline, but since the last node
in the pipeline receives the original checksums along with the block from the
previous nodes, the validation only needs to be done on that last one:
if it passes there, it means the block was not corrupted on any of the
previous nodes either.

Cheers.

On 28 Mar 2014, at 10:28, reena upadhyay reena2...@outlook.com wrote:

 I was going through this link 
 http://stackoverflow.com/questions/9406477/data-integrity-in-hdfs-which-data-nodes-verifies-the-checksum
  . Its written that in recent version of hadoop only the last data node 
 verifies the checksum as the write happens in a pipeline fashion. 
 Now I have a question:
 Assuming my cluster has two data nodes A and B cluster, I have a file, half 
 of the file content is written on first data node A and the other remaining 
 half is written on the second data node B to take advantage of parallelism.  
 My question is:  Will data node A will not store the check sum for the blocks 
 stored on it. 
 
 As per the line only the last data node verifies the checksum, it looks 
 like only the  last data node in my case it will be data node B, will 
 generate the checksum. But if only data node B generates checksum, then it 
 will generate the check sum only for the blocks stored on data node B. What 
 about the checksum for the data blocks on data node  machine A?



How to find generated mapreduce code for pig/hive query

2014-03-28 Thread Spark Storm
hello experts,

I am really new to Hadoop - is it possible, for a given Pig or Hive query, to
find out the underlying map-reduce plan that runs under the hood?

thanks


Re: How to find generated mapreduce code for pig/hive query

2014-03-28 Thread Shahab Yunus
You can use the ILLUSTRATE and EXPLAIN commands to see the execution plan, if
that is what you mean by 'under the hood algorithm':

http://pig.apache.org/docs/r0.11.1/test.html

Regards,
Shahab


On Fri, Mar 28, 2014 at 5:51 PM, Spark Storm using.had...@gmail.com wrote:

 hello experts,

 am really new to hadoop - Is it possible to find out based on pig or hive
 query to find out under the hood map reduce algorithm??

 thanks



Re: Need help get the hadoop cluster started in EC2

2014-03-28 Thread Yusaku Sako
Hi Max,

Not sure if you have already, but you might also want to look into
Apache Ambari [1] for provisioning, managing, and monitoring Hadoop
clusters.
Many have successfully deployed Hadoop clusters on EC2 using Ambari.

[1] http://ambari.apache.org/

Yusaku

On Fri, Mar 28, 2014 at 7:07 PM, Max Zhao gz123forhad...@gmail.com wrote:
 Hi Everybody,

 I am trying to get my first hadoop cluster started using the Amazon EC2. I
 tried quite a few times and searched the web for the solutions, yet I still
 cannot get it up. I hope somebody can help out here.

 Here is what I did based on the Apache Whirr Quick Guide
 (http://whirr.apache.org/docs/0.8.1/quick-start-guide.html):

 1) I downloaded a Whirr tar ball and installed it.
 bin/whirr version shows the following: Apache Whirr 0.8.2, jclouds 1.5.8
 2) I created the ./whirr directory and edit the credential file with my
 Amazon PROVIDER, IDENTITY and CREDENTIAL
IDENTITY=AAS, with no extra quotes or curly
 quotes around the actual key_id
 3) I used the following command to create the key pair for whirr and stored
 it in the .ssh folder:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
 4) I think I am ready to use one of the properties file provided with whirr
 in the recipes folder. Here is the command I ran:
 bin/whirr launch-cluster --config
 recipes/hadoop-yarn-ec2.properties --private-key-file ~/.ssh/id_rsa_whi
 The command ran into an error and did not bring up Hadoop.  My question
 is: Do we need to change anything in the default properties provided in the
 recipes folder in the whirr-0.8.2 folder, such as the
 hadoop-yarn-ec2.properties I used?

 Here are the error messages:

 ---
 [ec2-user@ip-172-31-20-120 whirr-0.8.2]$ bin/whirr launch-cluster --config
 recipes/hadoop-yarn-ec2.properties  --private-key-file ~/.ssh/id_rsa_whirr
 Running on provider aws-ec2 using identity AKIAJLFVRARQ3IZE3KGF
 Unable to start the cluster. Terminating all nodes.
 com.google.common.util.concurrent.UncheckedExecutionException:
 com.google.inject.CreationException: Guice creation errors:
 1) org.jclouds.rest.RestContext<org.jclouds.aws.ec2.AWSEC2Client, A> cannot
 be used as a key; It is not fully specified.
 1 error
 at
 com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2258)
 at com.google.common.cache.LocalCache.get(LocalCache.java:3990)
 at
 com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3994)
 at
 com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4878)
 at
 com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4884)
 at org.apache.whirr.service.ComputeCache.apply(ComputeCache.java:88)
 at org.apache.whirr.service.ComputeCache.apply(ComputeCache.java:80)
 at
 org.apache.whirr.actions.ScriptBasedClusterAction.execute(ScriptBasedClusterAction.java:110)
 at
 org.apache.whirr.ClusterController.bootstrapCluster(ClusterController.java:137)
 at
 org.apache.whirr.ClusterController.launchCluster(ClusterController.java:113)
 at
 org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:69)
 at
 org.apache.whirr.cli.command.LaunchClusterCommand.run(LaunchClusterCommand.java:59)
 at org.apache.whirr.cli.Main.run(Main.java:69)
 at org.apache.whirr.cli.Main.main(Main.java:102)
 Caused by: com.google.inject.CreationException: Guice creation errors:
 1) org.jclouds.rest.RestContext<org.jclouds.aws.ec2.AWSEC2Client, A> cannot
 be used as a key; It is not fully specified.
 1 error
 at
 com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:435)
 at
 com.google.inject.internal.InternalInjectorCreator.initializeStatically(InternalInjectorCreator.java:154)
 at
 com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:106)
 at com.google.inject.Guice.createInjector(Guice.java:95)
 at org.jclouds.ContextBuilder.buildInjector(ContextBuilder.java:401)
 at org.jclouds.ContextBuilder.buildInjector(ContextBuilder.java:325)
 at org.jclouds.ContextBuilder.buildView(ContextBuilder.java:600)
 at org.jclouds.ContextBuilder.buildView(ContextBuilder.java:580)
 at
 org.apache.whirr.service.ComputeCache$1.load(ComputeCache.java:119)
 at
 org.apache.whirr.service.ComputeCache$1.load(ComputeCache.java:98)
 at
 com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589)
 at
 com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2374)
 at
 com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2337)
 at
 com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2252)
 ... 13 more
 Unable to load cluster state, assuming it has no