Re: Data Locality and WebHDFS

2014-03-17 Thread Alejandro Abdelnur
I may have expressed myself wrong. You don't need to do any test to see how
locality works with files of multiple blocks. If you are accessing a file
of more than one block over webhdfs, you only have assured locality for the
first block of the file.

Thanks.


On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling rnowl...@gmail.com wrote:

 Thank you, Mingjiang and Alejandro.

 This is interesting.  Since we will use the data locality information for
 scheduling, we could hack this to get the data locality information, at
 least for the first block.  As Alejandro says, we'd have to test what
 happens for other data blocks -- e.g., what if, knowing the block sizes, we
 request the second or third block?

 Interesting food for thought!  I see some experiments in my future!

 Thanks!


 On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur t...@cloudera.comwrote:

 well, this is for the first block of the file, the rest of the file
 (blocks being local or not) are streamed out by the same datanode. for
 small files (one block) you'll get locality, for large files only the first
 block, and by chance if other blocks are local to that datanode.


 Alejandro
 (phone typing)

 On Mar 16, 2014, at 18:53, Mingjiang Shi m...@gopivotal.com wrote:

 According to this page:
 http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/

 *Data Locality*: The file read and file write calls are redirected to
 the corresponding datanodes. It uses the full bandwidth of the Hadoop
 cluster for streaming data.

 *A HDFS Built-in Component*: WebHDFS is a first class built-in
 component of HDFS. It runs inside Namenodes and Datanodes, therefore, it
 can use all HDFS functionalities. It is a part of HDFS - there are no
 additional servers to install


 So it looks like the data locality is built-into webhdfs, client will be
 redirected to the data node automatically.




 On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling rnowl...@gmail.com wrote:

 Hi all,

 I'm writing up a Google Summer of Code proposal to add HDFS support to
 Disco, an Erlang MapReduce framework.

 We're interested in using WebHDFS.  I have two questions:

 1) Does WebHDFS allow querying data locality information?

 2) If the data locality information is known, can data on specific data
 nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go
 through a single server?

 Thanks,
 RJ

 --
 em rnowl...@gmail.com
 c 954.496.2314




 --
 Cheers
 -MJ




 --
 em rnowl...@gmail.com
 c 954.496.2314




-- 
Alejandro
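
For scheduling purposes, the per-block host list discussed in this thread can also be obtained through the regular Java HDFS client rather than WebHDFS; a minimal sketch (the file path is an assumption):

    // Sketch: list which hosts hold each block of a file, using the standard
    // HDFS Java client (not WebHDFS). The path below is illustrative.
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocality {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/rj/big-file");        // assumed path
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println("offset=" + block.getOffset()
              + " length=" + block.getLength()
              + " hosts=" + Arrays.toString(block.getHosts()));
        }
      }
    }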


Re: class org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto overrides final method getUnknownFields

2014-03-17 Thread Margusja

Hi, thanks for your reply.

What I did:
[speech@h14 mahout]$ /usr/share/apache-maven/bin/mvn -DskipTests clean 
install -Dhadoop2.profile=hadoop2 - is hadoop2 the right string? I found it 
in the pom's profile section, so I used it.


...
it compiled:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Mahout Build Tools ................................ SUCCESS [  1.751 s]
[INFO] Apache Mahout ..................................... SUCCESS [  0.484 s]
[INFO] Mahout Math ....................................... SUCCESS [ 12.946 s]
[INFO] Mahout Core ....................................... SUCCESS [ 14.192 s]
[INFO] Mahout Integration ................................ SUCCESS [  1.857 s]
[INFO] Mahout Examples ................................... SUCCESS [ 10.762 s]
[INFO] Mahout Release Package ............................ SUCCESS [  0.012 s]
[INFO] Mahout Math/Scala wrappers ........................ SUCCESS [ 25.431 s]
[INFO] Mahout Spark bindings ............................. SUCCESS [ 40.376 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:48 min
[INFO] Finished at: 2014-03-17T12:06:31+02:00
[INFO] Final Memory: 79M/2947M
[INFO] ------------------------------------------------------------------------

How can I check whether the hadoop2 libs are actually in use?
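
One quick way to check which Hadoop libraries are actually picked up (a sketch; the wrapper class name is illustrative, but org.apache.hadoop.util.VersionInfo is a real Hadoop class) is to ask the JVM where it loaded them from and what version they report:

    // Sketch: report the Hadoop version on the classpath and the jar it came from.
    public class WhichHadoop {
      public static void main(String[] args) throws Exception {
        Class<?> c = Class.forName("org.apache.hadoop.util.VersionInfo");
        System.out.println("Loaded from: "
            + c.getProtectionDomain().getCodeSource().getLocation());
        System.out.println("Version: " + c.getMethod("getVersion").invoke(null));
      }
    }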

but unfortunately again:
[speech@h14 ~]$ mahout/bin/mahout seqdirectory -c UTF-8 -i 
/user/speech/demo -o demo-seqfiles

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/bin/hadoop and 
HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: 
/home/speech/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
14/03/17 12:07:21 INFO common.AbstractJob: Command line arguments: 
{--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], 
--fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], 
--input=[/user/speech/demo], --keyPrefix=[], --method=[mapreduce], 
--output=[demo-seqfiles], --startPhase=[0], --tempDir=[temp]}
14/03/17 12:07:22 INFO Configuration.deprecation: mapred.input.dir is 
deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/03/17 12:07:22 INFO Configuration.deprecation: 
mapred.compress.map.output is deprecated. Instead, use 
mapreduce.map.output.compress
14/03/17 12:07:22 INFO Configuration.deprecation: mapred.output.dir is 
deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/03/17 12:07:22 INFO Configuration.deprecation: session.id is 
deprecated. Instead, use dfs.metrics.session-id
14/03/17 12:07:22 INFO jvm.JvmMetrics: Initializing JVM Metrics with 
processName=JobTracker, sessionId=
14/03/17 12:07:23 INFO input.FileInputFormat: Total input paths to 
process : 10
14/03/17 12:07:23 INFO input.CombineFileInputFormat: DEBUG: Terminated 
node allocation with : CompletedNodes: 4, size left: 29775

14/03/17 12:07:23 INFO mapreduce.JobSubmitter: number of splits:1
14/03/17 12:07:23 INFO Configuration.deprecation: user.name is 
deprecated. Instead, use mapreduce.job.user.name
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.output.compress 
is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.jar is 
deprecated. Instead, use mapreduce.job.jar
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.reduce.tasks is 
deprecated. Instead, use mapreduce.job.reduces
14/03/17 12:07:23 INFO Configuration.deprecation: 
mapred.output.value.class is deprecated. Instead, use 
mapreduce.job.output.value.class
14/03/17 12:07:23 INFO Configuration.deprecation: 
mapred.mapoutput.value.class is deprecated. Instead, use 
mapreduce.map.output.value.class
14/03/17 12:07:23 INFO Configuration.deprecation: mapreduce.map.class is 
deprecated. Instead, use mapreduce.job.map.class
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.job.name is 
deprecated. Instead, use mapreduce.job.name
14/03/17 12:07:23 INFO Configuration.deprecation: 
mapreduce.inputformat.class is deprecated. Instead, use 
mapreduce.job.inputformat.class
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.max.split.size 
is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/03/17 12:07:23 INFO Configuration.deprecation: 
mapreduce.outputformat.class is deprecated. Instead, use 
mapreduce.job.outputformat.class
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.map.tasks is 
deprecated. Instead, use mapreduce.job.maps
14/03/17 12:07:23 INFO Configuration.deprecation: 
mapred.output.key.class is deprecated. Instead, use 
mapreduce.job.output.key.class
14/03/17 12:07:23 INFO Configuration.deprecation: 
mapred.mapoutput.key.class is deprecated. Instead, use 
mapreduce.map.output.key.class
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.working.dir is 
deprecated. Instead, 

Re: class org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto overrides final method getUnknownFields

2014-03-17 Thread Margusja

Okay, sorry for the mess.

[speech@h14 mahout]$ /usr/share/apache-maven/bin/mvn -DskipTests clean 
install -Dhadoop2.version=2.2.0 - did the trick


Best regards, Margus (Margusja) Roo
+372 51 48 780
http://margus.roo.ee
http://ee.linkedin.com/in/margusroo
skype: margusja
ldapsearch -x -h ldap.sk.ee -b c=EE (serialNumber=37303140314)
-BEGIN PUBLIC KEY-
MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCvbeg7LwEC2SCpAEewwpC3ajxE
5ZsRMCB77L8bae9G7TslgLkoIzo9yOjPdx2NN6DllKbV65UjTay43uUDyql9g3tl
RhiJIcoAExkSTykWqAIPR88LfilLy1JlQ+0RD8OXiWOVVQfhOHpQ0R/jcAkM2lZa
BjM8j36yJvoBVsfOHQIDAQAB
-END PUBLIC KEY-

On 17/03/14 12:16, Margusja wrote:

Hi, thanks for your reply.

What I did:
[speech@h14 mahout]$ /usr/share/apache-maven/bin/mvn -DskipTests clean 
install -Dhadoop2.profile=hadoop2 - is hadoop2 the right string? I found 
it in the pom's profile section, so I used it.


...
it compiled:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Mahout Build Tools ................................ SUCCESS [  1.751 s]
[INFO] Apache Mahout ..................................... SUCCESS [  0.484 s]
[INFO] Mahout Math ....................................... SUCCESS [ 12.946 s]
[INFO] Mahout Core ....................................... SUCCESS [ 14.192 s]
[INFO] Mahout Integration ................................ SUCCESS [  1.857 s]
[INFO] Mahout Examples ................................... SUCCESS [ 10.762 s]
[INFO] Mahout Release Package ............................ SUCCESS [  0.012 s]
[INFO] Mahout Math/Scala wrappers ........................ SUCCESS [ 25.431 s]
[INFO] Mahout Spark bindings ............................. SUCCESS [ 40.376 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:48 min
[INFO] Finished at: 2014-03-17T12:06:31+02:00
[INFO] Final Memory: 79M/2947M
[INFO] ------------------------------------------------------------------------

How can I check whether the hadoop2 libs are actually in use?

but unfortunately again:
[speech@h14 ~]$ mahout/bin/mahout seqdirectory -c UTF-8 -i 
/user/speech/demo -o demo-seqfiles

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/bin/hadoop and 
HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: 
/home/speech/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
14/03/17 12:07:21 INFO common.AbstractJob: Command line arguments: 
{--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], 
--fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], 
--input=[/user/speech/demo], --keyPrefix=[], --method=[mapreduce], 
--output=[demo-seqfiles], --startPhase=[0], --tempDir=[temp]}
14/03/17 12:07:22 INFO Configuration.deprecation: mapred.input.dir is 
deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/03/17 12:07:22 INFO Configuration.deprecation: 
mapred.compress.map.output is deprecated. Instead, use 
mapreduce.map.output.compress
14/03/17 12:07:22 INFO Configuration.deprecation: mapred.output.dir is 
deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/03/17 12:07:22 INFO Configuration.deprecation: session.id is 
deprecated. Instead, use dfs.metrics.session-id
14/03/17 12:07:22 INFO jvm.JvmMetrics: Initializing JVM Metrics with 
processName=JobTracker, sessionId=
14/03/17 12:07:23 INFO input.FileInputFormat: Total input paths to 
process : 10
14/03/17 12:07:23 INFO input.CombineFileInputFormat: DEBUG: Terminated 
node allocation with : CompletedNodes: 4, size left: 29775

14/03/17 12:07:23 INFO mapreduce.JobSubmitter: number of splits:1
14/03/17 12:07:23 INFO Configuration.deprecation: user.name is 
deprecated. Instead, use mapreduce.job.user.name
14/03/17 12:07:23 INFO Configuration.deprecation: 
mapred.output.compress is deprecated. Instead, use 
mapreduce.output.fileoutputformat.compress
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.jar is 
deprecated. Instead, use mapreduce.job.jar
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.reduce.tasks 
is deprecated. Instead, use mapreduce.job.reduces
14/03/17 12:07:23 INFO Configuration.deprecation: 
mapred.output.value.class is deprecated. Instead, use 
mapreduce.job.output.value.class
14/03/17 12:07:23 INFO Configuration.deprecation: 
mapred.mapoutput.value.class is deprecated. Instead, use 
mapreduce.map.output.value.class
14/03/17 12:07:23 INFO Configuration.deprecation: mapreduce.map.class 
is deprecated. Instead, use mapreduce.job.map.class
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.job.name is 
deprecated. Instead, use mapreduce.job.name
14/03/17 12:07:23 INFO Configuration.deprecation: 
mapreduce.inputformat.class is deprecated. Instead, use 
mapreduce.job.inputformat.class
14/03/17 12:07:23 INFO Configuration.deprecation: 
mapred.max.split.size is deprecated. Instead, use 

Benchmark Failure

2014-03-17 Thread Lixiang Ao
Hi all,

I'm running the jobclient tests (on a single node); other tests like TestDFSIO
and mrbench succeed, but nnbench fails.

I got a lot of exceptions, but without any explanation (see below).

Could anyone tell me what might have gone wrong?

Thanks!


14/03/17 23:54:22 INFO hdfs.NNBench: Waiting in barrier for: 112819 ms
14/03/17 23:54:23 INFO mapreduce.Job: Job job_local2133868569_0001 running
in uber mode : false
14/03/17 23:54:23 INFO mapreduce.Job:  map 0% reduce 0%
14/03/17 23:54:28 INFO mapred.LocalJobRunner: hdfs://
0.0.0.0:9000/benchmarks/NNBench-aolx-PC/control/NNBench_Controlfile_10:0+125
map
14/03/17 23:54:29 INFO mapreduce.Job:  map 6% reduce 0%
14/03/17 23:56:15 INFO hdfs.NNBench: Exception recorded in op:
Create/Write/Close
14/03/17 23:56:15 INFO hdfs.NNBench: Exception recorded in op:
Create/Write/Close
14/03/17 23:56:15 INFO hdfs.NNBench: Exception recorded in op:
Create/Write/Close
14/03/17 23:56:15 INFO hdfs.NNBench: Exception recorded in op:
Create/Write/Close
14/03/17 23:56:15 INFO hdfs.NNBench: Exception recorded in op:
Create/Write/Close
14/03/17 23:56:15 INFO hdfs.NNBench: Exception recorded in op:
Create/Write/Close
14/03/17 23:56:15 INFO hdfs.NNBench: Exception recorded in op:
Create/Write/Close
14/03/17 23:56:15 INFO hdfs.NNBench: Exception recorded in op:
Create/Write/Close
(1000 Exceptions)
.
.
.
results:

File System Counters
        FILE: Number of bytes read=18769411
        FILE: Number of bytes written=21398315
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=11185
        HDFS: Number of bytes written=19540
        HDFS: Number of read operations=325
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=13210
Map-Reduce Framework
        Map input records=12
        Map output records=95
        Map output bytes=1829
        Map output materialized bytes=2091
        Input split bytes=1538
        Combine input records=0
        Combine output records=0
        Reduce input groups=8
        Reduce shuffle bytes=0
        Reduce input records=95
        Reduce output records=8
        Spilled Records=214
        Shuffled Maps =0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=211
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=4401004544
File Input Format Counters
        Bytes Read=1490
File Output Format Counters
        Bytes Written=170
14/03/17 23:56:18 INFO hdfs.NNBench: -------------- NNBench -------------- :
14/03/17 23:56:18 INFO hdfs.NNBench: Version: NameNode Benchmark 0.4
14/03/17 23:56:18 INFO hdfs.NNBench: Date & time: 2014-03-17 23:56:18,619
14/03/17 23:56:18 INFO hdfs.NNBench:
14/03/17 23:56:18 INFO hdfs.NNBench: Test Operation: create_write
14/03/17 23:56:18 INFO hdfs.NNBench: Start time: 2014-03-17 23:56:15,521
14/03/17 23:56:18 INFO hdfs.NNBench: Maps to run: 12
14/03/17 23:56:18 INFO hdfs.NNBench: Reduces to run: 6
14/03/17 23:56:18 INFO hdfs.NNBench: Block Size (bytes): 1
14/03/17 23:56:18 INFO hdfs.NNBench: Bytes to write: 0
14/03/17 23:56:18 INFO hdfs.NNBench: Bytes per checksum: 1
14/03/17 23:56:18 INFO hdfs.NNBench: Number of files: 1000
14/03/17 23:56:18 INFO hdfs.NNBench: Replication factor: 3
14/03/17 23:56:18 INFO hdfs.NNBench: Successful file operations: 0
14/03/17 23:56:18 INFO hdfs.NNBench:
14/03/17 23:56:18 INFO hdfs.NNBench: # maps that missed the barrier: 11
14/03/17 23:56:18 INFO hdfs.NNBench: # exceptions: 1000
14/03/17 23:56:18 INFO hdfs.NNBench:
14/03/17 23:56:18 INFO hdfs.NNBench: TPS: Create/Write/Close: 0
14/03/17 23:56:18 INFO hdfs.NNBench: Avg exec time (ms): Create/Write/Close: Infinity
14/03/17 23:56:18 INFO hdfs.NNBench: Avg Lat (ms): Create/Write: NaN
14/03/17 23:56:18 INFO hdfs.NNBench: Avg Lat (ms): Close: NaN
14/03/17 23:56:18 INFO hdfs.NNBench:
14/03/17 23:56:18 INFO hdfs.NNBench: RAW DATA: AL Total #1: 0
14/03/17 23:56:18 INFO hdfs.NNBench: RAW DATA: AL Total #2: 0
14/03/17 23:56:18 INFO hdfs.NNBench: RAW DATA: TPS Total (ms): 1131
14/03/17 23:56:18 INFO hdfs.NNBench: RAW DATA: Longest Map Time (ms): 1.395071776653E12
14/03/17 23:56:18 INFO hdfs.NNBench: RAW DATA: Late maps: 11
14/03/17 23:56:18 INFO hdfs.NNBench: RAW DATA: # of exceptions: 1000
14/03/17 23:56:18 INFO hdfs.NNBench:


Re: Data Locality and WebHDFS

2014-03-17 Thread RJ Nowling
Hi Alejandro,

The WebHDFS API allows specifying an offset and length for the request.  If
I specify an offset that starts in the second block of a file (thus
skipping the first block altogether), will the namenode still direct me
to a datanode with the first block, or will it direct me to a datanode with
the second block?  I.e., am I assured data locality only on the first block
of the file (as you're saying), or on the first block I am accessing?

If it is as you say, then I may want to reach out to the WebHDFS developers
and see if they would be interested in the additional functionality.

Thank you,
RJ


On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur t...@cloudera.comwrote:

 I may have expressed myself wrong. You don't need to do any test to see
 how locality works with files of multiple blocks. If you are accessing a
 file of more than one block over webhdfs, you only have assured locality
 for the first block of the file.

 Thanks.


 On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling rnowl...@gmail.com wrote:

 Thank you, Mingjiang and Alejandro.

 This is interesting.  Since we will use the data locality information for
 scheduling, we could hack this to get the data locality information, at
 least for the first block.  As Alejandro says, we'd have to test what
 happens for other data blocks -- e.g., what if, knowing the block sizes, we
 request the second or third block?

 Interesting food for thought!  I see some experiments in my future!

 Thanks!


 On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur 
 t...@cloudera.comwrote:

 well, this is for the first block of the file, the rest of the file
 (blocks being local or not) are streamed out by the same datanode. for
 small files (one block) you'll get locality, for large files only the first
 block, and by chance if other blocks are local to that datanode.


 Alejandro
 (phone typing)

 On Mar 16, 2014, at 18:53, Mingjiang Shi m...@gopivotal.com wrote:

 According to this page:
 http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/

 *Data Locality*: The file read and file write calls are redirected to
 the corresponding datanodes. It uses the full bandwidth of the Hadoop
 cluster for streaming data.

 *A HDFS Built-in Component*: WebHDFS is a first class built-in
 component of HDFS. It runs inside Namenodes and Datanodes, therefore, it
 can use all HDFS functionalities. It is a part of HDFS - there are no
 additional servers to install


 So it looks like the data locality is built-into webhdfs, client will be
 redirected to the data node automatically.




 On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling rnowl...@gmail.com wrote:

 Hi all,

 I'm writing up a Google Summer of Code proposal to add HDFS support to
 Disco, an Erlang MapReduce framework.

 We're interested in using WebHDFS.  I have two questions:

 1) Does WebHDFS allow querying data locality information?

 2) If the data locality information is known, can data on specific data
 nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go
 through a single server?

 Thanks,
 RJ

 --
 em rnowl...@gmail.com
 c 954.496.2314




 --
 Cheers
 -MJ




 --
 em rnowl...@gmail.com
 c 954.496.2314




 --
 Alejandro




-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Data Locality and WebHDFS

2014-03-17 Thread Alejandro Abdelnur
Don't recall how skips are handled in webhdfs, but I would assume that you'll 
get to the first block as usual, and the skip is handled by the DN serving the 
file (as webhdfs does not know at open that you'll skip).

Alejandro
(phone typing)

 On Mar 17, 2014, at 9:47, RJ Nowling rnowl...@gmail.com wrote:
 
 Hi Alejandro,
 
 The WebHDFS API allows specifying an offset and length for the request.  If I 
 specify an offset that start in the second block for a file (thus skipping 
 the first block all together), will the namenode still direct me to a 
 datanode with the first block or will it direct me to a namenode with the 
 second block?  I.e., am I assured data locality only on the first block of 
 the file (as you're saying) or on the first block I am accessing?
 
 If it is as you say, then I may want to reach out the WebHDFS developers and 
 see if they would be interested in the additional functionality.
 
 Thank you,
 RJ
 
 
 On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur t...@cloudera.com 
 wrote:
 I may have expressed myself wrong. You don't need to do any test to see how 
 locality works with files of multiple blocks. If you are accessing a file of 
 more than one block over webhdfs, you only have assured locality for the 
 first block of the file.
 
 Thanks.
 
 
 On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling rnowl...@gmail.com wrote:
 Thank you, Mingjiang and Alejandro.
 
 This is interesting.  Since we will use the data locality information for 
 scheduling, we could hack this to get the data locality information, at 
 least for the first block.  As Alejandro says, we'd have to test what 
 happens for other data blocks -- e.g., what if, knowing the block sizes, we 
 request the second or third block?
 
 Interesting food for thought!  I see some experiments in my future!  
 
 Thanks!
 
 
 On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur t...@cloudera.com 
 wrote:
 well, this is for the first block of the file, the rest of the file 
 (blocks being local or not) are streamed out by the same datanode. for 
 small files (one block) you'll get locality, for large files only the 
 first block, and by chance if other blocks are local to that datanode. 
 
 
 Alejandro
 (phone typing)
 
 On Mar 16, 2014, at 18:53, Mingjiang Shi m...@gopivotal.com wrote:
 
 According to this page: 
 http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
 Data Locality: The file read and file write calls are redirected to the 
 corresponding datanodes. It uses the full bandwidth of the Hadoop 
 cluster for streaming data.
 
 A HDFS Built-in Component: WebHDFS is a first class built-in component 
 of HDFS. It runs inside Namenodes and Datanodes, therefore, it can use 
 all HDFS functionalities. It is a part of HDFS – there are no additional 
 servers to install
 
 
 So it looks like the data locality is built-into webhdfs, client will be 
 redirected to the data node automatically. 
 
 
 
 
 On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling rnowl...@gmail.com wrote:
 Hi all,
 
 I'm writing up a Google Summer of Code proposal to add HDFS support to 
 Disco, an Erlang MapReduce framework.  
 
 We're interested in using WebHDFS.  I have two questions:
 
 1) Does WebHDFS allow querying data locality information?
 
 2) If the data locality information is known, can data on specific data 
 nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go 
 through a single server?
 
 Thanks,
 RJ
 
 -- 
 em rnowl...@gmail.com
 c 954.496.2314
 
 
 
 -- 
 Cheers
 -MJ
 
 
 
 -- 
 em rnowl...@gmail.com
 c 954.496.2314
 
 
 
 -- 
 Alejandro
 
 
 
 -- 
 em rnowl...@gmail.com
 c 954.496.2314


Writing Bytes Directly to Distributed Cache?

2014-03-17 Thread Jonathan Miller
Hello,

I was wondering if anyone might know of a way to write bytes directly to the
distributed cache.  I know I can call job.addCacheFile(URI uri), but in my
case the file I wish to add to the cache is in memory and is job specific.
I would prefer not to write it to a location whose cleanup I then have to
manage after job execution simply to get it into the cache.  If I cannot
use the distributed cache to do this, is there another way I could
accomplish the same thing using m/r's cleanup logic?

Any pointers would be most appreciated.

Thanks! 
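
One workaround sometimes used for small, job-specific payloads (a sketch under that assumption, not something from this thread) is to skip the distributed cache and carry the bytes in the job Configuration itself, base64-encoded, so there is no file to clean up afterwards; the key name and helper class below are illustrative:

    // Sketch: embed a small in-memory payload in the job Configuration instead
    // of the distributed cache. Only suitable for small data, since the
    // Configuration is shipped to every task. Key name is illustrative.
    import org.apache.commons.codec.binary.Base64;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class InMemoryPayload {
      private static final String KEY = "example.job.payload";  // assumed key name

      // Driver side: stash the bytes in the configuration before submission.
      public static void attach(Job job, byte[] payload) {
        job.getConfiguration().set(KEY, Base64.encodeBase64String(payload));
      }

      // Task side: recover the bytes in a Mapper/Reducer setup() method.
      public static byte[] recover(Configuration conf) {
        return Base64.decodeBase64(conf.get(KEY));
      }
    }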




Re: Data Locality and WebHDFS

2014-03-17 Thread Tsz Wo Sze
The file offset is considered in WebHDFS redirection.  It redirects to a 
datanode with the first block the client is going to read, not the first block of 
the file.

Hope it helps.
Tsz-Wo
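
One way to observe this behaviour is to issue the WebHDFS OPEN call with an offset but without following the redirect, and look at which datanode the Location header points to. A minimal sketch, where the namenode address, file path, offset, and user name are assumptions:

    // Sketch: ask the NameNode's WebHDFS endpoint where it would redirect an
    // OPEN at a given offset, without following the redirect. The Location
    // header names the datanode chosen for the block containing that offset.
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsRedirectProbe {
      public static void main(String[] args) throws Exception {
        String nameNode = "http://namenode.example.com:50070";   // assumed NN HTTP address
        String path = "/user/rj/big-file";                       // assumed file
        long offset = 256L * 1024 * 1024;                        // somewhere past the first block

        URL url = new URL(nameNode + "/webhdfs/v1" + path
            + "?op=OPEN&offset=" + offset + "&user.name=rj");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(false);  // keep the 307 instead of following it
        conn.setRequestMethod("GET");

        System.out.println("HTTP " + conn.getResponseCode());
        System.out.println("Redirected to: " + conn.getHeaderField("Location"));
        conn.disconnect();
      }
    }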



On Monday, March 17, 2014 10:09 AM, Alejandro Abdelnur t...@cloudera.com 
wrote:
 
Actually, I am wrong; the webhdfs REST call has an offset.

Alejandro
(phone typing)

On Mar 17, 2014, at 10:07, Alejandro Abdelnur t...@cloudera.com wrote:


dont recall how skips are handled in webhdfs, but i would assume that you'll 
get to the first block As usual, and the skip is handled by the DN serving the 
file (as webhdfs doesnot know at open that you'll skip)

Alejandro
(phone typing)

On Mar 17, 2014, at 9:47, RJ Nowling rnowl...@gmail.com wrote:


Hi Alejandro,


The WebHDFS API allows specifying an offset and length for the request.  If I 
specify an offset that start in the second block for a file (thus skipping 
the first block all together), will the namenode still direct me to a 
datanode with the first block or will it direct me to a namenode with the 
second block?  I.e., am I assured data locality only on the first block of 
the file (as you're saying) or on the first block I am accessing?


If it is as you say, then I may want to reach out the WebHDFS developers and 
see if they would be interested in the additional functionality.


Thank you,
RJ



On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur t...@cloudera.com wrote:

I may have expressed myself wrong. You don't need to do any test to see how 
locality works with files of multiple blocks. If you are accessing a file of 
more than one block over webhdfs, you only have assured locality for the 
first block of the file.


Thanks.



On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling rnowl...@gmail.com wrote:

Thank you, Mingjiang and Alejandro.


This is interesting.  Since we will use the data locality information for 
scheduling, we could hack this to get the data locality information, at 
least for the first block.  As Alejandro says, we'd have to test what 
happens for other data blocks -- e.g., what if, knowing the block sizes, we 
request the second or third block?


Interesting food for thought!  I see some experiments in my future!  


Thanks!



On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur t...@cloudera.com 
wrote:

well, this is for the first block of the file, the rest of the file (blocks 
being local or not) are streamed out by the same datanode. for small files 
(one block) you'll get locality, for large files only the first block, and 
by chance if other blocks are local to that datanode. 



Alejandro
(phone typing)

On Mar 16, 2014, at 18:53, Mingjiang Shi m...@gopivotal.com wrote:


According to this page: 
http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/

Data Locality: The file read and file write calls 
are redirected to the corresponding datanodes. It uses the full 
bandwidth of the Hadoop cluster for streaming data.
A HDFS Built-in Component: WebHDFS is a first class 
built-in component of HDFS. It runs inside Namenodes and Datanodes, 
therefore, it can use all HDFS functionalities. It is a part of HDFS – 
there are no additional servers to install

So it looks like the data locality is built-into webhdfs, client will be 
redirected to the data node automatically. 






On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling rnowl...@gmail.com wrote:

Hi all,


I'm writing up a Google Summer of Code proposal to add HDFS support to 
Disco, an Erlang MapReduce framework.  


We're interested in using WebHDFS.  I have two questions:


1) Does WebHDFS allow querying data locality information?


2) If the data locality information is known, can data on specific data 
nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go 
through a single server?

Thanks,
RJ


-- 
em rnowl...@gmail.com
c 954.496.2314 


-- 

Cheers
-MJ




-- 
em rnowl...@gmail.com
c 954.496.2314 



-- 
Alejandro 



-- 
em rnowl...@gmail.com
c 954.496.2314 



issue with defaut yarn.application.classpath value from yarn-default.xml for hadoop-2.3.0

2014-03-17 Thread REYANE OUKPEDJO
Hi,

I believe there is an issue with the yarn-default.xml setting of 
yarn.application.classpath in hadoop-2.3.0. This parameter's default is not set, 
and I have an application that fails because of that. Below is part of the 
content of yarn-default.xml, which shows an empty value for that parameter. When 
I checked the same file in hadoop-2.2.0, I could see that a default value is set. 
Can anyone explain why this happens? Should I open a bug report, or was this 
intentional?


Thanks

Reyane OUKPEDJO

  <property>
    <description>$HADOOP_CONF_DIR,
      $HADOOP_COMMON_HOME/share/hadoop/common/*,
      $HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
      $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
      $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
      $HADOOP_YARN_HOME/share/hadoop/yarn/*,
      $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
      CLASSPATH for YARN applications. A comma-separated list
      of CLASSPATH entries. When this value is empty, the following default
      CLASSPATH for YARN applications would be used.
      For Linux:
      $HADOOP_CONF_DIR,
      $HADOOP_COMMON_HOME/share/hadoop/common/*,
      $HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
      $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
      $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
      $HADOOP_YARN_HOME/share/hadoop/yarn/*,
      $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
      For Windows:
      %HADOOP_CONF_DIR%,
      %HADOOP_COMMON_HOME%/share/hadoop/common/*,
      %HADOOP_COMMON_HOME%/share/hadoop/common/lib/*,
      %HADOOP_HDFS_HOME%/share/hadoop/hdfs/*,
      %HADOOP_HDFS_HOME%/share/hadoop/hdfs/lib/*,
      %HADOOP_YARN_HOME%/share/hadoop/yarn/*,
      %HADOOP_YARN_HOME%/share/hadoop/yarn/lib/*
    </description>
    <name>yarn.application.classpath</name>
    <value></value>
  </property>


Re: Data Locality and WebHDFS

2014-03-17 Thread RJ Nowling
Thank you, Tsz.  That helps!


On Mon, Mar 17, 2014 at 2:30 PM, Tsz Wo Sze szets...@yahoo.com wrote:

 The file offset is considered in WebHDFS redirection.  It redirects to a
 datanode with the first block the client going to read, not the first block
 of the file.

 Hope it helps.
 Tsz-Wo


   On Monday, March 17, 2014 10:09 AM, Alejandro Abdelnur 
 t...@cloudera.com wrote:

 actually, i am wrong, the webhdfs rest call has an offset.

 Alejandro
 (phone typing)

 On Mar 17, 2014, at 10:07, Alejandro Abdelnur t...@cloudera.com wrote:

 dont recall how skips are handled in webhdfs, but i would assume that
 you'll get to the first block As usual, and the skip is handled by the DN
 serving the file (as webhdfs doesnot know at open that you'll skip)

 Alejandro
 (phone typing)

 On Mar 17, 2014, at 9:47, RJ Nowling rnowl...@gmail.com wrote:

 Hi Alejandro,

 The WebHDFS API allows specifying an offset and length for the request.
  If I specify an offset that start in the second block for a file (thus
 skipping the first block all together), will the namenode still direct me
 to a datanode with the first block or will it direct me to a namenode with
 the second block?  I.e., am I assured data locality only on the first block
 of the file (as you're saying) or on the first block I am accessing?

 If it is as you say, then I may want to reach out the WebHDFS developers
 and see if they would be interested in the additional functionality.

 Thank you,
 RJ


 On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur t...@cloudera.comwrote:

 I may have expressed myself wrong. You don't need to do any test to see
 how locality works with files of multiple blocks. If you are accessing a
 file of more than one block over webhdfs, you only have assured locality
 for the first block of the file.

 Thanks.


 On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling rnowl...@gmail.com wrote:

 Thank you, Mingjiang and Alejandro.

 This is interesting.  Since we will use the data locality information for
 scheduling, we could hack this to get the data locality information, at
 least for the first block.  As Alejandro says, we'd have to test what
 happens for other data blocks -- e.g., what if, knowing the block sizes, we
 request the second or third block?

 Interesting food for thought!  I see some experiments in my future!

 Thanks!


 On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur t...@cloudera.comwrote:

 well, this is for the first block of the file, the rest of the file
 (blocks being local or not) are streamed out by the same datanode. for
 small files (one block) you'll get locality, for large files only the first
 block, and by chance if other blocks are local to that datanode.


 Alejandro
 (phone typing)

 On Mar 16, 2014, at 18:53, Mingjiang Shi m...@gopivotal.com wrote:

 According to this page:
 http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/

 *Data Locality*: The file read and file write calls are redirected to the
 corresponding datanodes. It uses the full bandwidth of the Hadoop cluster
 for streaming data.
 *A HDFS Built-in Component*: WebHDFS is a first class built-in component
 of HDFS. It runs inside Namenodes and Datanodes, therefore, it can use all
 HDFS functionalities. It is a part of HDFS - there are no additional
 servers to install


 So it looks like the data locality is built-into webhdfs, client will be
 redirected to the data node automatically.




 On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling rnowl...@gmail.com wrote:

 Hi all,

 I'm writing up a Google Summer of Code proposal to add HDFS support to
 Disco, an Erlang MapReduce framework.

 We're interested in using WebHDFS.  I have two questions:

 1) Does WebHDFS allow querying data locality information?

 2) If the data locality information is known, can data on specific data
 nodes be accessed via Web HDFS?  Or do all Web HDFS requests have to go
 through a single server?

 Thanks,
 RJ

 --
 em rnowl...@gmail.com
 c 954.496.2314




 --
 Cheers
 -MJ




 --
 em rnowl...@gmail.com
 c 954.496.2314




 --
 Alejandro




 --
 em rnowl...@gmail.com
 c 954.496.2314






-- 
em rnowl...@gmail.com
c 954.496.2314


Re: issue with defaut yarn.application.classpath value from yarn-default.xml for hadoop-2.3.0

2014-03-17 Thread Akira AJISAKA

This is intentional.
See https://issues.apache.org/jira/browse/YARN-1138 for the details.

If you want to use the default value of this parameter for your application,
you should write the same parameter to your config file, or you can use
YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH instead of
reading yarn.application.classpath.

Thanks,
Akira
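
A minimal sketch of that second option, assuming a client that builds the container classpath itself:

    // Sketch: fall back to the compiled-in default classpath when
    // yarn.application.classpath is unset or empty in yarn-site.xml.
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnClasspath {
      public static String[] resolve(YarnConfiguration conf) {
        String[] cp = conf.getStrings(YarnConfiguration.YARN_APPLICATION_CLASSPATH);
        if (cp == null || cp.length == 0) {
          // The empty value in hadoop-2.3.0's yarn-default.xml lands here.
          cp = YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH;
        }
        return cp;
      }

      public static void main(String[] args) {
        for (String entry : resolve(new YarnConfiguration())) {
          System.out.println(entry);
        }
      }
    }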

(2014/03/17 14:38), REYANE OUKPEDJO wrote:

Hi,

I believe there is an issue with the yarn-default.xml setting
of yarn.application.classpath in hadoop-2.3.0. This parameter's default
is not set, and I have an application that fails because of that. Below
is part of the content of yarn-default.xml, which shows an empty
value for that parameter. When I checked the same file in hadoop-2.2.0, I
could see that a default value is set. Can anyone explain why this happens?
Should I open a bug report, or was this intentional?


Thanks

Reyane OUKPEDJO

  <property>
    <description>$HADOOP_CONF_DIR,
      $HADOOP_COMMON_HOME/share/hadoop/common/*,
      $HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
      $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
      $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
      $HADOOP_YARN_HOME/share/hadoop/yarn/*,
      $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
      CLASSPATH for YARN applications. A comma-separated list
      of CLASSPATH entries. When this value is empty, the following default
      CLASSPATH for YARN applications would be used.
      For Linux:
      $HADOOP_CONF_DIR,
      $HADOOP_COMMON_HOME/share/hadoop/common/*,
      $HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
      $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
      $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
      $HADOOP_YARN_HOME/share/hadoop/yarn/*,
      $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
      For Windows:
      %HADOOP_CONF_DIR%,
      %HADOOP_COMMON_HOME%/share/hadoop/common/*,
      %HADOOP_COMMON_HOME%/share/hadoop/common/lib/*,
      %HADOOP_HDFS_HOME%/share/hadoop/hdfs/*,
      %HADOOP_HDFS_HOME%/share/hadoop/hdfs/lib/*,
      %HADOOP_YARN_HOME%/share/hadoop/yarn/*,
      %HADOOP_YARN_HOME%/share/hadoop/yarn/lib/*
    </description>
    <name>yarn.application.classpath</name>
    <value></value>
  </property>





Re: how to unzip a .tar.bz2 file in hadoop/hdfs

2014-03-17 Thread Stanley Shi
Download it, unzip it, and put it back?

Regards,
*Stanley Shi,*



On Fri, Mar 14, 2014 at 5:44 PM, Sai Sai saigr...@yahoo.in wrote:

 Can someone please help:
 How do I unzip a .tar.bz2 file which is in hadoop/hdfs?
 Thanks
 Sai




Re: how to unzip a .tar.bz2 file in hadoop/hdfs

2014-03-17 Thread Anthony Mattas
You want to use the BZip2Codec to decompress (un-bzip2) the file and then use
FileUtil to untar it.



Anthony Mattas
anth...@mattas.net
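
A minimal sketch of that approach (the paths are assumptions, and the tar is staged on the local filesystem because FileUtil.unTar operates on local files):

    // Sketch: strip the .bz2 layer with BZip2Codec while copying out of HDFS,
    // then untar the local result with FileUtil.unTar. Paths are illustrative.
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.BZip2Codec;

    public class UnpackTarBz2 {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        Path src = new Path("/user/demo/archive.tar.bz2");  // assumed HDFS input
        File tar = new File("/tmp/archive.tar");            // local intermediate file
        File untarDir = new File("/tmp/archive-unpacked");  // local output dir

        BZip2Codec codec = new BZip2Codec();
        codec.setConf(conf);
        InputStream in = codec.createInputStream(hdfs.open(src));
        OutputStream out = new FileOutputStream(tar);
        try {
          IOUtils.copyBytes(in, out, 4096);  // writes the decompressed tar stream
        } finally {
          IOUtils.closeStream(in);
          IOUtils.closeStream(out);
        }
        untarDir.mkdirs();
        FileUtil.unTar(tar, untarDir);       // expands the tar on the local filesystem
      }
    }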


On Mon, Mar 17, 2014 at 10:06 PM, Stanley Shi s...@gopivotal.com wrote:

 download it, unzip and put it back?

 Regards,
 *Stanley Shi,*



 On Fri, Mar 14, 2014 at 5:44 PM, Sai Sai saigr...@yahoo.in wrote:

 Can some one please help:
 How to unzip a .tar.bz2 file which is in hadoop/hdfs
 Thanks
 Sai





Re: Problem of installing HDFS-385 and the usage

2014-03-17 Thread Stanley Shi
This JIRA has been included in the Apache code since versions
0.21.0 (https://issues.apache.org/jira/browse/HDFS/fixforversion/12314046),
1.2.0 (https://issues.apache.org/jira/browse/HDFS/fixforversion/12321657),
and 1-win (https://issues.apache.org/jira/browse/HDFS/fixforversion/12320362).
If you want to use it, you need to write your own policy; please see this
JIRA for an example: https://issues.apache.org/jira/browse/HDFS-3601


Regards,
*Stanley Shi,*



On Mon, Mar 17, 2014 at 11:31 AM, Eric Chiu ericchiu0...@gmail.com wrote:

 Hi all,

 Could anyone tell me how to install and use this hadoop plug-in?

 https://issues.apache.org/jira/browse/HDFS-385

 I read the code, but I do not know where to install it or what command to
 use to install it.

 Another problem is that there are .txt and .patch files; which one should
 be applied?

 Some of the .patch files have -win in the name; does that mean that file is
 for Windows Hadoop users? (I am using Ubuntu.)

 Thank you very much.



Re: I am about to lose all my data please help

2014-03-17 Thread Stanley Shi
One possible reason is that you didn't set the namenode working directory;
by default it's under the /tmp folder, and the /tmp folder might get cleaned
by the OS without any notification. If this is the case, I am afraid you
have lost all your namenode data.

<property>
  <name>dfs.name.dir</name>
  <value>${hadoop.tmp.dir}/dfs/name</value>
  <description>Determines where on the local filesystem the DFS name node
  should store the name table(fsimage).  If this is a comma-delimited list
  of directories then the name table is replicated in all of the
  directories, for redundancy. </description>
</property>


Regards,
*Stanley Shi,*



On Sun, Mar 16, 2014 at 5:29 PM, Mirko Kämpf mirko.kae...@gmail.com wrote:

 Hi,

 What is the location of the namenode's fsimage and edit logs?
 And how much memory does the NameNode have?

 Did you work with a Secondary NameNode or a Standby NameNode for
 checkpointing?

 Where are your HDFS blocks located, are those still safe?

 With this information at hand, one might be able to fix your setup, but do
 not format the old namenode before
 all is working with a fresh one.

 Grab a copy of the maintenance guide:
 http://shop.oreilly.com/product/0636920025085.do?sortby=publicationDate
 which helps solving such type of problems as well.

 Best wishes
 Mirko


 2014-03-16 9:07 GMT+00:00 Fatih Haltas fatih.hal...@nyu.edu:

 Dear All,

  I have just restarted the machines of my hadoop cluster. Now, I am trying to
  restart the hadoop cluster again, but I am getting an error on namenode restart. I am
  afraid of losing my data, as it was running properly for more than 3
  months. Currently, I believe that if I format the namenode it will work
  again; however, the data will be lost. Is there any way to solve this without
  losing the data?

 I will really appreciate any help.

 Thanks.


 =
  Here are the logs:
 
 2014-02-26 16:02:39,698 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
 /
 STARTUP_MSG: Starting NameNode
 STARTUP_MSG:   host = ADUAE042-LAP-V/127.0.0.1
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 1.0.4
 STARTUP_MSG:   build =
 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r
 1393290; compiled by 'hortonfo' on Wed Oct  3 05:13:58 UTC 2012
 /
 2014-02-26 16:02:40,005 INFO
 org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
 hadoop-metrics2.properties
 2014-02-26 16:02:40,019 INFO
 org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
 MetricsSystem,sub=Stats registered.
 2014-02-26 16:02:40,021 INFO
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
 period at 10 second(s).
 2014-02-26 16:02:40,021 INFO
 org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system
 started
 2014-02-26 16:02:40,169 INFO
 org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi
 registered.
 2014-02-26 16:02:40,193 INFO
 org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source jvm
 registered.
 2014-02-26 16:02:40,194 INFO
 org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
 NameNode registered.
 2014-02-26 16:02:40,242 INFO org.apache.hadoop.hdfs.util.GSet: VM type
 = 64-bit
 2014-02-26 16:02:40,242 INFO org.apache.hadoop.hdfs.util.GSet: 2% max
 memory = 17.77875 MB
 2014-02-26 16:02:40,242 INFO org.apache.hadoop.hdfs.util.GSet: capacity
= 2^21 = 2097152 entries
 2014-02-26 16:02:40,242 INFO org.apache.hadoop.hdfs.util.GSet:
 recommended=2097152, actual=2097152
 2014-02-26 16:02:40,273 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop
 2014-02-26 16:02:40,273 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
 2014-02-26 16:02:40,274 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 isPermissionEnabled=true
 2014-02-26 16:02:40,279 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 dfs.block.invalidate.limit=100
 2014-02-26 16:02:40,279 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s),
 accessTokenLifetime=0 min(s)
 2014-02-26 16:02:40,724 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
 FSNamesystemStateMBean and NameNodeMXBean
 2014-02-26 16:02:40,749 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names
 occuring more than 10 times
 2014-02-26 16:02:40,780 ERROR
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
 initialization failed.
 java.io.IOException: NameNode is not formatted.
 at
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:330)
 at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
 at