Re: Data Locality and WebHDFS
I may have expressed myself poorly. You don't need to run any test to see how locality works with files of multiple blocks: if you are accessing a file of more than one block over webhdfs, you only have assured locality for the first block of the file. Thanks.

On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling rnowl...@gmail.com wrote:

Thank you, Mingjiang and Alejandro. This is interesting. Since we will use the data locality information for scheduling, we could hack this to get the data locality information, at least for the first block. As Alejandro says, we'd have to test what happens for other data blocks -- e.g., what if, knowing the block sizes, we request the second or third block? Interesting food for thought! I see some experiments in my future! Thanks!

On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur t...@cloudera.com wrote:

Well, this is for the first block of the file; the rest of the file (blocks being local or not) is streamed out by the same datanode. For small files (one block) you'll get locality; for large files, only the first block, and, by chance, any other blocks that happen to be local to that datanode. Alejandro (phone typing)

On Mar 16, 2014, at 18:53, Mingjiang Shi m...@gopivotal.com wrote:

According to this page: http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/

*Data Locality*: The file read and file write calls are redirected to the corresponding datanodes. It uses the full bandwidth of the Hadoop cluster for streaming data.

*A HDFS Built-in Component*: WebHDFS is a first class built-in component of HDFS. It runs inside Namenodes and Datanodes, therefore it can use all HDFS functionalities. It is a part of HDFS - there are no additional servers to install.

So it looks like data locality is built into webhdfs; the client will be redirected to the data node automatically.

On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling rnowl...@gmail.com wrote:

Hi all, I'm writing up a Google Summer of Code proposal to add HDFS support to Disco, an Erlang MapReduce framework. We're interested in using WebHDFS. I have two questions:

1) Does WebHDFS allow querying data locality information?
2) If the data locality information is known, can data on specific data nodes be accessed via WebHDFS? Or do all WebHDFS requests have to go through a single server?

Thanks, RJ

-- em rnowl...@gmail.com c 954.496.2314
-- Cheers -MJ
-- em rnowl...@gmail.com c 954.496.2314
-- Alejandro
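For comparison, the block locality information being discussed is exposed directly by the Java FileSystem client API (this is the native HDFS client, not WebHDFS). The sketch below is only illustrative; the NameNode URI and file path are made-up placeholders.

// Sketch: list which hosts hold each block of an HDFS file via the Java API.
// "hdfs://namenode:8020" and "/user/rj/data.bin" are hypothetical placeholders.
import java.net.URI;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    Path file = new Path("/user/rj/data.bin");
    FileStatus status = fs.getFileStatus(file);
    // One BlockLocation per block in the requested byte range (here: the whole file).
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset=" + b.getOffset() + " length=" + b.getLength()
          + " hosts=" + Arrays.toString(b.getHosts()));
    }
  }
}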
Re: class org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto overrides final method getUnknownFields
Hi, thanks for your reply. What I did:

[speech@h14 mahout]$ /usr/share/apache-maven/bin/mvn -DskipTests clean install -Dhadoop2.profile=hadoop2

Is hadoop2 the right string? I found it in the pom's profile section, so I used it. ... it compiled:

[INFO] Reactor Summary:
[INFO] Mahout Build Tools ........... SUCCESS [  1.751 s]
[INFO] Apache Mahout ................ SUCCESS [  0.484 s]
[INFO] Mahout Math .................. SUCCESS [ 12.946 s]
[INFO] Mahout Core .................. SUCCESS [ 14.192 s]
[INFO] Mahout Integration ........... SUCCESS [  1.857 s]
[INFO] Mahout Examples .............. SUCCESS [ 10.762 s]
[INFO] Mahout Release Package ....... SUCCESS [  0.012 s]
[INFO] Mahout Math/Scala wrappers ... SUCCESS [ 25.431 s]
[INFO] Mahout Spark bindings ........ SUCCESS [ 40.376 s]
[INFO] BUILD SUCCESS
[INFO] Total time: 01:48 min
[INFO] Finished at: 2014-03-17T12:06:31+02:00
[INFO] Final Memory: 79M/2947M

How can I check whether the hadoop2 libs are in use? But unfortunately, again:

[speech@h14 ~]$ mahout/bin/mahout seqdirectory -c UTF-8 -i /user/speech/demo -o demo-seqfiles

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath. Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /home/speech/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
14/03/17 12:07:21 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/user/speech/demo], --keyPrefix=[], --method=[mapreduce], --output=[demo-seqfiles], --startPhase=[0], --tempDir=[temp]}
14/03/17 12:07:22 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/03/17 12:07:22 INFO Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
14/03/17 12:07:22 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/03/17 12:07:22 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/03/17 12:07:22 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/03/17 12:07:23 INFO input.FileInputFormat: Total input paths to process : 10
14/03/17 12:07:23 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 4, size left: 29775
14/03/17 12:07:23 INFO mapreduce.JobSubmitter: number of splits:1
14/03/17 12:07:23 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
14/03/17 12:07:23 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/03/17 12:07:23 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/03/17 12:07:23 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
14/03/17 12:07:23 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead,
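Regarding "how can I check whether the hadoop2 libs are in use": one quick check (a generic sketch, not Mahout-specific advice) is to print the Hadoop version actually present on the runtime classpath with org.apache.hadoop.util.VersionInfo, running the class through the same launcher the job uses (e.g. hadoop jar). The class name below is made up.

// Minimal sketch: print which Hadoop version the current classpath provides.
import org.apache.hadoop.util.VersionInfo;

public class WhichHadoop {
  public static void main(String[] args) {
    System.out.println("Hadoop version: " + VersionInfo.getVersion());
    System.out.println("Built from revision: " + VersionInfo.getRevision());
  }
}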
Re: class org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto overrides final method getUnknownFields
Okay, sorry for the mess.

[speech@h14 mahout]$ /usr/share/apache-maven/bin/mvn -DskipTests clean install -Dhadoop2.version=2.2.0

did the trick.

Best regards, Margus (Margusja) Roo +372 51 48 780 http://margus.roo.ee http://ee.linkedin.com/in/margusroo skype: margusja
Benchmark Failure
Hi all, I'm running the jobclient tests (on a single node). Other tests like TestDFSIO and mrbench succeed, but nnbench does not. I got a lot of exceptions, but without any explanation (see below). Could anyone tell me what might have gone wrong? Thanks!

14/03/17 23:54:22 INFO hdfs.NNBench: Waiting in barrier for: 112819 ms
14/03/17 23:54:23 INFO mapreduce.Job: Job job_local2133868569_0001 running in uber mode : false
14/03/17 23:54:23 INFO mapreduce.Job: map 0% reduce 0%
14/03/17 23:54:28 INFO mapred.LocalJobRunner: hdfs://0.0.0.0:9000/benchmarks/NNBench-aolx-PC/control/NNBench_Controlfile_10:0+125 map
14/03/17 23:54:29 INFO mapreduce.Job: map 6% reduce 0%
14/03/17 23:56:15 INFO hdfs.NNBench: Exception recorded in op: Create/Write/Close
14/03/17 23:56:15 INFO hdfs.NNBench: Exception recorded in op: Create/Write/Close
14/03/17 23:56:15 INFO hdfs.NNBench: Exception recorded in op: Create/Write/Close
14/03/17 23:56:15 INFO hdfs.NNBench: Exception recorded in op: Create/Write/Close
(1000 Exceptions)
. . .

results:

File System Counters
  FILE: Number of bytes read=18769411
  FILE: Number of bytes written=21398315
  FILE: Number of read operations=0
  FILE: Number of large read operations=0
  FILE: Number of write operations=0
  HDFS: Number of bytes read=11185
  HDFS: Number of bytes written=19540
  HDFS: Number of read operations=325
  HDFS: Number of large read operations=0
  HDFS: Number of write operations=13210
Map-Reduce Framework
  Map input records=12
  Map output records=95
  Map output bytes=1829
  Map output materialized bytes=2091
  Input split bytes=1538
  Combine input records=0
  Combine output records=0
  Reduce input groups=8
  Reduce shuffle bytes=0
  Reduce input records=95
  Reduce output records=8
  Spilled Records=214
  Shuffled Maps =0
  Failed Shuffles=0
  Merged Map outputs=0
  GC time elapsed (ms)=211
  CPU time spent (ms)=0
  Physical memory (bytes) snapshot=0
  Virtual memory (bytes) snapshot=0
  Total committed heap usage (bytes)=4401004544
File Input Format Counters
  Bytes Read=1490
File Output Format Counters
  Bytes Written=170

14/03/17 23:56:18 INFO hdfs.NNBench: -- NNBench -- :
14/03/17 23:56:18 INFO hdfs.NNBench: Version: NameNode Benchmark 0.4
14/03/17 23:56:18 INFO hdfs.NNBench: Date time: 2014-03-17 23:56:18,619
14/03/17 23:56:18 INFO hdfs.NNBench:
14/03/17 23:56:18 INFO hdfs.NNBench: Test Operation: create_write
14/03/17 23:56:18 INFO hdfs.NNBench: Start time: 2014-03-17 23:56:15,521
14/03/17 23:56:18 INFO hdfs.NNBench: Maps to run: 12
14/03/17 23:56:18 INFO hdfs.NNBench: Reduces to run: 6
14/03/17 23:56:18 INFO hdfs.NNBench: Block Size (bytes): 1
14/03/17 23:56:18 INFO hdfs.NNBench: Bytes to write: 0
14/03/17 23:56:18 INFO hdfs.NNBench: Bytes per checksum: 1
14/03/17 23:56:18 INFO hdfs.NNBench: Number of files: 1000
14/03/17 23:56:18 INFO hdfs.NNBench: Replication factor: 3
14/03/17 23:56:18 INFO hdfs.NNBench: Successful file operations: 0
14/03/17 23:56:18 INFO hdfs.NNBench:
14/03/17 23:56:18 INFO hdfs.NNBench: # maps that missed the barrier: 11
14/03/17 23:56:18 INFO hdfs.NNBench: # exceptions: 1000
14/03/17 23:56:18 INFO hdfs.NNBench:
14/03/17 23:56:18 INFO hdfs.NNBench: TPS: Create/Write/Close: 0
14/03/17 23:56:18 INFO hdfs.NNBench: Avg exec time (ms): Create/Write/Close: Infinity
14/03/17 23:56:18 INFO hdfs.NNBench: Avg Lat (ms): Create/Write: NaN
14/03/17 23:56:18 INFO hdfs.NNBench: Avg Lat (ms): Close: NaN
14/03/17 23:56:18 INFO hdfs.NNBench:
14/03/17 23:56:18 INFO hdfs.NNBench: RAW DATA: AL Total #1: 0
14/03/17 23:56:18 INFO hdfs.NNBench: RAW DATA: AL Total #2: 0
14/03/17 23:56:18 INFO hdfs.NNBench: RAW DATA: TPS Total (ms): 1131
14/03/17 23:56:18 INFO hdfs.NNBench: RAW DATA: Longest Map Time (ms): 1.395071776653E12
14/03/17 23:56:18 INFO hdfs.NNBench: RAW DATA: Late maps: 11
14/03/17 23:56:18 INFO hdfs.NNBench: RAW DATA: # of exceptions: 1000
14/03/17 23:56:18 INFO hdfs.NNBench:
Re: Data Locality and WebHDFS
Hi Alejandro,

The WebHDFS API allows specifying an offset and length for the request. If I specify an offset that starts in the second block of a file (thus skipping the first block altogether), will the namenode still direct me to a datanode with the first block, or will it direct me to a datanode with the second block? I.e., am I assured data locality only on the first block of the file (as you're saying), or on the first block I am accessing?

If it is as you say, then I may want to reach out to the WebHDFS developers and see if they would be interested in adding that functionality.

Thank you, RJ

On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur t...@cloudera.com wrote:

I may have expressed myself poorly. You don't need to run any test to see how locality works with files of multiple blocks: if you are accessing a file of more than one block over webhdfs, you only have assured locality for the first block of the file. Thanks.

-- em rnowl...@gmail.com c 954.496.2314
Re: Data Locality and WebHDFS
I don't recall how skips are handled in webhdfs, but I would assume that you'll get sent to the first block as usual, and the skip is handled by the DN serving the file (as webhdfs does not know at open time that you'll skip). Alejandro (phone typing)
Writing Bytes Directly to Distributed Cache?
Hello, I was wondering if anyone might know of a way to write bytes directly to the distributed cache. I know I can call job.addCacheFile(URI uri), but in my case the file I wish to add to the cache is in memory and is job specific. I would prefer not to write it to a location whose cleanup I then have to manage after job execution simply to get it into the cache. If I cannot use the distributed cache to do this, is there another way I could accomplish the same thing using m/r's cleanup logic? Any pointers would be most appreciated. Thanks!
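No answer appears in this digest. As a sketch of one common workaround (not an authoritative answer), the in-memory bytes can be staged to a job-specific HDFS path, registered with the cache via addCacheFile, and marked for best-effort deletion when the client JVM exits. The staging path, payload, and job setup below are hypothetical placeholders.

// Sketch: stage in-memory bytes into HDFS, add them to the distributed cache,
// and ask the client-side FileSystem to delete the staging file on JVM exit.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class CacheBytesExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "cache-bytes-example");

    byte[] payload = "job specific data".getBytes("UTF-8");  // the in-memory bytes
    FileSystem fs = FileSystem.get(conf);
    Path staged = new Path("/tmp/" + job.getJobName() + "-" + System.currentTimeMillis() + ".bin");

    FSDataOutputStream out = fs.create(staged, true);
    out.write(payload);
    out.close();

    fs.deleteOnExit(staged);          // best-effort cleanup when this client JVM exits
    job.addCacheFile(staged.toUri()); // tasks read it from their local cache copy
    // ... set mapper/reducer/input/output here, then job.waitForCompletion(true);
  }
}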
Re: Data Locality and WebHDFS
The file offset is considered in WebHDFS redirection: it redirects to a datanode with the first block the client is going to read, not the first block of the file. Hope it helps. Tsz-Wo

On Monday, March 17, 2014 10:09 AM, Alejandro Abdelnur t...@cloudera.com wrote:

Actually, I am wrong; the webhdfs REST call has an offset. Alejandro (phone typing)
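A minimal sketch of how one could observe this behavior (the NameNode host, port 50070, file path, and offset are assumed placeholder values): issue a WebHDFS OPEN request with an offset but do not follow the redirect, then read the Location header to see which datanode the NameNode chose for that offset.

// Sketch: ask the NameNode's WebHDFS endpoint to OPEN a file at a given offset,
// suppress the redirect, and print the datanode URL from the Location header.
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsRedirectProbe {
  public static void main(String[] args) throws Exception {
    long offset = 256L * 1024 * 1024;  // e.g. somewhere past the first blocks
    URL url = new URL("http://namenode:50070/webhdfs/v1/user/rj/data.bin"
        + "?op=OPEN&offset=" + offset);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setInstanceFollowRedirects(false);  // we only want the redirect target
    conn.setRequestMethod("GET");
    System.out.println("HTTP " + conn.getResponseCode());
    System.out.println("Redirected to: " + conn.getHeaderField("Location"));
    conn.disconnect();
  }
}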
issue with default yarn.application.classpath value from yarn-default.xml for hadoop-2.3.0
Hi, I believe there is an issue with the yarn-default.xml setting of yarn.application.classpath in hadoop-2.3.0. This parameter's default is not set, and I have an application that fails because of that. Below is part of the content of yarn-default.xml, which shows an empty value for that parameter. When I checked the same file in hadoop-2.2.0, I can see that a default value is set. Can anyone explain why this happens? Should I open a bug report, or was this intentional? Thanks, Reyane OUKPEDJO

<property>
  <description>
    CLASSPATH for YARN applications. A comma-separated list of CLASSPATH entries.
    When this value is empty, the following default CLASSPATH for YARN applications would be used.
    For Linux:
      $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, $HADOOP_COMMON_HOME/share/hadoop/common/lib/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*, $HADOOP_YARN_HOME/share/hadoop/yarn/*, $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
    For Windows:
      %HADOOP_CONF_DIR%, %HADOOP_COMMON_HOME%/share/hadoop/common/*, %HADOOP_COMMON_HOME%/share/hadoop/common/lib/*, %HADOOP_HDFS_HOME%/share/hadoop/hdfs/*, %HADOOP_HDFS_HOME%/share/hadoop/hdfs/lib/*, %HADOOP_YARN_HOME%/share/hadoop/yarn/*, %HADOOP_YARN_HOME%/share/hadoop/yarn/lib/*
  </description>
  <name>yarn.application.classpath</name>
  <value></value>
</property>
Re: Data Locality and WebHDFS
Thank you, Tsz. That helps!

On Mon, Mar 17, 2014 at 2:30 PM, Tsz Wo Sze szets...@yahoo.com wrote:

The file offset is considered in WebHDFS redirection: it redirects to a datanode with the first block the client is going to read, not the first block of the file. Hope it helps. Tsz-Wo

-- em rnowl...@gmail.com c 954.496.2314
Re: issue with default yarn.application.classpath value from yarn-default.xml for hadoop-2.3.0
This is intentional. See https://issues.apache.org/jira/browse/YARN-1138 for the details. If you want to use the default parameter for your application, you should write the same parameter into your config file, or you can use YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH instead of reading yarn.application.classpath. Thanks, Akira
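A minimal sketch of the second suggestion, assuming a Java YARN client or ApplicationMaster (the environment-map usage and the ':' separator are illustrative, Linux-style): read yarn.application.classpath and fall back to the compiled-in default when the config value is empty.

// Sketch: build a container CLASSPATH that falls back to the compiled-in default
// when yarn.application.classpath is not set in the cluster configuration.
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AppClasspath {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    String[] cp = conf.getStrings(
        YarnConfiguration.YARN_APPLICATION_CLASSPATH,
        YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH);

    StringBuilder classpath = new StringBuilder("./*");
    for (String entry : cp) {
      classpath.append(':').append(entry.trim());
    }

    // Typically this string goes into the container launch context environment.
    Map<String, String> env = new HashMap<String, String>();
    env.put("CLASSPATH", classpath.toString());
    System.out.println(env.get("CLASSPATH"));
  }
}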
Re: how to unzip a .tar.bz2 file in hadoop/hdfs
Download it, unzip it, and put it back? Regards, Stanley Shi

On Fri, Mar 14, 2014 at 5:44 PM, Sai Sai saigr...@yahoo.in wrote:

Can someone please help: how to unzip a .tar.bz2 file which is in hadoop/hdfs? Thanks, Sai
Re: how to unzip a .tar.bz2 file in hadoop/hdfs
You want to use the BZip2Codec to un-bzip the file and then use FileUtil to untar it. Anthony Mattas anth...@mattas.net

On Mon, Mar 17, 2014 at 10:06 PM, Stanley Shi s...@gopivotal.com wrote:

Download it, unzip it, and put it back? Regards, Stanley Shi

On Fri, Mar 14, 2014 at 5:44 PM, Sai Sai saigr...@yahoo.in wrote:

Can someone please help: how to unzip a .tar.bz2 file which is in hadoop/hdfs? Thanks, Sai
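A rough sketch of that recipe (all paths are hypothetical, and it stages through the local filesystem, so it assumes the archive fits on local disk): decompress the .bz2 stream from HDFS with BZip2Codec, untar the result locally with FileUtil.unTar, then copy the extracted tree back to HDFS.

// Sketch: HDFS .tar.bz2 -> local .tar (BZip2Codec) -> local untar (FileUtil.unTar)
// -> copy the extracted directory back to HDFS. Paths are placeholders.
import java.io.File;
import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.util.ReflectionUtils;

public class UntarBz2FromHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path src = new Path("/data/archive.tar.bz2");        // compressed archive in HDFS
    File localTar = new File("/tmp/archive.tar");        // decompressed tar on local disk
    File untarDir = new File("/tmp/archive-extracted");  // local extraction directory
    Path dst = new Path("/data/archive-extracted");      // destination in HDFS

    BZip2Codec codec = ReflectionUtils.newInstance(BZip2Codec.class, conf);
    IOUtils.copyBytes(codec.createInputStream(fs.open(src)),
        new FileOutputStream(localTar), conf, true);     // decompress .bz2 -> .tar

    untarDir.mkdirs();
    FileUtil.unTar(localTar, untarDir);                  // untar on the local filesystem

    fs.copyFromLocalFile(new Path(untarDir.getAbsolutePath()), dst);
  }
}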
Re: Problem of installing HDFS-385 and the usage
This JIRA has been included in the Apache code since versions 0.21.0 (https://issues.apache.org/jira/browse/HDFS/fixforversion/12314046), 1.2.0 (https://issues.apache.org/jira/browse/HDFS/fixforversion/12321657), and 1-win (https://issues.apache.org/jira/browse/HDFS/fixforversion/12320362). If you want to use it, you need to write your own policy; please see this JIRA for an example: https://issues.apache.org/jira/browse/HDFS-3601 Regards, Stanley Shi

On Mon, Mar 17, 2014 at 11:31 AM, Eric Chiu ericchiu0...@gmail.com wrote:

Hi all, could anyone tell me how to install and use this hadoop plug-in? https://issues.apache.org/jira/browse/HDFS-385 I read the code but do not know where to install it and what command to use to install it all. Another problem is that there are .txt and .patch files; which one should be applied? Some of the .patch files have -win in the name; does that mean that file is for Windows hadoop users? (I am using Ubuntu.) Thank you very much.
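For reference, a minimal sketch of how a custom policy is typically plugged in once it has been written (the policy class name is hypothetical, and the config key shown is the one commonly used for the pluggable placement feature; verify it against your Hadoop version's hdfs-default.xml before relying on it).

// Sketch: point HDFS at a custom block placement policy class on the NameNode.
// "com.example.MyPlacementPolicy" is a made-up class that would extend HDFS's
// BlockPlacementPolicy; equivalent to setting the property in hdfs-site.xml.
import org.apache.hadoop.conf.Configuration;

public class ConfigurePlacementPolicy {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("dfs.block.replicator.classname", "com.example.MyPlacementPolicy");
    System.out.println(conf.get("dfs.block.replicator.classname"));
  }
}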
Re: I am about to lose all my data please help
One possible reason is that you didn't set the namenode working directory; by default it's in the /tmp folder, and the /tmp folder might get deleted by the OS without any notification. If this is the case, I am afraid you have lost all your namenode data.

<property>
  <name>dfs.name.dir</name>
  <value>${hadoop.tmp.dir}/dfs/name</value>
  <description>Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
</property>

Regards, Stanley Shi

On Sun, Mar 16, 2014 at 5:29 PM, Mirko Kämpf mirko.kae...@gmail.com wrote:

Hi, what is the location of the namenode's fsimage and edit logs? And how much memory does the NameNode have? Did you work with a Secondary NameNode or a Standby NameNode for checkpointing? Where are your HDFS blocks located, and are those still safe? With this information at hand, one might be able to fix your setup, but do not format the old namenode before everything is working with a fresh one. Grab a copy of the maintenance guide: http://shop.oreilly.com/product/0636920025085.do?sortby=publicationDate which helps with solving this type of problem as well. Best wishes, Mirko

2014-03-16 9:07 GMT+00:00 Fatih Haltas fatih.hal...@nyu.edu:

Dear All, I have just restarted the machines of my hadoop cluster. Now I am trying to restart the hadoop cluster again, but I am getting an error on namenode restart. I am afraid of losing my data, as it was running properly for more than 3 months. Currently, I believe that if I format the namenode it will work again, but the data will be lost. Is there any way to solve this without losing the data? I will really appreciate any help. Thanks.

= Here are the logs:

2014-02-26 16:02:39,698 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ADUAE042-LAP-V/127.0.0.1
STARTUP_MSG: args = []
STARTUP_MSG: version = 1.0.4
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290; compiled by 'hortonfo' on Wed Oct 3 05:13:58 UTC 2012
/
2014-02-26 16:02:40,005 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2014-02-26 16:02:40,019 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
2014-02-26 16:02:40,021 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2014-02-26 16:02:40,021 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started
2014-02-26 16:02:40,169 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
2014-02-26 16:02:40,193 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source jvm registered.
2014-02-26 16:02:40,194 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source NameNode registered.
2014-02-26 16:02:40,242 INFO org.apache.hadoop.hdfs.util.GSet: VM type = 64-bit
2014-02-26 16:02:40,242 INFO org.apache.hadoop.hdfs.util.GSet: 2% max memory = 17.77875 MB
2014-02-26 16:02:40,242 INFO org.apache.hadoop.hdfs.util.GSet: capacity = 2^21 = 2097152 entries
2014-02-26 16:02:40,242 INFO org.apache.hadoop.hdfs.util.GSet: recommended=2097152, actual=2097152
2014-02-26 16:02:40,273 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop
2014-02-26 16:02:40,273 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2014-02-26 16:02:40,274 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2014-02-26 16:02:40,279 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.block.invalidate.limit=100
2014-02-26 16:02:40,279 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
2014-02-26 16:02:40,724 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStateMBean and NameNodeMXBean
2014-02-26 16:02:40,749 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times
2014-02-26 16:02:40,780 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.IOException: NameNode is not formatted.
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:330)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
        at