Sending the entire file content as value to the mapper
Hi Team, I have a file which has semi-structured text data with no definite start and end points. How can I send the entire content of the file to the mapper at once, as the key or value, instead of line by line? Thanks, Subbu
RE: Sending the entire file content as value to the mapper
Hi Subbu. Sounds like you'll have to implement a custom non-splittable InputFormat which instantiates a custom RecordReader, which in turn consumes the entire file when its next(K,V) method is called. Once implemented, you specify the input format on the JobConf object: conf.setInputFormat(MyInputFormat.class); -Chuck
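For what it's worth, here is a minimal sketch of what Chuck describes, against the old mapred API that his conf.setInputFormat() call implies. The class names (WholeFileInputFormat, WholeFileRecordReader) are made up for illustration; each mapper then receives the whole file as one BytesWritable value under a NullWritable key:

import java.io.IOException;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Non-splittable input format: one mapper per file.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false; // never split the file
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }
}

class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
  private final FileSplit split;
  private final JobConf conf;
  private boolean processed = false;

  WholeFileRecordReader(FileSplit split, JobConf conf) {
    this.split = split;
    this.conf = conf;
  }

  // Reads the whole file into the value on the first call, then reports end of input.
  public boolean next(NullWritable key, BytesWritable value) throws IOException {
    if (processed) return false;
    byte[] contents = new byte[(int) split.getLength()];
    Path file = split.getPath();
    FSDataInputStream in = file.getFileSystem(conf).open(file);
    try {
      IOUtils.readFully(in, contents, 0, contents.length);
      value.set(contents, 0, contents.length);
    } finally {
      IOUtils.closeStream(in);
    }
    processed = true;
    return true;
  }

  public NullWritable createKey() { return NullWritable.get(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() { return processed ? split.getLength() : 0; }
  public float getProgress() { return processed ? 1.0f : 0.0f; }
  public void close() throws IOException {}
}

Note this loads the whole file into mapper memory, so it only makes sense for files well under the task heap size.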
RE: Sending the entire file content as value to the mapper
Hi, You could send the file meta info to the map function as key/value through the split, and then read the entire file in your map function. Thanks, Devaraj K
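A sketch of that alternative, also with the old API. It assumes, hypothetically, that the job's input is a small listing file with one HDFS path per line (plain TextInputFormat), so each map() call receives a path as its value and reads that file itself:

import java.io.IOException;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Each input value is assumed to hold one HDFS file path (hypothetical setup).
public class WholeFileByPathMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, BytesWritable> {

  private JobConf conf;

  @Override
  public void configure(JobConf job) { this.conf = job; }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, BytesWritable> output, Reporter reporter) throws IOException {
    Path file = new Path(value.toString());                 // value = one file path
    FileSystem fs = file.getFileSystem(conf);
    byte[] contents = new byte[(int) fs.getFileStatus(file).getLen()];
    FSDataInputStream in = fs.open(file);
    try {
      IOUtils.readFully(in, contents, 0, contents.length);  // read the whole file
    } finally {
      IOUtils.closeStream(in);
    }
    output.collect(value, new BytesWritable(contents));     // path -> file contents
  }
}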
Re: Task failure in slave node
Hi, It seems mahout-examples-0.7-job.jar depends on other jars/classes. While running the job's tasks it is not able to find those classes on the classpath, so the tasks fail. You need to provide the dependent jar files when submitting/running the job. Thanks, Devaraj K
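For example (a sketch with hypothetical paths; whether BuildForest's main class parses generic options is an assumption here):

# -libjars ships the listed jars to the tasks' classpaths, but only works when the
# main class passes its arguments through GenericOptionsParser/ToolRunner:
bin/hadoop jar mahout-examples-0.7-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -libjars /path/to/mahout-core-0.7.jar,/path/to/mahout-math-0.7.jar ...

# HADOOP_CLASSPATH only fixes the client-side JVM, not the task JVMs:
export HADOOP_CLASSPATH=/path/to/mahout-core-0.7.jar:/path/to/mahout-math-0.7.jar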
Datanodes using public ip, why?
Hello Hadoop Community! I've set up datanodes on a private network by adding private hostnames to the slaves file, but it looks like when I look at the web UI the datanodes are registered with public hostnames. Are they actually communicating over the public network? All datanodes have eth0 with a public address and eth1 with a private address. What am I missing? Thanks a whole lot. Benjamin Kim (benkimkimben at gmail)
Re: Datanodes using public ip, why?
Have you tried playing with the config parameter dfs.datanode.dns.interface?
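That property tells the datanode which network interface to derive its advertised hostname from. A sketch for hdfs-site.xml on each datanode, assuming the private network sits on eth1 as in Ben's setup:

<property>
  <name>dfs.datanode.dns.interface</name>
  <!-- report the hostname bound to the private interface instead of the default -->
  <value>eth1</value>
</property>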
Re: Datanodes using public ip, why?
Make sure that your hostnames resolve (via DNS and/or hosts files) to the private IPs. If you have records in the nodes' hosts files like "public-IP hostname", remove (or comment out) them. Alex
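For example, with hypothetical addresses, each node's /etc/hosts would end up looking like:

10.0.1.11    dn1    # private eth1 address: keep this mapping
# 54.210.3.7 dn1    # public eth0 mapping: comment out or remove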
Re: ConnectionException in container, happens only sometimes
Here are logs of the RM and the 2 NMs:

RM (master-host): http://pastebin.com/q4qJP8Ld
NM where the AM ran (slave-1-host): http://pastebin.com/vSsz7mjG
NM where the slave container ran (slave-2-host): http://pastebin.com/NMFi6gRp

The only related error I've found in them is the following (from the RM logs):

...
2013-07-11 07:46:06,225 ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AppAttemptId doesnt exist in cache appattempt_1373465780870_0005_01
2013-07-11 07:46:06,227 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call org.apache.hadoop.yarn.api.AMRMProtocolPB.allocate from 10.128.40.184:47101: output error
2013-07-11 07:46:06,228 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8030 caught an exception
java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:456)
    at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2140)
    at org.apache.hadoop.ipc.Server.access$2000(Server.java:108)
    at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:939)
    at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1005)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1747)
2013-07-11 07:46:11,238 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved my_user to /default-rack
2013-07-11 07:46:11,283 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node my_user(cmPort: 59267 httpPort: 8042) registered with capability: 8192, assigned nodeId my_user:59267
...

Though from the stack trace it's hard to tell where this error came from. Let me know if you need any more information.

On Thu, Jul 11, 2013 at 1:00 AM, Andrei faithlessfri...@gmail.com wrote: Hi Omkar, I'm out of the office now, so I'll post them as soon as I get back. Thanks

On Thu, Jul 11, 2013 at 12:39 AM, Omkar Joshi ojo...@hortonworks.com wrote: can you post RM/NM logs too? Thanks, Omkar Joshi, Hortonworks Inc. http://www.hortonworks.com
Task failure in slave node
Hi

I have two nodes: n1 (master, slave) and n2 (slave). After setup I ran the wordcount example and it worked fine:

[hduser@n1 ~]$ hadoop jar /usr/local/hadoop/hadoop-examples-1.0.4.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
13/07/11 15:30:44 INFO input.FileInputFormat: Total input paths to process : 7
13/07/11 15:30:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/07/11 15:30:44 WARN snappy.LoadSnappy: Snappy native library not loaded
13/07/11 15:30:44 INFO mapred.JobClient: Running job: job_201307111355_0015
13/07/11 15:30:45 INFO mapred.JobClient:  map 0% reduce 0%
13/07/11 15:31:03 INFO mapred.JobClient:  map 42% reduce 0%
13/07/11 15:31:06 INFO mapred.JobClient:  map 57% reduce 0%
13/07/11 15:31:09 INFO mapred.JobClient:  map 71% reduce 0%
13/07/11 15:31:15 INFO mapred.JobClient:  map 100% reduce 0%
13/07/11 15:31:18 INFO mapred.JobClient:  map 100% reduce 23%
13/07/11 15:31:27 INFO mapred.JobClient:  map 100% reduce 100%
13/07/11 15:31:32 INFO mapred.JobClient: Job complete: job_201307111355_0015
13/07/11 15:31:32 INFO mapred.JobClient: Counters: 30
13/07/11 15:31:32 INFO mapred.JobClient:   Job Counters
13/07/11 15:31:32 INFO mapred.JobClient:     Launched reduce tasks=1
13/07/11 15:31:32 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=67576
13/07/11 15:31:32 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/11 15:31:32 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/11 15:31:32 INFO mapred.JobClient:     Rack-local map tasks=3
13/07/11 15:31:32 INFO mapred.JobClient:     Launched map tasks=7
13/07/11 15:31:32 INFO mapred.JobClient:     Data-local map tasks=4
13/07/11 15:31:32 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=21992
13/07/11 15:31:32 INFO mapred.JobClient:   File Output Format Counters
13/07/11 15:31:32 INFO mapred.JobClient:     Bytes Written=1412505
13/07/11 15:31:32 INFO mapred.JobClient:   FileSystemCounters
13/07/11 15:31:32 INFO mapred.JobClient:     FILE_BYTES_READ=5414195
13/07/11 15:31:32 INFO mapred.JobClient:     HDFS_BYTES_READ=6950820
13/07/11 15:31:32 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=8744993
13/07/11 15:31:32 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1412505
13/07/11 15:31:32 INFO mapred.JobClient:   File Input Format Counters
13/07/11 15:31:32 INFO mapred.JobClient:     Bytes Read=6950001
13/07/11 15:31:32 INFO mapred.JobClient:   Map-Reduce Framework
13/07/11 15:31:32 INFO mapred.JobClient:     Map output materialized bytes=3157469
13/07/11 15:31:32 INFO mapred.JobClient:     Map input records=137146
13/07/11 15:31:32 INFO mapred.JobClient:     Reduce shuffle bytes=2904836
13/07/11 15:31:32 INFO mapred.JobClient:     Spilled Records=594764
13/07/11 15:31:32 INFO mapred.JobClient:     Map output bytes=11435849
13/07/11 15:31:32 INFO mapred.JobClient:     Total committed heap usage (bytes)=1128136704
13/07/11 15:31:32 INFO mapred.JobClient:     CPU time spent (ms)=18230
13/07/11 15:31:32 INFO mapred.JobClient:     Combine input records=1174991
13/07/11 15:31:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=819
13/07/11 15:31:32 INFO mapred.JobClient:     Reduce input records=218990
13/07/11 15:31:32 INFO mapred.JobClient:     Reduce input groups=128513
13/07/11 15:31:32 INFO mapred.JobClient:     Combine output records=218990
13/07/11 15:31:32 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1179656192
13/07/11 15:31:32 INFO mapred.JobClient:     Reduce output records=128513
13/07/11 15:31:32 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=22992117760
13/07/11 15:31:32 INFO mapred.JobClient:     Map output records=1174991

From the web interface (http://n1:50030/) I saw that both (n1 and n2) were used, without any errors. Problems appear if I try to use the following command on the master (n1):

[hduser@n1 ~]$ hadoop jar mahout-distribution-0.7/mahout-examples-0.7-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -p -d testdata/bal_ee_2009.csv -ds testdata/bal_ee_2009.csv.info -sl 10 -o bal_ee_2009_out -t 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [file:/usr/local/hadoop-1.0.4/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-1.0.4/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
13/07/11 15:36:50 INFO mapreduce.BuildForest: Partial Mapred implementation
13/07/11 15:36:50 INFO mapreduce.BuildForest: Building the forest...
13/07/11 15:36:50 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/07/11 15:36:50 INFO input.FileInputFormat: Total input paths to process : 1
13/07/11 15:36:50 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/07/11 15:36:50 WARN snappy.LoadSnappy: Snappy native
copy files from ftp to hdfs in parallel, distcp failed
Hi, I am running HDFS on Amazon EC2. Say I have an FTP server that stores some data. I just want to copy these data directly to HDFS in a parallel way (which may be more efficient). I think hadoop distcp is what I need. But

$ bin/hadoop distcp ftp://username:passwd@hostname/some/path/ hdfs://namenode/some/path

doesn't work:

13/07/05 16:13:46 INFO tools.DistCp: srcPaths=[ftp://username:passwd@hostname/some/path/]
13/07/05 16:13:46 INFO tools.DistCp: destPath=hdfs://namenode/some/path
Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source ftp://username:passwd@hostname/some/path/ does not exist.
    at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:641)
    at org.apache.hadoop.tools.DistCp.copy(DistCp.java:656)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)

I checked the path by copying the FTP URL into Chrome, and the file really exists; I can even download it. I then tried to list the files under the path with:

$ bin/hadoop dfs -ls ftp://username:passwd@hostname/some/path/

It ends with:

ls: Cannot access ftp://username:passwd@hostname/some/path/: No such file or directory.

That seems to be the same problem. Any workaround here? Thank you in advance. Hao. -- Hao Ren, ClaraVista, www.claravista.fr
Cloudera links and Document
Hi All, Can anyone point me to a link or document that explains the below? How does Cloudera Manager work and handle the clusters (Agent and Master Server)? How does the Cloudera Manager process flow work? Where can I locate the Cloudera configuration files, with a brief explanation? Regards, Sathish
Re: Task failure in slave node
Sorry for the typo: mahout, not mahou. Sent from mobile.

On Jul 11, 2013 9:40 PM, Azuryy Yu azury...@gmail.com wrote: hi, put all mahou jars under hadoop_home/lib, then restart cluster.
Re: Cloudera links and Document
Hi, go through these links:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Managing-Clusters/cmmc_CM_architecture.html
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Installation-Guide/cmig_installing_configuring_dbs.html

From, Ramesh.
Re: Task failure in slave node
Thank you, it resolved the problem. Funny, I don't remember copying the mahout libs into n1's hadoop, but there they are.

Regards,
Margus (Margusja) Roo
+372 51 48 780
http://margus.roo.ee
skype: margusja

On 7/11/13 4:41 PM, Azuryy Yu wrote: sorry for typo, mahout, not mahou. sent from mobile On Jul 11, 2013 9:40 PM, Azuryy Yu azury...@gmail.com wrote: hi, put all mahou jars under hadoop_home/lib, then restart cluster.
RE: New Distributed Cache
So in my driver code, I try to store the file in the cache with this line of code:

job.addCacheFile(new URI("file location"));

Then in my Mapper code, I do this to try to access the cached file:

URI[] localPaths = context.getCacheFiles();
File f = new File(localPaths[0]);

However, I get a NullPointerException when I do that in the Mapper code. Any suggestions?

Andrew

From: Shahab Yunus [shahab.yu...@gmail.com] Sent: Wednesday, July 10, 2013 9:43 PM To: user@hadoop.apache.org Subject: Re: New Distributed Cache

Also, once you have the array of URIs after calling getCacheFiles, you can iterate over them using the File class or Path (http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/Path.html#Path(java.net.URI)). Regards, Shahab

On Wed, Jul 10, 2013 at 5:08 PM, Omkar Joshi ojo...@hortonworks.com wrote: did you try JobContext.getCacheFiles()? Thanks, Omkar Joshi, Hortonworks Inc. http://www.hortonworks.com

On Wed, Jul 10, 2013 at 10:15 AM, Botelho, Andrew andrew.bote...@emc.com wrote: Hi, I am trying to store a file in the Distributed Cache during my Hadoop job. In the driver class, I tell the job to store the file in the cache with this code:

Job job = Job.getInstance();
job.addCacheFile(new URI("file name"));

That all compiles fine. In the Mapper code, I try accessing the cached file with this method:

Path[] localPaths = context.getLocalCacheFiles();

However, I am getting warnings that this method is deprecated. Does anyone know the newest way to access cached files in the Mapper code? (I am using Hadoop 2.0.5) Thanks in advance, Andrew
Re: Cloudera links and Document
Sathish, this mailing list is for Apache Hadoop related questions. Please post questions related to other distributions to the appropriate vendor's mailing list. -- http://hortonworks.com/download/
How are 'PHYSICAL_MEMORY_BYTES' and 'VIRTUAL_MEMORY_BYTES' calculated?
Hello, I am wondering how the memory counters 'PHYSICAL_MEMORY_BYTES' and 'VIRTUAL_MEMORY_BYTES' are calculated. Are they peaks of memory usage or cumulative usage? Thanks for the help,
Re: Cloudera links and Document
Satish, the right alias for Cloudera Manager questions is scm-us...@cloudera.org. Thanks -- Alejandro
Re: New Distributed Cache
Yeah Andrew, there seems to be some problem with the context.getCacheFiles() API, which is returning null. In the meantime, this works:

Path[] cachedFilePaths = context.getLocalCacheFiles(); // I am checking why it is deprecated...
for (Path cachedFilePath : cachedFilePaths) {
  File cachedFile = new File(cachedFilePath.toUri().getRawPath());
  System.out.println("cached file path " + cachedFile.getAbsolutePath());
}

I hope this helps for now. JobContext was supposed to replace the DistributedCache API (which will be deprecated), however there is some problem with that, or I am missing something... Will reply if I find the solution to it.

context.getCacheFiles() will give you the URIs used for localizing the files (the original URIs used when adding them to the cache). However, you can use DistributedCache.getCacheFiles() until the context API is fixed. context.getLocalCacheFiles() will give you the actual file paths on the node manager (after the files are localized).

Thanks, Omkar Joshi, Hortonworks Inc. http://www.hortonworks.com
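Pulling the thread together, here is a minimal end-to-end sketch against the new (Hadoop 2.x) API; the HDFS path, file name, and tab-separated format are hypothetical, and it uses the deprecated getLocalCacheFiles() as Omkar suggests until getCacheFiles() behaves:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CachedLookupMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Deprecated in 2.x but still functional: local paths of the localized copies.
    Path[] localPaths = context.getLocalCacheFiles();
    if (localPaths == null || localPaths.length == 0) {
      throw new IOException("no files found in the distributed cache");
    }
    BufferedReader reader =
        new BufferedReader(new FileReader(localPaths[0].toUri().getRawPath()));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\t", 2); // assumes key<TAB>value lines
        if (parts.length == 2) {
          lookup.put(parts[0], parts[1]);
        }
      }
    } finally {
      reader.close();
    }
  }
  // map() would then consult lookup as usual...
}

// In the driver:
//   Job job = Job.getInstance(new Configuration());
//   job.addCacheFile(new URI("/user/andrew/lookup.txt")); // hypothetical HDFS path
//   job.setMapperClass(CachedLookupMapper.class);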
Re: copy files from ftp to hdfs in parallel, distcp failed
http://hadoop.apache.org/docs/stable/distcp.html says: "DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting." So I doubt this is going to help. Are these a lot of files? If yes, how about multiple copy jobs to HDFS? -balaji
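Before giving up on distcp, one thing that may be worth checking: as far as I recall, Hadoop's FTPFileSystem reads its connection settings from the configuration, so the "does not exist" error can simply mean the FTP login never succeeded. A sketch for core-site.xml (hostname and credentials hypothetical; the host-suffixed key names are my assumption about this Hadoop version), with the caveat that every node running the copy must also be able to reach the FTP server:

<property>
  <name>fs.ftp.host</name>
  <value>ftp.example.com</value>
</property>
<property>
  <name>fs.ftp.user.ftp.example.com</name>
  <value>username</value>
</property>
<property>
  <name>fs.ftp.password.ftp.example.com</name>
  <value>passwd</value>
</property>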
Re: CompositeInputFormat
Map-side joins will use the CompositeInputFormat. They will only really be worth doing if one data set is small, and the other is large. This is a good example: http://www.congiu.com/joins-in-hadoop-using-compositeinputformat/ ; the trick is to google for CompositeInputFormat.compose() :)

On Thu, Jul 11, 2013 at 5:02 PM, Botelho, Andrew andrew.bote...@emc.com wrote: Hi, I want to perform a JOIN on two sets of data with Hadoop. I read that the class CompositeInputFormat can be used to perform joins on data, but I can't find any examples of how to do it. Could someone help me out? It would be much appreciated. :) Thanks in advance, Andrew

-- Jay Vyas http://jayunit100.blogspot.com
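For reference, the pattern in that article boils down to roughly this (a sketch with the old mapred API; the paths are hypothetical, and both inputs must already be sorted by key and partitioned identically):

// Classes come from org.apache.hadoop.mapred and org.apache.hadoop.mapred.join.
JobConf conf = new JobConf(MyJoinJob.class);   // MyJoinJob is hypothetical
conf.setInputFormat(CompositeInputFormat.class);
conf.set("mapred.join.expr", CompositeInputFormat.compose(
    "inner",                                   // also "outer" or "override"
    KeyValueTextInputFormat.class,
    new Path("/data/left"), new Path("/data/right")));
// The mapper then receives a Text key plus a TupleWritable carrying
// one value per joined source.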
RE: CompositeInputFormat
Sorry, I should've specified that I need an example of CompositeInputFormat that uses the new API. The example linked below uses old API objects like JobConf. Any known examples of CompositeInputFormat using the new API? Thanks in advance, Andrew
Staging directory ENOTDIR error.
Hi, I'm getting an ungoogleable exception; never seen this before. This is on a Hadoop 1.1 cluster. It appears that it's permissions related... Any thoughts as to how this could crop up? I assume it's a bug in my filesystem, but I'm not sure.

13/07/11 18:39:43 ERROR security.UserGroupInformation: PriviledgedActionException as:root cause:ENOTDIR: Not a directory
ENOTDIR: Not a directory
    at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
    at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:699)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:654)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)

-- Jay Vyas http://jayunit100.blogspot.com
Re: Issues Running Hadoop 1.1.2 on multi-node cluster
I figured out the issue! The problem was with the permissions to run the Hadoop scripts as the root user. I created a dedicated hadoop user to run the Hadoop cluster, but at one point I accidentally started Hadoop as root, so the ownership of some of the Hadoop scripts changed. The solution is to change the ownership of the hadoop folder back to the dedicated user using chown. It's working fine now. Thanks a lot for the pointers!

Regards, Siddharth

On Thu, Jul 11, 2013 at 1:43 AM, Ram pramesh...@gmail.com wrote: Hi, please check that all the directories/files configured in mapred-site.xml exist on the local system, with permissions for user mapred and group hadoop. From, P.Ramesh Babu, +91-7893442722.

On Wed, Jul 10, 2013 at 9:36 PM, Leonid Fedotov lfedo...@hortonworks.com wrote: Make sure your mapred.local.dir (check it in mapred-site.xml) actually exists and is writable by your mapreduce user. Thank you! Sincerely, Leonid Fedotov

On Jul 9, 2013, at 6:09 PM, Kiran Dangeti wrote: Hi Siddharth, when running multi-node we need to take care of the host entries on the slave machines; from the error messages, the tasktracker is not able to get the system directory from the master. Please check and rerun it. Thanks, Kiran

On Tue, Jul 9, 2013 at 10:26 PM, siddharth mathur sidh1...@gmail.com wrote: Hi, I have installed Hadoop 1.1.2 on a 5-node cluster. I installed it following this tutorial: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

When I start up Hadoop, I get the following error in all the tasktrackers:

2013-07-09 12:15:22,301 INFO org.apache.hadoop.mapred.UserLogCleaner: Adding job_201307051203_0001 for user-log deletion with retainTimeStamp:1373472921775
2013-07-09 12:15:22,301 INFO org.apache.hadoop.mapred.UserLogCleaner: Adding job_201307051611_0001 for user-log deletion with retainTimeStamp:1373472921775
2013-07-09 12:15:22,601 INFO org.apache.hadoop.mapred.TaskTracker: Failed to get system directory ...
2013-07-09 12:15:25,164 INFO org.apache.hadoop.mapred.TaskTracker: Failed to get system directory...
2013-07-09 12:15:27,901 INFO org.apache.hadoop.mapred.TaskTracker: Failed to get system directory...
2013-07-09 12:15:30,144 INFO org.apache.hadoop.mapred.TaskTracker: Failed to get system directory...

But everything looks fine in the web UI. When I run a job, I get the following error, but the job completes anyway. I have attached screenshots of the failed map task's error log in the UI.
13/07/09 12:29:37 INFO input.FileInputFormat: Total input paths to process : 2
13/07/09 12:29:37 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/07/09 12:29:37 WARN snappy.LoadSnappy: Snappy native library not loaded
13/07/09 12:29:37 INFO mapred.JobClient: Running job: job_201307091215_0001
13/07/09 12:29:38 INFO mapred.JobClient:  map 0% reduce 0%
13/07/09 12:29:41 INFO mapred.JobClient: Task Id : attempt_201307091215_0001_m_01_0, Status : FAILED
Error initializing attempt_201307091215_0001_m_01_0:
ENOENT: No such file or directory
    at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
    at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:699)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:654)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
    at org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:240)
    at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:205)
    at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1331)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
    at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1306)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1221)
    at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2581)
    at java.lang.Thread.run(Thread.java:724)
13/07/09 12:29:41 WARN mapred.JobClient: Error reading task output http://dmkd-1:50060/tasklog?plaintext=true&attemptid=attempt_201307091215_0001_m_01_0&filter=stdout
13/07/09 12:29:41 WARN mapred.JobClient: Error reading task output http://dmkd-1:50060/tasklog?plaintext=true&attemptid=attempt_201307091215_0001_m_01_0&filter=stderr
13/07/09 12:29:45 INFO mapred.JobClient:  map 50% reduce 0%
13/07/09 12:29:53 INFO mapred.JobClient:  map 50% reduce 16%
13/07/09 12:30:38 INFO mapred.JobClient: Task Id : attempt_201307091215_0001_m_00_1, Status : FAILED
Error initializing attempt_201307091215_0001_m_00_1: ENOENT: No such file or
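For reference, the ownership fix Siddharth describes is typically a one-liner on each node (a sketch assuming the tutorial's hduser user, hadoop group, and /usr/local/hadoop install path):

sudo chown -R hduser:hadoop /usr/local/hadoop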
RE: CompositeInputFormat
Hi Andrew, You could make use of the hadoop-datajoin classes to perform the join, or refer to those classes to get a better idea of how to perform a join: http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-tools/hadoop-datajoin Thanks, Devaraj K
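To Andrew's actual question: the join classes were also ported to the new API under org.apache.hadoop.mapreduce.lib.join, so something along these lines should work if your Hadoop 2.x build ships that package (a sketch; the paths are hypothetical, and the same sorted/identically-partitioned constraint applies to both inputs):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;

Configuration conf = new Configuration();
// Set the join expression before creating the Job so it lands in the job conf.
conf.set("mapreduce.join.expr", CompositeInputFormat.compose(
    "inner", KeyValueTextInputFormat.class,
    new Path("/data/left"), new Path("/data/right")));
Job job = Job.getInstance(conf);
job.setInputFormatClass(CompositeInputFormat.class);
// The mapper reads Text keys and org.apache.hadoop.mapreduce.lib.join.TupleWritable values.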
RE: Staging directory ENOTDIR error.
Hi Jay, Here the client is trying to create the staging directory in the local file system, when it actually should be created in HDFS. Could you check whether you have configured fs.defaultFS on the client to point at HDFS? Thanks, Devaraj K
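That is, a sketch of the client-side core-site.xml (namenode host and port hypothetical; on a Hadoop 1.x client the key is the older fs.default.name rather than fs.defaultFS):

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode-host:8020</value>
</property>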