Re: HADOOP-2536 supports Oracle too?
Enis ... I have tried JAR-ing with the library folder as you said, but still no luck. I keep getting the same ClassNotFoundException again and again. :,(

Enis Soztutar-2 wrote: There is nothing special about the JDBC driver library. I guess that you have added the jar from the IDE (NetBeans), but did not include the necessary libraries (the JDBC driver in this case) in TableAccess.jar. The standard way is to include the dependent jars in the project's jar under the lib directory. For example:
example.jar
  - META-INF
  - com/...
  - lib/postgres.jar
  - lib/abc.jar
If your classpath is correct, check whether you call DBConfiguration.configureDB() with the correct driver class and URL.

sandhiya wrote: Hi, I'm using PostgreSQL and the driver is not getting detected. How do you run it in the first place? I just typed bin/hadoop jar /root/sandy/netbeans/TableAccess/dist/TableAccess.jar at the terminal, without the quotes. I didn't copy any files from my local drives into the Hadoop file system. I get an error like this: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.postgresql.Driver and then the complete stack trace. Am I doing something wrong? I downloaded a jar file for PostgreSQL JDBC support and included it in my Libraries folder (I'm using NetBeans). Please help.

Fredrik Hedberg-3 wrote: Hi, although it's not MySQL, this might be of use: http://svn.apache.org/repos/asf/hadoop/core/trunk/src/examples/org/apache/hadoop/examples/DBCountPageView.java Fredrik

On Feb 16, 2009, at 8:33 AM, sandhiya wrote: @Amandeep Hi, I'm new to Hadoop and am trying to run a simple database connectivity program on it. Could you please tell me how you went about it? My mail id is sandys_cr...@yahoo.com. A copy of your code that successfully connected to MySQL would also be helpful. Thanks, Sandhiya

Enis Soztutar-2 wrote: From the exception java.io.IOException: ORA-00933: SQL command not properly ended I would broadly guess that the Oracle JDBC driver might be complaining that the statement does not end with ;, or something similar. You can: 1. download the latest source code of Hadoop, 2. add a print statement printing the query (probably in DBInputFormat:119), 3. build the Hadoop jar, 4. use the new Hadoop jar to see the actual SQL query, 5. run the query on Oracle to see if it gives an error. Enis

Amandeep Khurana wrote: Ok. I created the same database in a MySQL database and ran the same Hadoop job against it. It worked. So that means there is some Oracle-specific issue. It can't be an issue with the JDBC drivers since I am using the same drivers in a simple JDBC client. What could it be? Amandeep

On Wed, Feb 4, 2009 at 10:26 AM, Amandeep Khurana ama...@gmail.com wrote: Ok. I'm not sure if I got it correct. Are you saying I should test the statement that Hadoop creates directly against the database? Amandeep

On Wed, Feb 4, 2009 at 7:13 AM, Enis Soztutar enis@gmail.com wrote: Hadoop-2536 connects to the DB via JDBC, so in theory it should work with proper JDBC drivers. It has been tested against MySQL, HSQLDB, and PostgreSQL, but not Oracle. To answer your earlier question, the actual SQL statements might not be recognized by Oracle, so I suggest the best way to test this is to insert print statements and run the actual SQL statements against Oracle to see if the syntax is accepted. We would appreciate it if you publish your results. Enis

Amandeep Khurana wrote: Does the patch HADOOP-2536 support connecting to Oracle databases as well? Or is it just limited to MySQL? Amandeep Khurana, Computer Science Graduate Student, University of California, Santa Cruz

-- View this message in context: http://www.nabble.com/HADOOP-2536-supports-Oracle-too--tp21823199p22073986.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
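For readers hitting the same wall, here is a rough, self-contained sketch of the configureDB()/setInput() calls Enis refers to, using the PostgreSQL driver class from the exception above. The JDBC URL, database, table and column names are invented for illustration, and the record class is only a minimal stub, not anything taken from the thread:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    public class TableAccessSketch {

      // Minimal stub record; a real job would map its own columns here.
      static class MyRecord implements Writable, DBWritable {
        long id;
        String name;
        public void readFields(DataInput in) throws IOException { id = in.readLong(); name = Text.readString(in); }
        public void write(DataOutput out) throws IOException { out.writeLong(id); Text.writeString(out, name); }
        public void readFields(ResultSet rs) throws SQLException { id = rs.getLong(1); name = rs.getString(2); }
        public void write(PreparedStatement st) throws SQLException { st.setLong(1, id); st.setString(2, name); }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TableAccessSketch.class);
        // The driver class and URL must match the JDBC jar that is actually on the task classpath.
        DBConfiguration.configureDB(conf,
            "org.postgresql.Driver",                   // driver class from the ClassNotFoundException
            "jdbc:postgresql://dbhost:5432/mydb");     // hypothetical connection URL
        DBInputFormat.setInput(conf, MyRecord.class, "mytable",
            null /* conditions */, "id" /* orderBy */, new String[] { "id", "name" });
        // ... set the mapper, output format etc., then submit with JobClient.runJob(conf).
      }
    }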
Re: HADOOP-2536 supports Oracle too?
It should either be in the jar or in the lib folder in the Hadoop installation. If neither of them works, check the jar that you are including.

Amandeep Khurana, Computer Science Graduate Student, University of California, Santa Cruz

On Wed, Feb 18, 2009 at 12:08 AM, sandhiya sandhiy...@gmail.com wrote: Enis ... I have tried JAR-ing with the library folder as you said, but still no luck. I keep getting the same ClassNotFoundException again and again.
Re: HADOOP-2536 supports Oracle too?
Thanks a million!!! It worked, but it's a little weird though: I have to put the library with the JDBC jars in BOTH the executable jar file AND the lib folder in $HADOOP_HOME. Do all of you do the same thing, or is it just my computer acting strange? Anyway, thanks for the help. :clap:

Amandeep Khurana wrote: It should either be in the jar or in the lib folder in the Hadoop installation. If neither of them works, check the jar that you are including.
Re: Allowing other system users to use Hadoop
Nicholas, like Matei said, there are two possibilities in terms of permissions (every permissions command works just like in Linux):

1. Create a directory for the user and make the user the owner of that directory: hadoop dfs -chown ... (assuming Hadoop doesn't need write access to any file outside the user's home directory).

2. Change the group ownership of all files in HDFS to a group that every user belongs to (hadoop dfs -chgrp -R groupname /), then give that group write access (hadoop dfs -chmod -R g+w /), again on all files. (Here, whenever a user runs jobs, Hadoop automatically creates a separate home directory for them.) This way is better for a development environment, I think.

Cheers, Rasit

2009/2/18 Matei Zaharia ma...@cloudera.com: Other users should be able to submit jobs using the same commands (bin/hadoop ...). Are there errors you ran into? One thing is that you'll need to grant them permissions over any files in HDFS that you want them to read. You can do it using bin/hadoop fs -chmod, which works like chmod on Linux. You may need to run this as the root user (sudo bin/hadoop fs -chmod). Also, I don't remember exactly, but you may need to create home directories for them in HDFS as well (again, create them as root, and then sudo bin/hadoop fs -chown them).

On Tue, Feb 17, 2009 at 10:48 AM, Nicholas Loulloudes loulloude...@cs.ucy.ac.cy wrote: Hi all, I just installed Hadoop (single node) on a Linux Ubuntu distribution as per the instructions found on the following website: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) I followed the instructions on the website to create a hadoop system user and group, and I was able to run a MapReduce job successfully. What I want to do now is to create more system users which will be able to use Hadoop for running MapReduce jobs. Is there any guide on how to achieve this? Any suggestions will be highly appreciated. Thanks in advance, Nicholas Loulloudes, High Performance Computing Systems Laboratory (HPCL), University of Cyprus, Nicosia, Cyprus

-- M. Raşit ÖZDAŞ
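If it is more convenient to do this from Java than from the shell, the same steps can be expressed with the FileSystem API. A small sketch, run as the HDFS superuser; the user name "alice" and the 0755 mode are just examples, not values from the thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class CreateUserHome {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads hadoop-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path home = new Path("/user/alice");             // hypothetical new user
        fs.mkdirs(home);                                 // like: hadoop fs -mkdir /user/alice
        fs.setOwner(home, "alice", "alice");             // like: hadoop fs -chown alice:alice /user/alice
        fs.setPermission(home, new FsPermission((short) 0755)); // like: hadoop fs -chmod 755 /user/alice
        fs.close();
      }
    }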
GenericOptionsParser warning
Hi All, I prepare my JobConf object in a Java class by calling various set APIs on the JobConf object. When I submit the JobConf object using JobClient.runJob(conf), I'm seeing the warning: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. From the Hadoop sources it looks like setting mapred.used.genericoptionsparser will prevent this warning, but if I set this flag to true, will it have any other side effects? Thanks, Sandhya
Re: Hadoop User Group UK Meetup - April 14th
Registrations for the next Hadoop User Group UK meetup have now opened: http://huguk.eventwax.com/hadoop-user-group-uk-2

The preliminary schedule:
10.00 – 10.15: Arriving and chatting
10.15 – 11.15: Practical MapReduce (Tom White, Cloudera)
11.15 – 12.15: Introducing Apache Mahout (Isabel Drost, ASF)
12.15 – 13.15: Lunch
13.15 – 14.15: Terrier (Iadh Ounis and Craig Macdonald, University of Glasgow)
14.15 – 15.15: Having Fun with PageRank and MapReduce (Paolo Castagna, HP)
15.15 – 16.15: Apache HBase (Michael Stack, Powerset)
16.15 – 17.00: General chat, perhaps lightning talks (powered by Sun beer)
17.00 – 00.00: Discussion continues at a nearby pub

The event is hosted by Sun in London, near Monument station. For more details see the event page or the blog: http://huguk.org/ /Johan

Johan Oskarsson wrote: I've started organizing the next Hadoop meetup in London, UK. The date is April 14th and the presentations so far include: Michael Stack (Powerset): Apache HBase; Isabel Drost (Neofonie): Introducing Apache Mahout; Iadh Ounis and Craig Macdonald (University of Glasgow): Terrier; Paolo Castagna (HP): Having Fun with PageRank and MapReduce. Keep an eye on the blog for updates: http://huguk.org/ Help in the form of sponsoring (venue, beer etc.) would be much appreciated. Also let me know if you want to present. Personally I'd love to see presentations from other Hadoop-related projects (Pig, Hive, Hama etc.). /Johan
Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
I'm having trouble overriding the maximum number of map tasks that run on a given machine in my cluster. The default value of mapred.tasktracker.map.tasks.maximum is set to 2 in hadoop-default.xml. When running my job I passed -jobconf mapred.tasktracker.map.tasks.maximum=1 to limit map tasks to one per machine, but each machine was still allocated 2 map tasks (simultaneously). The only way I was able to guarantee a maximum of one map task per machine was to change the value of the property in hadoop-site.xml. This is unsatisfactory since I'll often be changing the maximum on a per-job basis. Any hints?

On a different note, when I attempt to pass params via -D I get a usage message; when I use -jobconf the command goes through (and works in the case of mapred.reduce.tasks=0, for example), but I get a deprecation warning.

Thanks, John
Re: GenericOptionsParser warning
Sandhya E wrote: Hi All, I prepare my JobConf object in a Java class by calling various set APIs on the JobConf object. When I submit the JobConf object using JobClient.runJob(conf), I'm seeing the warning: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. From the Hadoop sources it looks like setting mapred.used.genericoptionsparser will prevent this warning, but if I set this flag to true, will it have any other side effects? Thanks, Sandhya

I've seen this message too, and it annoys me; I haven't tracked it down.
Re: GenericOptionsParser warning
Hi, there is a JIRA issue about this problem, if I understand it correctly: https://issues.apache.org/jira/browse/HADOOP-3743

Strangely, when I searched all the source code, this check exists in only two places:

    if (!(job.getBoolean("mapred.used.genericoptionsparser", false))) {
      LOG.warn("Use GenericOptionsParser for parsing the arguments. " +
               "Applications should implement Tool for the same.");
    }

Just an if block for logging, no extra checks. Am I missing something? If your class implements Tool, then there shouldn't be a warning.

Cheers, Rasit

2009/2/18 Steve Loughran ste...@apache.org: I've seen this message too, and it annoys me; I haven't tracked it down.

-- M. Raşit ÖZDAŞ
Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
John, did you try the -D option instead of -jobconf? I had the -D option in my code; when I changed it to -jobconf, this is what I get:

...
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-combiner <JavaClassName> Combiner has to be a Java class
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
-inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
-outputformat TextOutputFormat(default)|JavaClassName Optional.
-partitioner JavaClassName Optional.
-numReduceTasks <num> Optional.
-inputreader <spec> Optional.
-cmdenv <n>=<v> Optional. Pass env.var to streaming commands
-mapdebug <path> Optional. To run this script when a map task fails
-reducedebug <path> Optional. To run this script when a reduce task fails
-verbose

Generic options supported are:
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines

The general command line syntax is: bin/hadoop command [genericOptions] [commandOptions]

For more details about these options, use: $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info

I think -jobconf is not used in v0.19.

2009/2/18 S D sd.codewarr...@gmail.com: I'm having trouble overriding the maximum number of map tasks that run on a given machine in my cluster.

-- M. Raşit ÖZDAŞ
Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
Thanks for your response, Rasit. You may have missed a portion of my post: "On a different note, when I attempt to pass params via -D I get a usage message; when I use -jobconf the command goes through (and works in the case of mapred.reduce.tasks=0, for example), but I get a deprecation warning." I'm using Hadoop 0.19.0 and -D is not working. Are you using version 0.19.0 as well?

John

On Wed, Feb 18, 2009 at 9:14 AM, Rasit OZDAS rasitoz...@gmail.com wrote: John, did you try the -D option instead of -jobconf? I think -jobconf is not used in v0.19.
Re: Finding small subset in very large dataset
Hi, the Bloom filter solution works great, but I still have to copy the data around sometimes. I'm still wondering if I can reduce the data associated with the keys to a reference or something small (the 100 KB of associated data is quite big), which I could then use to fetch the data later in the reduce step. In the past I was using HBase to store the associated data (but unfortunately HBase proved to be very unreliable in my case). I will probably also start to compress the data in the value store, which will probably increase sorting speed (as the data there is probably uncompressed). Is there something else I could do to speed this process up?

Thanks, Thibaut

-- View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22081608.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Finding small subset in very large dataset
Just re-represent the associated data as a bit vector and a set of hash functions. You then just copy this around, rather than the raw items themselves.

Miles

2009/2/18 Thibaut_ tbr...@blue.lu: Hi, the Bloom filter solution works great, but I still have to copy the data around sometimes.

-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
The .maximum values are only loaded by the TaskTrackers at server start time at present, and any changes you make will be ignored.

2009/2/18 S D sd.codewarr...@gmail.com: I'm using Hadoop 0.19.0 and -D is not working. Are you using version 0.19.0 as well?
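To make Jason's point concrete, here is a small hedged sketch of a driver fragment (the class name is a placeholder): per-job settings on the JobConf are honoured at submission time, while the tasktracker slot maximum is only read from the config files when each TaskTracker daemon starts.

    import org.apache.hadoop.mapred.JobConf;

    public class SlotLimitExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf(SlotLimitExample.class);
        conf.setNumReduceTasks(0);   // per-job setting, honoured when the job is submitted
        // The next line compiles and runs, but the TaskTrackers never see it:
        // mapred.tasktracker.map.tasks.maximum is read once, at daemon startup.
        conf.set("mapred.tasktracker.map.tasks.maximum", "1");
      }
    }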
Re: Finding small subset in very large dataset
Hi Miles, I'm not following you. If I'm saving an associated hash or bit vector, how can I then quickly access the elements afterwards (the file with the data might be 100 GB big and is on the DFS)? I could also directly save the offset of the data in the data file as a reference, and then on each reducer read that big file only once. As all the keys are sorted, I can get all the needed values in one big read step (skipping the entries I don't need).

Thibaut

Miles Osborne wrote: Just re-represent the associated data as a bit vector and a set of hash functions. You then just copy this around, rather than the raw items themselves.

-- View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22082598.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
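If the big value file were a SequenceFile on HDFS, the offset idea could look roughly like the sketch below: record reader.getPosition() for each key on the way in, ship only that long through the shuffle, and seek back to it in the reducer. The file name and key/value types here are assumptions, not details from the thread.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class OffsetLookup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path values = new Path("/data/values.seq");   // hypothetical file holding the large values
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, values, conf);
        long savedOffset = Long.parseLong(args[0]);   // the small "reference" carried through the job
        Text key = new Text();
        BytesWritable value = new BytesWritable();
        reader.seek(savedOffset);                     // offsets must come from reader.getPosition()
        reader.next(key, value);                      // reads the record that starts at that offset
        reader.close();
      }
    }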
Re: Finding small subset in very large dataset
If I remember correctly you have two sets of data:
- set A, which is very big
- set B, which is small
and you want to find all elements of A which are in B, right?

Represent A using a variant of a Bloom filter which supports key-value pairs; a Bloomier filter will do this for you. Each mapper then loads up A (represented using the Bloomier filter) and works over B. Whenever an element of B is present in the representation, you look up the associated value and emit it.

If even using a Bloomier filter you still need too much memory, then you could store it once using Hypertable.

See here for an explanation of Bloomier filters applied to the task of storing lots of (string, probability) pairs: Randomized Language Models via Perfect Hash Functions http://aclweb.org/anthology-new/P/P08/P08-1058.pdf

Miles

2009/2/18 Thibaut_ tbr...@blue.lu: Hi Miles, I'm not following you. If I'm saving an associated hash or bit vector, how can I then quickly access the elements afterwards?

-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
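Hadoop itself only ships a plain membership Bloom filter (org.apache.hadoop.util.bloom), not the key-value Bloomier filter Miles describes, but the membership half of the idea looks roughly like this; the filter size, hash count and key strings are arbitrary examples:

    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;
    import org.apache.hadoop.util.hash.Hash;

    public class MembershipSketch {
      public static void main(String[] args) {
        // Build the filter over the small set's keys, then distribute it to the mappers
        // so records of the big set that cannot match are dropped without any lookup.
        BloomFilter filter = new BloomFilter(1 << 20, 5, Hash.MURMUR_HASH); // vector size and hash count are guesses
        filter.add(new Key("someKeyFromTheSmallSet".getBytes()));           // repeat for every key in the small set

        String candidate = "someKeyFromTheBigSet";
        boolean maybeInSmallSet = filter.membershipTest(new Key(candidate.getBytes()));
        // false => definitely not in the small set; true => probably there, verify before emitting.
        System.out.println(candidate + " possibly in small set: " + maybeInSmallSet);
      }
    }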
Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
Thanks Jason, that's useful information. Are you aware of plans to change this so that the maximum values can be changed without restarting the server?

John

2009/2/18 jason hadoop jason.had...@gmail.com: The .maximum values are only loaded by the TaskTrackers at server start time at present, and any changes you make will be ignored.
Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
I certainly hope it changes, but I am not aware that it is in the to-do queue at present.

2009/2/18 S D sd.codewarr...@gmail.com: Thanks Jason, that's useful information. Are you aware of plans to change this so that the maximum values can be changed without restarting the server?
Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
I see, John. I also use 0.19. Just to note, the -D option should come first, since it's one of the generic options. I use it without any errors.

Cheers, Rasit

2009/2/18 S D sd.codewarr...@gmail.com: I'm using Hadoop 0.19.0 and -D is not working. Are you using version 0.19.0 as well?

-- M. Raşit ÖZDAŞ
Getting Started with AIX machines
I am attempting my first steps learning Hadoop on top of an AIX machine. I have followed the installation description: http://hadoop.apache.org/core/docs/r0.19.0/quickstart.html The stand-alone mode worked just fine. However, I am failing when trying to execute the pseudo-distributed mode. I have carried out the following steps:

1. update conf/hadoop-site.xml
2. exec bin/hadoop namenode -format
3. exec bin/start-all.sh
4. exec bin/hadoop fs -put conf input

- The execution of step 2 (formatting the NameNode) was successful, corresponding to the expected result also shown in http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)

- The execution of step 3 (starting the single-node servers) seems to be OK, although the output is not similar to the one produced on Ubuntu Linux; it seems that the localhost shell is exited:

starting namenode, logging to /usr/hadoop/hadoop-0.19.0/bin/../logs/hadoop-hdpuser-namenode-rcc-hrl-lpar-020.haifa.ibm.com.out
localhost: starting datanode, logging to /usr/hadoop/hadoop-0.19.0/bin/../logs/hadoop-hdpuser-datanode-rcc-hrl-lpar-020.haifa.ibm.com.out
localhost: Hasta la vista, baby  <== IT SEEMS that the localhost shell exits
localhost: starting secondarynamenode, logging to /usr/hadoop/hadoop-0.19.0/bin/../logs/hadoop-hdpuser-secondarynamenode-rcc-hrl-lpar-020.haifa.ibm.com.out
localhost: Hasta la vista, baby  <== IT SEEMS that the localhost shell exits
starting jobtracker, logging to /usr/hadoop/hadoop-0.19.0/bin/../logs/hadoop-hdpuser-jobtracker-rcc-hrl-lpar-020.haifa.ibm.com.out
localhost: starting tasktracker, logging to /usr/hadoop/hadoop-0.19.0/bin/../logs/hadoop-hdpuser-tasktracker-rcc-hrl-lpar-020.haifa.ibm.com.out
localhost: Hasta la vista, baby  <== IT SEEMS that the localhost shell exits

- The execution of step 4 fails; no data is copied to the DFS input directory, and I receive this exception:

09/02/18 12:14:24 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hdpuser/input/masters could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1270) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37) at java.lang.reflect.Method.invoke(Method.java:599) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892) at org.apache.hadoop.ipc.Client.call(Client.java:696) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at $Proxy0.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37) at java.lang.reflect.Method.invoke(Method.java:599) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy0.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697) at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183) 09/02/18 12:14:24 WARN hdfs.DFSClient: NotReplicatedYetException sleeping /user/hdpuser/input/masters retries left 4 . . 09/02/18 12:14:30 WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null 09/02/18 12:14:30 WARN hdfs.DFSClient: Could not get block locations. Aborting... put: java.io.IOException: File /user/hdpuser/input/masters could only be replicated to 0 nodes, instead of 1 Exception closing file /user/hdpuser/input/masters java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198) at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053) at
RE: Getting Started with AIX machines
Refer to the following fix; Hadoop will not work under AIX without it. https://issues.apache.org/jira/browse/HADOOP-4546

Bill

-----Original Message----- From: work.av...@gmail.com [mailto:work.av...@gmail.com] On Behalf Of Aviad sela Sent: Wednesday, February 18, 2009 12:14 PM To: Hadoop Users Support Subject: Getting Started with AIX machines

I am attempting my first steps learning Hadoop on top of an AIX machine. I have followed the installation description: http://hadoop.apache.org/core/docs/r0.19.0/quickstart.html
Disabling Reporter Output?
I am currently trying Map/Reduce in Eclipse. The input comes from an HBase table. The performance of my jobs is terrible: even when run on only a single row, it takes around 10 seconds to complete the job. My current guess is that the reporting done to the Eclipse console might play a role here. I am looking for a way to disable the printing of status to the console, or of course any other ideas about what is going wrong here. This is a single-node cluster on pretty common desktop hardware, and writing to HBase is a breeze.

Thanks, Philipp
Re: Disabling Reporter Output?
There is a moderate amount of setup and teardown in any Hadoop job. It may be that your 10 seconds are primarily that.

On Wed, Feb 18, 2009 at 11:29 AM, Philipp Dobrigkeit pdobrigk...@gmx.de wrote: I am currently trying Map/Reduce in Eclipse. The input comes from an HBase table. The performance of my jobs is terrible: even when run on only a single row, it takes around 10 seconds to complete the job.
Re: Hadoop Write Performance
What is the Hadoop version? You could check the log on a datanode around that time and post any suspicious errors. For example, you can trace a particular block in the client and datanode logs. Most likely it is not a NameNode issue, but you can check the NameNode log as well. Raghu. Xavier Stevens wrote: Does anyone have an expected or experienced write speed to HDFS outside of Map/Reduce? Any recommendations on properties to tweak in hadoop-site.xml? Currently I have a multi-threaded writer where each thread is writing to a different file. But after a while I get this: java.io.IOException: Could not get block locations. Aborting... at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2081) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818) Which is perhaps indicating that the namenode is overwhelmed? Thanks, -Xavier
RE: Hadoop Write Performance
Raghu, I was using 0.17.2.1, but I installed 0.18.3 a couple of days ago. I also separated out my secondarynamenode and jobtracker to another machine. In addition, my network operations people had misconfigured some switches, which ended up being my bottleneck. After all of that, my writer and Hadoop are working great. -Xavier -----Original Message----- From: Raghu Angadi [mailto:rang...@yahoo-inc.com] Sent: Wednesday, February 18, 2009 11:49 AM To: core-user@hadoop.apache.org Subject: Re: Hadoop Write Performance What is the Hadoop version? You could check the log on a datanode around that time and post any suspicious errors. For example, you can trace a particular block in the client and datanode logs. Most likely it is not a NameNode issue, but you can check the NameNode log as well. Raghu. Xavier Stevens wrote: Does anyone have an expected or experienced write speed to HDFS outside of Map/Reduce? Any recommendations on properties to tweak in hadoop-site.xml? Currently I have a multi-threaded writer where each thread is writing to a different file. But after a while I get this: java.io.IOException: Could not get block locations. Aborting... at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2081) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818) Which is perhaps indicating that the namenode is overwhelmed? Thanks, -Xavier
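For anyone who wants to measure raw HDFS write throughput outside of MapReduce, a minimal sketch against the 0.18-era FileSystem API is below; the path and sizes are placeholders, and a multi-threaded writer would simply run the same loop per thread, each with its own file.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteTest {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();          // picks up hadoop-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args.length > 0 ? args[0] : "/tmp/write-test");
    byte[] buf = new byte[64 * 1024];
    long total = 256L * 1024 * 1024;                   // write 256 MB
    long start = System.currentTimeMillis();
    FSDataOutputStream stream = fs.create(out, true);  // overwrite if the file exists
    try {
      for (long written = 0; written < total; written += buf.length) {
        stream.write(buf);
      }
    } finally {
      stream.close();
    }
    long ms = System.currentTimeMillis() - start;
    System.out.println("Wrote " + total + " bytes in " + ms + " ms");
  }
}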
Problems getting Eclipse Hadoop plugin to work.
I'm using Eclipse 3.3.2 and want to view my remote cluster using the Hadoop plugin. Everything shows up and I can see the map/reduce perspective but when trying to connect to a location I get: Error: Call failed on local exception I've set the host to for example xx0, where xx0 is a remote machine accessible from the terminal, and the ports to 50020/50040 for M/R master and DFS master respectively. Is there anything I'm missing to set for remote access to the Hadoop cluster? Regards Erik
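One common cause of "Call failed on local exception" with the plugin is pointing it at the wrong ports: the DFS master field should use the NameNode RPC host/port from fs.default.name and the Map/Reduce master field the JobTracker host/port from mapred.job.tracker, rather than the 500xx daemon ports. As an illustration of what the cluster's conf/hadoop-site.xml might contain (the xx0 host and the port numbers are only placeholders):

<property>
  <name>fs.default.name</name>
  <value>hdfs://xx0:9000</value>    <!-- DFS master: host xx0, port 9000 -->
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>xx0:9001</value>           <!-- Map/Reduce master: host xx0, port 9001 -->
</property>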
Re: GenericOptionsParser warning
You should put this stub code in your program as the means to start your MapReduce job (imports added for completeness):

import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Foo extends Configured implements Tool {

  public int run(String[] args) throws IOException {
    // Build the JobConf from the Configuration that ToolRunner has already
    // populated from the generic options (-D, -fs, -jt, -files, ...).
    JobConf conf = new JobConf(getConf(), Foo.class);
    // run the job here.
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int ret = ToolRunner.run(new Foo(), args); // calls your run() method.
    System.exit(ret);
  }
}

On Wed, Feb 18, 2009 at 7:09 AM, Rasit OZDAS rasitoz...@gmail.com wrote: Hi, There is a JIRA issue about this problem, if I understand it correctly: https://issues.apache.org/jira/browse/HADOOP-3743 Strange that I searched all the source code, but there exists only this check, in 2 places:

if (!(job.getBoolean("mapred.used.genericoptionsparser", false))) {
  LOG.warn("Use GenericOptionsParser for parsing the arguments. " +
           "Applications should implement Tool for the same.");
}

Just an if block for logging, no extra controls. Am I missing something? If your class implements Tool, then there shouldn't be a warning. Cheers, Rasit 2009/2/18 Steve Loughran ste...@apache.org Sandhya E wrote: Hi All I prepare my JobConf object in a java class, by calling various set APIs on the JobConf object. When I submit the JobConf object using JobClient.runJob(conf), I'm seeing the warning: "Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same." From the Hadoop sources it looks like setting mapred.used.genericoptionsparser will prevent this warning. But if I set this flag to true, will it have some other side effects? Thanks Sandhya Seen this message too - and it annoys me; not tracked it down -- M. Raşit ÖZDAŞ
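Once the job goes through ToolRunner as above, GenericOptionsParser handles the generic options (-D, -conf, -fs, -jt, -files, -libjars) before your own arguments and the warning goes away. A hypothetical invocation, assuming the class above is packaged in foo.jar and that input/output are your job's own arguments, might look like:

bin/hadoop jar foo.jar Foo -D mapred.reduce.tasks=2 input output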
Persistent completed jobs status not showing in jobtracker UI
I have enabled persistent completed jobs status and can see them in HDFS. However, they are not listed in the jobtracker's UI after the jobtracker is restarted. I thought that the jobtracker would automatically look in HDFS if it does not find a job in its memory cache. What am I missing? How do I retrieve the persistent completed job status? Bill
the question about the common pc?
Hi: The documentation says that Hadoop distributes the data and processing across clusters of commonly available computers. But what does "commonly available computers" mean? A 1U server? Or the PCs that people use daily on Windows? -- View this message in context: http://www.nabble.com/the-question-about-the-common-pc--tp22092022p22092022.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: the question about the common pc?
And, are the nodes the PCs that people use daily on Windows, or 1U servers? buddha1021 wrote: Hi: The documentation says that Hadoop distributes the data and processing across clusters of commonly available computers. But what does "commonly available computers" mean? A 1U server? Or the PCs that people use daily on Windows? -- View this message in context: http://www.nabble.com/the-question-about-the-common-pc--tp22092022p22092038.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
7Zip compression in Hadoop
Hi all! I'm working on Sogou Corpus mining with Hadoop MapReduce. However, the files are compressed in 7zip format. Does Hadoop have built-in support for 7zip files, or do I need to write a codec? Regards Song Liu in Suzhou University, China.
RE: 7Zip compression in Hadoop
No, you will need to write one yourself. Zheng -----Original Message----- From: 柳松 [mailto:lamfeel...@126.com] Sent: Wednesday, February 18, 2009 6:19 PM To: core-user@hadoop.apache.org Subject: 7Zip compression in Hadoop Hi all! I'm working on Sogou Corpus mining with Hadoop MapReduce. However, the files are compressed in 7zip format. Does Hadoop have built-in support for 7zip files, or do I need to write a codec? Regards Song Liu in Suzhou University, China.
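If you do write a codec, the skeleton below sketches what an implementation of org.apache.hadoop.io.compress.CompressionCodec looked like around Hadoop 0.18/0.19; the SevenZipCodec name and package placement are hypothetical, the actual LZMA/7z stream handling still has to be supplied where marked, and the method set should be checked against the Hadoop version in use. Bear in mind that .7z is an archive container rather than a plain compression stream, so pre-extracting the corpus and re-compressing it with gzip before loading it into HDFS is often the simpler route. A custom codec is registered by adding its class name to the io.compression.codecs property.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.Decompressor;

// Hypothetical skeleton: only the structure is shown, not working 7z decoding.
public class SevenZipCodec implements CompressionCodec {

  public CompressionOutputStream createOutputStream(OutputStream out) throws IOException {
    throw new UnsupportedOperationException("7z compression not implemented");
  }

  public CompressionOutputStream createOutputStream(OutputStream out, Compressor compressor)
      throws IOException {
    return createOutputStream(out);
  }

  public Class<? extends Compressor> getCompressorType() {
    return null; // no reusable Compressor implementation
  }

  public Compressor createCompressor() {
    return null;
  }

  public CompressionInputStream createInputStream(InputStream in) throws IOException {
    // TODO: wrap 'in' with an LZMA/7z decoding stream and adapt it to CompressionInputStream.
    throw new UnsupportedOperationException("7z decompression not implemented yet");
  }

  public CompressionInputStream createInputStream(InputStream in, Decompressor decompressor)
      throws IOException {
    return createInputStream(in);
  }

  public Class<? extends Decompressor> getDecompressorType() {
    return null; // no reusable Decompressor implementation
  }

  public Decompressor createDecompressor() {
    return null;
  }

  public String getDefaultExtension() {
    return ".7z"; // lets input formats pick this codec by file extension
  }
}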
Sogou Corpus Decoder/Codec for Hadoop
Dear all! Can anyone provide me a decoder or codec for the Sogou Corpus? I'm analyzing the Sogou Corpus using Hadoop, but I cannot decode the .7z files. I have tried LZMA, but I don't know why it is not able to uncompress and decode the Sogou Corpus. If there is someone who, like me, is analysing this huge internet corpus, please let me know and help me figure out this problem! Thanks Song Liu in Suzhou University, China.
Re: Persistent completed jobs status not showing in jobtracker UI
Bill Au wrote: I have enabled persistent completed jobs status and can see them in HDFS. However, they are not listed in the jobtracker's UI after the jobtracker is restarted. I thought that the jobtracker would automatically look in HDFS if it does not find a job in its memory cache. What am I missing? How do I retrieve the persistent completed job status? Bill The JobTracker web UI doesn't look at persistent storage after a restart. You can access the old jobs from the job history. The History link is accessible from the web UI. -Amareshwari
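For programmatic access (rather than the web UI), the persisted status can also be fetched through the JobClient once the jobtracker is back up, since with persistence enabled the jobtracker is meant to fall back to the persisted store for jobs it no longer holds in memory. A sketch against the 0.19 API, with a made-up job ID, assuming mapred.job.tracker.persist.jobstatus.active was true when the job ran:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class CompletedJobStatus {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();                        // reads hadoop-site.xml from the classpath
    JobClient client = new JobClient(conf);
    // Placeholder job id; pass the real one as the first argument.
    JobID id = JobID.forName(args.length > 0 ? args[0] : "job_200902190000_0001");
    RunningJob job = client.getJob(id);
    if (job == null) {
      System.out.println("JobTracker has no record of " + id);
    } else {
      System.out.println(id + " complete=" + job.isComplete()
          + " successful=" + job.isSuccessful());
    }
  }
}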
Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
Yes. The configuration is read only when the taskTracker starts. You can see more discussion on jira HADOOP-5170 (http://issues.apache.org/jira/browse/HADOOP-5170) for making it per job. -Amareshwari jason hadoop wrote: I certainly hope it changes but I am unaware that it is in the todo queue at present. 2009/2/18 S D sd.codewarr...@gmail.com Thanks Jason. That's useful information. Are you aware of plans to change this so that the maximum values can be changed without restarting the server? John 2009/2/18 jason hadoop jason.had...@gmail.com The .maximum values are only loaded by the Tasktrackers at server start time at present, and any changes you make will be ignored. 2009/2/18 S D sd.codewarr...@gmail.com Thanks for your response Rasit. You may have missed a portion of my post. On a different note, when I attempt to pass params via -D I get a usage message; when I use -jobconf the command goes through (and works in the case of mapred.reduce.tasks=0 for example) but I get a deprecation warning. I'm using Hadoop 0.19.0 and -D is not working. Are you using version 0.19.0 as well? John On Wed, Feb 18, 2009 at 9:14 AM, Rasit OZDAS rasitoz...@gmail.com wrote: John, did you try the -D option instead of -jobconf? I had the -D option in my code, I changed it to -jobconf, and this is what I get: ... ...

Options:
  -input <path>                 DFS input file(s) for the Map step
  -output <path>                DFS output directory for the Reduce step
  -mapper <cmd|JavaClassName>   The streaming command to run
  -combiner <JavaClassName>     Combiner has to be a Java class
  -reducer <cmd|JavaClassName>  The streaming command to run
  -file <file>                  File/dir to be shipped in the Job jar file
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName  Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
  -partitioner JavaClassName    Optional.
  -numReduceTasks <num>         Optional.
  -inputreader <spec>           Optional.
  -cmdenv <n>=<v>               Optional. Pass env.var to streaming commands
  -mapdebug <path>              Optional. To run this script when a map task fails
  -reducedebug <path>           Optional. To run this script when a reduce task fails
  -verbose

Generic options supported are
  -conf <configuration file>    specify an application configuration file
  -D <property=value>           use value for given property
  -fs <local|namenode:port>     specify a namenode
  -jt <local|jobtracker:port>   specify a job tracker
  -files <comma separated list of files>        specify comma separated files to be copied to the map reduce cluster
  -libjars <comma separated list of jars>       specify comma separated jar files to include in the classpath.
  -archives <comma separated list of archives>  specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is bin/hadoop command [genericOptions] [commandOptions] For more details about these options: Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info

I think -jobconf is not used in v.0.19. 2009/2/18 S D sd.codewarr...@gmail.com I'm having trouble overriding the maximum number of map tasks that run on a given machine in my cluster. The default value of mapred.tasktracker.map.tasks.maximum is set to 2 in hadoop-default.xml. When running my job I passed -jobconf mapred.tasktracker.map.tasks.maximum=1 to limit map tasks to one per machine but each machine was still allocated 2 map tasks (simultaneously). The only way I was able to guarantee a maximum of one map task per machine was to change the value of the property in hadoop-site.xml. This is unsatisfactory since I'll often be changing the maximum on a per job basis. Any hints?
On a different note, when I attempt to pass params via -D I get a usage message; when I use -jobconf the command goes through (and works in the case of mapred.reduce.tasks=0 for example) but I get a deprecation warning. Thanks, John -- M. Raşit ÖZDAŞ
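Until HADOOP-5170 makes the slot counts configurable per job, the only place mapred.tasktracker.map.tasks.maximum is honoured is the hadoop-site.xml read by each tasktracker at daemon start-up. A sketch of the entry (the value 1 is just the example from this thread), followed by restarting the MapReduce daemons, e.g. with bin/stop-mapred.sh and bin/start-mapred.sh:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
  <description>Run at most one map task at a time on this tasktracker.</description>
</property>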
Re:Re: the question about the common pc?
Actually, there's a widespread misunderstanding of this "common PC". "Common PC" doesn't mean PCs which are used daily; it means that the performance of each node can be measured by a common PC's computing power. As a matter of fact, we don't use Gb Ethernet for daily PCs' communication, we don't use Linux for our document processing, and most importantly, Hadoop cannot run effectively on those daily PCs. Hadoop is designed for high-performance computing equipment, but is claimed to be fit for daily PCs. Hadoop for PCs? What a joke. -----Original Message----- From: buddha1021 buddha1...@yahoo.cn Sent: Thursday, 19 February 2009 To: core-user@hadoop.apache.org Cc: Subject: Re: the question about the common pc? And, are the nodes the PCs that people use daily on Windows, or 1U servers? buddha1021 wrote: Hi: The documentation says that Hadoop distributes the data and processing across clusters of commonly available computers. But what does "commonly available computers" mean? A 1U server? Or the PCs that people use daily on Windows? -- View this message in context: http://www.nabble.com/the-question-about-the-common-pc--tp22092022p22092038.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: the question about the common pc?
On Feb 18, 2009, at 11:43 PM, 柳松 wrote: Actually, there's a widespread misunderstanding of this "common PC". "Common PC" doesn't mean PCs which are used daily; it means that the performance of each node can be measured by a common PC's computing power. As a matter of fact, we don't use Gb Ethernet for daily PCs' communication, I certainly do. we don't use Linux for our document processing, I do. and most importantly, Hadoop cannot run effectively on those daily PCs. Maybe your PC is under-spec'd? Hadoop is designed for high-performance computing equipment, but is claimed to be fit for daily PCs. Our students run it on Pentium III's with 20GB HDDs. Try finding a new laptop with specs that low. Hadoop for PCs? What a joke. The truth is that Hadoop scales to the gear you have. If you throw a bunch of Windows desktops at it, it'll perform like a bunch of Windows desktops. If you run it on the student test cluster, it'll perform like Java on PIIIs. If you run it on a new high-performance cluster ... well, you get the point. If you want to run Hadoop for development work, I'd say you want to use your desktop. If you want to run Hadoop for production work, I'd recommend a production environment - decently powered 1U Linux servers with large disks (or whatever the recommendation is on the wiki). Brian -----Original Message----- From: buddha1021 buddha1...@yahoo.cn Sent: Thursday, 19 February 2009 To: core-user@hadoop.apache.org Cc: Subject: Re: the question about the common pc? And, are the nodes the PCs that people use daily on Windows, or 1U servers? buddha1021 wrote: Hi: The documentation says that Hadoop distributes the data and processing across clusters of commonly available computers. But what does "commonly available computers" mean? A 1U server? Or the PCs that people use daily on Windows? -- View this message in context: http://www.nabble.com/the-question-about-the-common-pc--tp22092022p22092038.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Re:Re: the question about the common pc?
On Thu, 2009-02-19 at 13:43 +0800, 柳松 wrote: Hadoop is designed for High performance computing equipment, but claimed to be fit for daily pcs. The phrase High Performance Computing equipment makes me think of infiniband, fibre all over the place etc. Hadoop doesn't need that, it runs well on standard pc hardware - i.e. no special hardware you couldn't find in a standard pc. That doesn't mean you should run it on pcs that are being used for other things though. I found that hadoop ran ok on fairly old hardware - a load of old power-pc macs (running linux) churned through some jobs quickly, and I've actually run it on people's office machines during the nights (not on Windows). I did end up having to add an extra switch in for the part of the network that was only 100 mbps to get the throughput though. Of course ideally you would be running it on a rack of 1u servers, but that's still normally standard pc hardware.
Re: Re:Re: the question about the common pc?
When I said "the PCs that people use daily on Windows", I wanted to specify the common hardware (not the OS); I don't mean Hadoop running on Windows! I mean Hadoop running on common PC hardware, certainly with Linux as the OS! Tim Wintle wrote: On Thu, 2009-02-19 at 13:43 +0800, 柳松 wrote: Hadoop is designed for high-performance computing equipment, but is claimed to be fit for daily PCs. The phrase "High Performance Computing equipment" makes me think of infiniband, fibre all over the place etc. Hadoop doesn't need that, it runs well on standard pc hardware - i.e. no special hardware you couldn't find in a standard pc. That doesn't mean you should run it on pcs that are being used for other things though. I found that hadoop ran ok on fairly old hardware - a load of old power-pc macs (running linux) churned through some jobs quickly, and I've actually run it on people's office machines during the nights (not on Windows). I did end up having to add an extra switch in for the part of the network that was only 100 mbps to get the throughput though. Of course ideally you would be running it on a rack of 1u servers, but that's still normally standard pc hardware. -- View this message in context: http://www.nabble.com/the-question-about-the-common-pc--tp22092022p22094601.html Sent from the Hadoop core-user mailing list archive at Nabble.com.