S3 Exception for a Map Reduce job on EC2

2009-10-28 Thread Harshit Kumar
Hi, I have 1 GB of RDF/OWL files that I am processing on EC2. Execution throws the following exception: 08/11/19 16:08:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. org.apache.hadoop.fs.s3.S3Except...

Re: Using Configuration instead of JobConf

2009-10-28 Thread Oliver B. Fischer
Hello Tim, thank you. I found the examples, and even the WordCount has been updated. Only the documentation is out of date. Bye, Oliver tim robertson wrote: > The org.apache.hadoop.examples.SecondarySort in 0.20.1 is an example > using the org.ap...

Error in configuring object

2009-10-28 Thread David Greer
Hi Everyone, I'm experimenting with Hadoop and trying to get it running in pseudo-distributed mode as described in Appendix A of Tom White's book Hadoop: The Definitive Guide. I've got the MaxTemperature example working when there are *no* job or task tracker daemons running. I didn't actually kno...

Outputting a Separate File for each Line of Output

2009-10-28 Thread Ryan Rosario
In Streaming tasks, how can I output a separate file with the key as the filename, for each line of output, instead of collecting it in a big file? Thanks, Ryan
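
One common answer (a sketch, not from this thread) is to package a small Java output format and hand it to the streaming job with -outputformat: subclassing the old-API org.apache.hadoop.mapred.lib.MultipleTextOutputFormat lets the output file name be derived from the key. The class name below is illustrative.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    // Illustrative class; build it into a jar and pass it to streaming
    // with -outputformat (plus -libjars so the jar ships with the job).
    public class KeyNamedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
            // One output file per distinct key instead of a single part-NNNNN file.
            return key.toString();
        }
    }

Beware that a job with many distinct keys opens many files at once and can exhaust file handles; this is a sketch of the idea, not a drop-in fix.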

TaskTracker's totalMemoryAllottedForTasks is -1

2009-10-28 Thread Hassaan Khan
Going back to the issue: http://markmail.org/search/?q=Multinode+cluster+setup+issues#query:Multinode%20cluster%20setup%20issues+page:1+mid:qi45trv4fdfwugwf+state:results Having gone over the steps pointed to in the response, I looked at the TaskTracker, and on startup I see the following err...

Re: Passing Properties With Whitespace To Streaming

2009-10-28 Thread Brian Vargas
Todd, I am using CDH2, and solution 'a' fixed the problem. Thanks for the help, and I look forward to the next release! Brian Todd Lipcon wrote: > Hi Brian, > > Any chance you are using the Cloudera distribution? We did accidentally ship > a bug like this which will be ameliorated in our next...

Re: Which FileInputFormat to use for fixed length records?

2009-10-28 Thread Aaron Kimball
I think these would be good to add to mapreduce in the {{org.apache.hadoop.mapreduce.lib.input}} package. Please file a JIRA and apply a patch! - Aaron On Wed, Oct 28, 2009 at 11:15 AM, yz5od2 wrote: > Hi all, > I am working on writing a FixedLengthInputFormat class and a corresponding > FixedLen...

Re: How to give consecutive numbers to output records?

2009-10-28 Thread Mark Kerzner
I agree, Edward. One Reducer is what I have now, and it works. It is, or may become, a bottleneck. When it does, I can go back to using multiple reducers and add a second MR job to renumber. That way, I can stick with the current setup until it becomes necessary to optimize, and not introduce any chan...
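
For reference, a minimal sketch of the single-reducer scheme Mark describes, written against the old mapred API; the key/value types are illustrative. With mapred.reduce.tasks=1, the lone reducer sees every record in sort order, so a plain counter yields gap-free consecutive numbers:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class NumberingReducer extends MapReduceBase
            implements Reducer<Text, Text, LongWritable, Text> {
        private long counter = 0;  // consecutive and gap-free, in sort order

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<LongWritable, Text> out, Reporter reporter)
                throws IOException {
            while (values.hasNext()) {
                out.collect(new LongWritable(counter++), values.next());
            }
        }
    }

The single reducer is exactly the bottleneck mentioned above; the sketch only shows why the numbering it produces is trivially correct.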

Re: How to give consecutive numbers to output records?

2009-10-28 Thread Edward Capriolo
On Wed, Oct 28, 2009 at 2:20 PM, Mark Kerzner wrote: > Brien, > > - I am on EC2, what would be the advantage of using Zookeeper over > JavaSpaces? Either would have to be maintained by me, as they are not > provided on EC2 directly; > - pack that with a map-local counter into a global ID...

Re: Passing Properties With Whitespace To Streaming

2009-10-28 Thread Todd Lipcon
Hi Brian, Any chance you are using the Cloudera distribution? We did accidentally ship a bug like this which will be ameliorated in our next release. The temporary workarounds are: a) edit /usr/bin/hadoop and change the $* to a "$@" (including the quotes!) or b) use /usr/lib/hadoop-0.20/bin/had...

Passing Properties With Whitespace To Streaming

2009-10-28 Thread Brian Vargas
Hi, Using Hadoop 0.20 (CDH2), I'm trying to pass some JVM options to my child tasks on the command line, like this: $ hadoop jar streaming.jar -D mapred.reduce.tasks=0 -D 'mapred.child.java.opts=-Xms200m -Xmx400m' -input foo.txt -output bar -mapper /bin/cat However, this fails with: ERROR streami...

Re: How to give consecutive numbers to output records?

2009-10-28 Thread Mark Kerzner
Brien, - I am on EC2, what would be the advantage of using Zookeeper over JavaSpaces? Either would have to be maintained by me, as they are not provided on EC2 directly; - "pack that with a map-local counter into a global ID" - you mean, just take the global counter and make the loca...

Re: Which FileInputFormat to use for fixed length records?

2009-10-28 Thread yz5od2
Hi all, I am working on writing a FixedLengthInputFormat class and a corresponding FixedLengthRecordReader. Would the Hadoop Common project have interest in these? Basically these are for reading inputs of textual record data, where each record is a fixed length (no carriage returns or s...
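
As a rough illustration of what such a reader involves, here is a sketch against the new mapreduce API; the class name, the fixedlength.record.length property, and the split-alignment details are assumptions, not the code under discussion. It also assumes the file length is a multiple of the record length:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class FixedLengthRecordReader extends RecordReader<LongWritable, BytesWritable> {
        private FSDataInputStream in;
        private long start, end, pos;
        private int recordLength;  // fixed record size in bytes

        private final LongWritable key = new LongWritable();
        private final BytesWritable value = new BytesWritable();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            FileSplit fileSplit = (FileSplit) split;
            Configuration conf = context.getConfiguration();
            recordLength = conf.getInt("fixedlength.record.length", -1);  // hypothetical property
            Path file = fileSplit.getPath();
            in = file.getFileSystem(conf).open(file);
            // A record belongs to the split in which it starts: round the
            // split start up to the next record boundary.
            long splitStart = fileSplit.getStart();
            long rem = splitStart % recordLength;
            start = (rem == 0) ? splitStart : splitStart + (recordLength - rem);
            end = splitStart + fileSplit.getLength();
            in.seek(start);
            pos = start;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (pos >= end) return false;
            byte[] record = new byte[recordLength];
            in.readFully(record);  // no delimiters: read exactly one record
            key.set(pos);          // key = byte offset of the record
            value.set(record, 0, recordLength);
            pos += recordLength;
            return true;
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() {
            return end == start ? 1.0f : (pos - start) / (float) (end - start);
        }
        @Override public void close() throws IOException { if (in != null) in.close(); }
    }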

Re: How to give consecutive numbers to output records?

2009-10-28 Thread brien colwell
Another approach is to initialize each map task with an ID (using JavaSpaces, something like Zookeeper, or some aspect of the input data) and then pack that with a map-local counter into a global ID. This makes assumptions like the number of map tasks being less than 2^10 and the number of records p...
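
A minimal sketch of the packing Brien describes; the class name and the exact 10/53 bit split are illustrative (the 10 bits mirror the "fewer than 2^10 map tasks" assumption):

    public final class GlobalIds {
        private static final int TASK_BITS = 10;   // fewer than 2^10 map tasks
        private static final int LOCAL_BITS = 53;  // fewer than 2^53 records per task
        // 1 sign bit + 10 + 53 = 64, so packed IDs stay non-negative.

        // taskId comes from JavaSpaces, ZooKeeper, or the input data;
        // localCounter is incremented once per record within the task.
        public static long pack(long taskId, long localCounter) {
            assert taskId < (1L << TASK_BITS) && localCounter < (1L << LOCAL_BITS);
            return (taskId << LOCAL_BITS) | localCounter;
        }
    }

The packed IDs are unique across tasks but not consecutive, which is the trade-off discussed in the rest of this thread.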

Re: How to give consecutive numbers to output records?

2009-10-28 Thread Mark Kerzner
Oh, I see. Very smart. In my case, I need consecutive numbers with no gaps, and I need them in the order in which Hadoop sorted the maps. So I don't see how I could apply this approach, but thank you - it is a great discussion, which was helpful for considering all the issues around this, and which brought...

Re: How to give consecutive numbers to output records?

2009-10-28 Thread Michael Klatt
Hi Mark, Each mapper (or reducer) has an environment variable "mapred_map_tasks" (or "mapred_reduce_tasks") which describes how many tasks the map or reduce job was split into. It also has a variable "mapred_task_id" which contains a unique identifier for the task. Using these two togethe...
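
A minimal sketch of the interleaving this enables: task T of N hands out IDs T, T+N, T+2N, and so on, so tasks never collide and gaps stay small. Reading the equivalent values from the JobConf ("mapred.task.partition", "mapred.map.tasks") instead of the environment is an assumption here, as is the class name:

    import org.apache.hadoop.mapred.JobConf;

    public class InterleavedIdAssigner {
        private final int taskNumber;  // this task's partition, 0..numTasks-1
        private final int numTasks;    // total number of map tasks in the job
        private long handedOut;        // IDs issued so far by this task

        public InterleavedIdAssigner(JobConf conf) {
            taskNumber = conf.getInt("mapred.task.partition", 0);
            numTasks = conf.getInt("mapred.map.tasks", 1);
        }

        public long nextId() {
            return taskNumber + (handedOut++) * (long) numTasks;
        }
    }

Gaps appear only because tasks emit different numbers of records, which is the "small gaps" caveat mentioned below.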

Re: How to give consecutive numbers to output records?

2009-10-28 Thread Mark Kerzner
Michael, environment variables are available in Java, but the environment itself is not shared between instances. I read your code - you are solving exactly the same problem I am interested in - but I did not see how it works in a distributed environment. By the way, it occurs to me that JavaSpac...

Re: How to give consecutive numbers to output records?

2009-10-28 Thread Michael Klatt
I posted an approach to this using streaming, but if the environment variables are available in the standard Java interface, this may work for you. http://www.mail-archive.com/core-u...@hadoop.apache.org/msg09079.html You'll have to be able to tolerate some small gaps in the IDs. Michael Mark K...

Re: Multinode cluster setup issues

2009-10-28 Thread tim robertson
I've seen those errors when I was playing with the values in core-site.xml, hdfs-site.xml, and mapred-site.xml. It might be worth comparing your values to mine, discussed in the thread http://www.mail-archive.com/common-user@hadoop.apache.org/msg02522.html which also covers 8G DN machines. Cheers T...

Multinode cluster setup issues

2009-10-28 Thread Hassaan Khan
I'm running Hadoop 0.20.1+133 (the Cloudera distro). I tried setting up a multi-node Hadoop cluster, and on executing the command: hadoop jar /usr/lib/hadoop/hadoop-0.20.1+133-examples.jar grep input output 'dfs[a-z.]+' I get: 09/10/27 20:39:21 INFO mapred.FileInputFormat: Total input paths to process :...

Re: Distribution of data in nodes with different storage capacity

2009-10-28 Thread Amogh Vasekar
Hi, The rebalancer should help you: http://issues.apache.org/jira/browse/HADOOP-1652 Amogh On 10/28/09 2:54 PM, "Vibhooti Verma" wrote: Hi All, We are facing an issue with the distribution of data in a cluster where nodes have different storage capacities. We have 4 nodes with 100G capacity and 1 node w...

Distribution of data in nodes with different storage capacity

2009-10-28 Thread Vibhooti Verma
Hi All, We are facing an issue with the distribution of data in a cluster where nodes have different storage capacities. We have 4 nodes with 100G capacity and 1 node with 2TB capacity. The storage of the high-capacity node is not being utilized, whereas all the low-capacity nodes are filling up.

Re: Mount WebDav in Linux for HDFS-0.20.1

2009-10-28 Thread Zhang Bingjun (Eddy)
Dear Huy Phan, To follow up: even though the performance of webdav+davfs2 for accessing HDFS is worse than that of fuse-dfs, it has been much more stable than fuse-dfs so far. As you have said, the memory leak in fuse-dfs is solvable, but it is hard to find all the leaks, especially when the plain C/C++ code...