Re: multi-thread mapreduce

2012-12-12 Thread Harsh J
Exactly - A job is already designed to be properly parallel w.r.t. its input, and this would just add additional overheads of job setup and scheduling. If your per-record processing requires threaded work, consider using the MultithreadedMapper/Reducer classes instead. On Wed, Dec 12, 2012 at
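For reference, the driver-side wiring Harsh is pointing at looks roughly like this on the Hadoop 1.x mapreduce API (MyMapper and the thread count of 8 are placeholders, not details from this thread):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

// Fragment of a job driver: wrap an existing mapper so each map task
// runs several threads over its input split.
Configuration conf = new Configuration();
Job job = new Job(conf, "threaded-example");
job.setMapperClass(MultithreadedMapper.class);            // framework-level wrapper
MultithreadedMapper.setMapperClass(job, MyMapper.class);  // your actual map logic
MultithreadedMapper.setNumberOfThreads(job, 8);           // threads per map task
```

Note this only helps if MyMapper is thread-safe and its work blocks on IO; a CPU-bound map gains nothing from extra threads in the same task.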

Sane max storage size for DN

2012-12-12 Thread Mohammad Tariq
Hello list, I don't know if this question makes any sense, but I would like to ask: does it make sense to store 500 TB (or more) of data on a single DN? If yes, then what should the specs of the other parameters be, viz. NN/DN RAM, N/W, etc.? If no, what could be the alternative? Many thanks.

AUTO: Prabhat Pandey is out of the office

2012-12-12 Thread Prabhat Pandey
I am out of the office until 12/17/2012. For any issues please contact Dispatcher: dispatcherdb...@us.ibm.com Thanks. Prabhat Pandey Note: This is an automated response to your message Hadoop 101 sent on 12/11/2012 17:49:45. This is the only

Re: Sane max storage size for DN

2012-12-12 Thread Ted Dunning
Yes it does make sense, depending on how much compute each byte of data will require on average. With ordinary Hadoop, it is reasonable to have half a dozen 2TB drives. With specialized versions of Hadoop considerably more can be supported. From what you say, it sounds like you are suggesting
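Ted's half-a-dozen-2TB-drives figure makes for a quick back-of-envelope check. Assuming the default replication factor of 3 and 12 TB of raw disk per node (both assumptions for illustration, not numbers from this thread):

```java
// Back-of-envelope DataNode count for a given logical data size.
public class ClusterSizing {
    // Raw capacity needed = logical data * HDFS replication factor.
    static double rawTB(double logicalTB, int replication) {
        return logicalTB * replication;
    }

    // Nodes needed, rounding up to cover any remainder.
    static int nodesNeeded(double logicalTB, int replication, double tbPerNode) {
        return (int) Math.ceil(rawTB(logicalTB, replication) / tbPerNode);
    }

    public static void main(String[] args) {
        // 500 TB at replication 3 on nodes with six 2 TB drives (12 TB each):
        // 1500 TB raw -> 125 nodes.
        System.out.println(nodesNeeded(500, 3, 12));
    }
}
```

So "500 TB on a single DN" really means spreading that data (and its replicas) across a sizeable cluster, which is the sizing question Mohammad is actually asking.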

Re: Sane max storage size for DN

2012-12-12 Thread Mohammad Tariq
Thank you so much for the valuable response, Ted. No, there would be dedicated storage for the NN as well. Any tips on RAM and N/W? Computations are not really frequent. Thanks again. Regards, Mohammad Tariq On Wed, Dec 12, 2012 at 9:14 PM, Ted Dunning tdunn...@maprtech.com wrote: Yes it

Re: Hadoop 101

2012-12-12 Thread Pat Ferrel
Yeah, I found the TextInputFormat and TextKeyValueInputFormat, and I know how to parse text--I'm just too lazy. I was hoping there was a Text equivalent of a SequenceFile hidden somewhere. As I said, there is no mapper; this is running outside of Hadoop M/R. So I at least need a line
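For what Pat is after, a plain-JDK line reader that splits each line on the first tab (mirroring the default key/value separator of Hadoop's key-value text format) is only a few lines. The file name is illustrative:

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap;
import java.util.Map;

public class KvLineReader {
    // Split a line into (key, value) on the first tab, as
    // KeyValueTextInputFormat does by default. A line with no tab
    // becomes (line, "").
    static Map.Entry<String, String> parse(String line) {
        int i = line.indexOf('\t');
        if (i < 0) return new AbstractMap.SimpleEntry<>(line, "");
        return new AbstractMap.SimpleEntry<>(line.substring(0, i), line.substring(i + 1));
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) return;  // pass a file path to dump it
        try (BufferedReader r = Files.newBufferedReader(Path.of(args[0]))) {
            String line;
            while ((line = r.readLine()) != null) {
                Map.Entry<String, String> kv = parse(line);
                System.out.println(kv.getKey() + " => " + kv.getValue());
            }
        }
    }
}
```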

Re: Sane max storage size for DN

2012-12-12 Thread Michael Segel
500 TB? How many nodes in the cluster? Is this attached storage or is it in an array? I mean if you have 4 nodes for a total of 2PB, what happens when you lose 1 node? On Dec 12, 2012, at 9:02 AM, Mohammad Tariq donta...@gmail.com wrote: Hello list, I don't know if this
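Michael's concern can be made concrete: when a very dense node dies, HDFS must re-replicate everything it held. Assuming 500 TB lost and 10 Gbit/s of usable aggregate re-replication bandwidth (both assumptions for illustration):

```java
public class ReReplication {
    // Hours to re-copy lostTB of data at gbps gigabits/second of
    // aggregate cluster bandwidth (decimal TB; ignores disk limits
    // and competing traffic).
    static double hoursToRecover(double lostTB, double gbps) {
        double bits = lostTB * 1e12 * 8;        // TB -> bits
        double seconds = bits / (gbps * 1e9);   // bits / (bits per second)
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        // 500 TB over 10 Gbit/s: roughly 111 hours just to move the bytes.
        System.out.println(hoursToRecover(500, 10));
    }
}
```

Days of degraded redundancy after a single failure is the practical argument against very few, very dense nodes.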

Re: multi-thread mapreduce

2012-12-12 Thread Yang
But I have run across some situations where I could benefit from multi-threading: if your Hadoop mapper is prone to random-access IO (such as looking up a TFile, or HBase, which ultimately makes a network call and then looks into a file segment), having multiple threads could utilize the CPU
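Yang's point, that threads pay off when per-record work blocks on IO rather than CPU, can be sketched with a plain JDK thread pool; the 20 ms sleep stands in for a network/HBase lookup, and nothing here is Hadoop-specific:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class IoBoundMapSketch {
    // "Look up" one record; the sleep models a blocking network call.
    static String lookup(int record) {
        try { Thread.sleep(20); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return "value-" + record;
    }

    // Process records with nThreads workers; because each lookup blocks,
    // wall time shrinks roughly by a factor of nThreads.
    static List<String> mapAll(List<Integer> records, int nThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Integer r : records) {
                futures.add(pool.submit(() -> lookup(r)));
            }
            List<String> out = new ArrayList<>();
            for (Future<String> f : futures) {
                try { out.add(f.get()); } catch (Exception e) { throw new RuntimeException(e); }
            }
            return out;  // same order as the input records
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        List<String> out = mapAll(List.of(1, 2, 3, 4, 5, 6, 7, 8), 8);
        System.out.printf("%d records, %d ms%n", out.size(), (System.nanoTime() - t0) / 1_000_000);
    }
}
```

This is exactly what MultithreadedMapper does inside one map task, which is why it helps for TFile/HBase-style lookups but not for CPU-bound maps.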

Hive Action Failing in Oozie

2012-12-12 Thread Dave Cardwell
Hello there, I have an Oozie workflow that is failing on a Hive action with the following error: FAILED: SemanticException [Error 10001]: Table not found attempted_calls_import_raw_logs_named_route_name If I run the query file from the command line (as described in the map task log), it works

Re: Hive Action Failing in Oozie

2012-12-12 Thread Dave Cardwell
Thank you for the suggestion. From the log output javax.jdo.option.ConnectionDriverName appears to be set to com.mysql.jdbc.Driver, with the correct IP in javax.jdo.option.ConnectionURL. I have copied hive-site.xml from the local machine into Hadoop and instructed Oozie to use that, which it
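For anyone hitting the same symptom: Hive silently falling back to a fresh embedded metastore inside the Oozie launcher is the classic cause of a "Table not found" that does not reproduce on the command line, and the usual cure is shipping the real hive-site.xml with the action via job-xml. A sketch of the action element (paths, names, and the wfRoot property are illustrative):

```xml
<action name="hive-import">
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <!-- Ship the real metastore config with the action; the HDFS
         path below is a placeholder. -->
    <job-xml>${wfRoot}/conf/hive-site.xml</job-xml>
    <script>import.q</script>
  </hive>
  <ok to="end"/>
  <error to="fail"/>
</action>
```

If the launcher's log still shows a Derby ConnectionURL rather than the MySQL one, the shipped hive-site.xml is not being picked up.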

RE: Hadoop 101

2012-12-12 Thread David Parks
Nothing that I'm aware of for text files; I'd just use standard unix utils to process it outside of Hadoop. As to getting a reader from any of the InputFormats, here's the typical example you'd follow to get the reader for a sequence file; you could extrapolate the example to access whichever
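The kind of example David means typically looks like this on the Hadoop 1.x API (the path and the Text key/value types are placeholders; this needs the Hadoop jars on the classpath but no running MapReduce job):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.ReflectionUtils;

// Fragment: read a sequence file directly, outside any MapReduce job.
Configuration conf = new Configuration();
Path path = new Path("hdfs:///data/part-00000");  // placeholder path
FileSystem fs = path.getFileSystem(conf);
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
try {
    // The file records its own key/value classes; Text is assumed here.
    Text key = (Text) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Text value = (Text) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
    }
} finally {
    reader.close();
}
```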

Shuffle's getMapOutput() fails with EofException, followed by IllegalStateException

2012-12-12 Thread David Parks
I'm having exactly this problem, and it's causing my job to fail when I try to process a larger amount of data (I'm attempting to process 30GB of compressed CSVs and the entire job fails every time). This issue is open for it: https://issues.apache.org/jira/browse/MAPREDUCE-5 Anyone have any

Re: Sane max storage size for DN

2012-12-12 Thread Mohammad Tariq
Hello Chris, Thank you so much for the valuable insights. I was actually using the same principle. I made a blunder and did the maths for the entire (9*3) PB. Seems I am higher than you, and that too without drinking ;) Many thanks. Regards, Mohammad Tariq On Thu, Dec 13, 2012 at 10:38

What is the difference between the branch-1 and branch-1-win

2012-12-12 Thread pengwenwu2008
Hi all, Could you help me understand the difference between branch-1 and branch-1-win? Regards, Wenwu Peng

Hadoop-1.1.1 namenode and datanode version mismatch

2012-12-12 Thread Mark Grover
Hi all, I downloaded the Hadoop-1.1.1 tarball from one of the mirrors and configured it in pseudo-distributed mode. The namenode starts fine but the datanode fails to start because of a version mismatch. The value of the hadoop.relaxed.worker.version.check property (related to

Re: Hadoop-1.1.1 namenode and datanode version mismatch

2012-12-12 Thread Zizon Qiu
The versions do not match, as the log indicates: namenode: 1.1.1, datanode: 1.1.2-SNAPSHOT. hadoop.relaxed.worker.version.check only works when the versions match (it relaxes just the revision check). You may try hadoop.skip.worker.version.check. See https://issues.apache.org/jira/browse/HADOOP-8968 On
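Concretely, per HADOOP-8968 that would be a DataNode-side entry along these lines; use it with care, since it disables the worker version check entirely:

```xml
<!-- hdfs-site.xml on the DataNode: accept a NameNode whose version
     string differs (here 1.1.1 vs 1.1.2-SNAPSHOT). -->
<property>
  <name>hadoop.skip.worker.version.check</name>
  <value>true</value>
</property>
```

The cleaner fix, of course, is to run the same build on both daemons.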

Re: which version should I take

2012-12-12 Thread Harsh J
If your production target is a bit far away, I'd encourage setting up and using the 2.x-based releases for their feature set, which may aid you in your design. We'll be releasing 2.0.3 soon. However, if you want the older, stable code, go with the 1.x-based releases. On Wed, Dec 12, 2012 at 6:58 PM,