Re: help on CombineFileInputFormat

2010-05-10 Thread Aaron Kimball
Zhenyu, It's a bit complicated and involves some layers of indirection. CombineFileRecordReader is a sort of shell RecordReader that passes the actual work of reading records to another child record reader. That's the class name provided in the third parameter. Instructing it to use

Speakers and Schedule for Berlin Buzzwords 2010 - Search, Store and Scale 7th/8th 2010

2010-05-10 Thread Isabel Drost
Hi folks, we proudly present the Berlin Buzzwords talks and presentations. As promised there are tracks specific to the three tags search, store and scale. We have a fantastic mixture of developers and users of open source software projects that make scaling data processing today possible. There

Fully distribute TextInputFormat...

2010-05-10 Thread Pierre ANCELOT
Hi folks :) I have one big file... I read it with FileInputFormat, this generates only one task and of course, this doesn't get distributed across the cluster nodes. Should I use another input class or do I have a bug in my implementation? The desired behavior is one task per line. Thanks.

Re: Fully distribute TextInputFormat...

2010-05-10 Thread Jeff Zhang
What's the format of this file? gzip can't be split. On Mon, May 10, 2010 at 5:21 AM, Pierre ANCELOT pierre...@gmail.com wrote: Hi folks :) I have one big file... I read it with FileInputFormat, this generates only one task and of course, this doesn't get distributed across the cluster

Re: Fully distribute TextInputFormat...

2010-05-10 Thread Pierre ANCELOT
Simple and pure raw ASCII text. One line == one treatment to do. On Mon, May 10, 2010 at 2:52 PM, Jeff Zhang zjf...@gmail.com wrote: What's the format of this file? gzip can't be split. On Mon, May 10, 2010 at 5:21 AM, Pierre ANCELOT pierre...@gmail.com wrote: Hi folks :) I have one

Re: Fully distribute TextInputFormat...

2010-05-10 Thread Pierre ANCELOT
Idea is, I want to share the lines of the file equally between nodes... On Mon, May 10, 2010 at 3:05 PM, Pierre ANCELOT pierre...@gmail.com wrote: Simple and pure raw ascii text. One line == one treatment to do. On Mon, May 10, 2010 at 2:52 PM, Jeff Zhang zjf...@gmail.com wrote: What's

Re: Fully distribute TextInputFormat...

2010-05-10 Thread Ted Yu
NLineInputFormat seems a fit for your need. On Mon, May 10, 2010 at 6:05 AM, Pierre ANCELOT pierre...@gmail.com wrote: Simple and pure raw ascii text. One line == one treatment to do. On Mon, May 10, 2010 at 2:52 PM, Jeff Zhang zjf...@gmail.com wrote: What's the format of this file ? gzip
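For readers following this thread, here is a minimal plain-Java sketch of the idea behind NLineInputFormat: the input is cut into splits of N lines each, and each split goes to its own map task. It does not use Hadoop; the class name `NLineSplitSketch` and the method `splitByLines` are hypothetical names made up for illustration. In Hadoop 0.20 the real class is, to my understanding, `org.apache.hadoop.mapred.lib.NLineInputFormat`, with the lines per split set through the `mapred.line.input.format.linespermap` property; please check that against your version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: mimics how an N-lines-per-split input format groups
// lines. This is NOT Hadoop code; all names here are hypothetical.
public class NLineSplitSketch {

    // Groups the input lines into consecutive chunks of at most
    // linesPerSplit lines. Each chunk stands in for one map task's input.
    static List<List<String>> splitByLines(List<String> lines, int linesPerSplit) {
        if (linesPerSplit < 1) {
            throw new IllegalArgumentException("linesPerSplit must be >= 1");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerSplit) {
            int end = Math.min(start + linesPerSplit, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c", "d", "e");
        // With linesPerSplit = 1 every line becomes its own split,
        // which is the "one task per line" behavior asked for above.
        System.out.println(splitByLines(lines, 1).size() + " splits of 1 line");
        System.out.println(splitByLines(lines, 2));
    }
}
```

With one line per split, a file of a million lines means a million map tasks, so a larger N is usually the better trade-off between parallelism and per-task overhead.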

This list's spam filter

2010-05-10 Thread Oscar Gothberg
Hi, I've been trying all morning to post a Hadoop question to this list but can't get through the spam filter. At a loss. Does anyone have any ideas what may trigger it? What can I do to not have it tag me? Thanks, / Oscar

Re: This list's spam filter

2010-05-10 Thread Todd Lipcon
Try sending plaintext email instead of rich - the spam scoring for HTML email is overly aggressive on the apache listservs. -Todd On Mon, May 10, 2010 at 11:14 AM, Oscar Gothberg oscar.gothb...@gmail.com wrote: Hi, I've been trying all morning to post a Hadoop question to this list but can't

job executions fail with NotReplicatedYetException

2010-05-10 Thread Oscar Gothberg
Hi, I keep having jobs fail at the very end, with 100% complete map, 100% complete reduce, due to NotReplicatedYetException w.r.t the _temporary subdirectory of the job output directory. It doesn't happen 100% of the time, so it's not trivially reproducible, but it happens enough (10-20% of

Re: This list's spam filter

2010-05-10 Thread Oscar Gothberg
Thanks a lot! Didn't even notice that Gmail would default to HTML format. / Oscar On Mon, May 10, 2010 at 11:15 AM, Todd Lipcon t...@cloudera.com wrote: Try sending plaintext email instead of rich - the spam scoring for HTML email is overly aggressive on the apache listservs. -Todd On Mon,

Re: Fully distribute TextInputFormat...

2010-05-10 Thread Edward Capriolo
If you're curious, I found out this morning that NLineInputFormat is not ported to the new mapreduce API yet. (It might be in trunk.) So using NLineInputFormat forces you into the older mapred API. Edward On Mon, May 10, 2010 at 12:35 PM, Ted Yu yuzhih...@gmail.com wrote: NLineInputFormat

Any plans to provide 0.20.x patch for HADOOP-4584 - Slow generation of Block Report at DataNode causes delay of sending heartbeat to NameNode

2010-05-10 Thread Jon Graham
Hello Everyone, Is there a patch available for HADOOP-4584 that can be used on 0.20.2? Link https://issues.apache.org/jira/browse/HADOOP-4584 seems to indicate that a patch is available for the 0.21 version, but this version has not been released yet. Block reports are taking several minutes on our cluster

Best Way Repartitioning a 100-150TB Table

2010-05-10 Thread Matias Silva
Hi Everyone, thanks for your time. What's the best way to repartition one table into 3 partitions using a replication factor of 3? We have anywhere between 100-150 TB in this table. I would like to avoid copying data over. Any suggestions? From what I understand that Hadoop/Hive is file

Questions about SequenceFiles

2010-05-10 Thread Ananth Sarathy
My team and I were working with sequence files and were using the LuceneDocumentWrapper. But when I try to get the valcall, I get a no such method exception from the ReflectionUtils, which is caused because it's trying to call a default constructor which doesn't exist for that class. So my

Re: Best Way Repartitioning a 100-150TB Table

2010-05-10 Thread Patrick Angeles
Matias, Hive partitions map to subdirectories in HDFS. You can do a 'mv' if you're lucky enough to have each partition in a distinct HDFS file that could be moved to the right partition subdirectory. Otherwise, you can run a MapReduce job to collate your data into separate files per partition.
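To make the "partitions map to subdirectories" point concrete, here is a small plain-Java sketch that builds the directory path for a partition using the usual `column=value` naming. The class `PartitionPathSketch`, the method `partitionPath`, the table directory and the partition column `region` are all hypothetical, picked only for illustration; your warehouse location and partition columns will differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: shows the column=value subdirectory convention.
// All names and paths here are hypothetical.
public class PartitionPathSketch {

    // Appends one "key=value" path segment per partition column,
    // in the order the columns appear in the map.
    static String partitionPath(String tableDir, Map<String, String> partitionSpec) {
        StringBuilder path = new StringBuilder(tableDir);
        for (Map.Entry<String, String> e : partitionSpec.entrySet()) {
            path.append('/').append(e.getKey()).append('=').append(e.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the partition columns in insertion order.
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("region", "eu");
        System.out.println(partitionPath("/user/hive/warehouse/events", spec));
    }
}
```

A file that already holds data for exactly one partition can be moved under such a directory without rewriting it, which is the 'mv' case described above. The partition still has to be registered with Hive afterwards.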

Re: Fully distribute TextInputFormat...

2010-05-10 Thread himanshu chandola
Actually, would you have a case when no splitting is needed? Just curious. It seems that you would use LZO or not use any compression at all. H - Original Message From: Alex Baranov alex.barano...@gmail.com To: common-user@hadoop.apache.org Sent: Mon, May 10, 2010 4:27:11 PM Subject:

Re: Fully distribute TextInputFormat...

2010-05-10 Thread Alex Baranov
I meant splitting a very large file to distribute it over multiple map tasks. Alex. http://sematext.com On Tue, May 11, 2010 at 6:13 AM, himanshu chandola himanshu_cool...@yahoo.com wrote: Actually would you have a case when no splitting is needed. Just curious. It seems that you would use