Re: help on CombineFileInputFormat
Zhenyu,

It's a bit complicated and involves some layers of indirection. CombineFileRecordReader is a sort of shell RecordReader that passes the actual work of reading records to another child record reader: the class named in its third constructor parameter. Instructing it to use CombineFileRecordReader again as its child RR doesn't tell it to do anything useful. You must give it the name of another RecordReader class that actually understands how to parse your particular records.

Unfortunately, TextInputFormat's LineRecordReader and SequenceFileInputFormat's SequenceFileRecordReader both require the InputSplit to be a FileSplit, so you can't use them directly. (CombineFileInputFormat will pass a CombineFileSplit to the CombineFileRecordReader, which is then passed along to the child RR that you specify.)

In Sqoop I got around this by creating (another!) indirection class called CombineShimRecordReader. The export functionality of Sqoop uses CombineFileInputFormat to allow the user to specify the number of map tasks; it then organizes a set of input files into that many tasks. This instantiates a CombineFileRecordReader configured to forward its InputSplit to CombineShimRecordReader. CombineShimRecordReader then translates the CombineFileSplit into a regular FileSplit and forwards that to LineRecordReader (for text) or SequenceFileRecordReader (for SequenceFiles). The grandchild (LineRR or SequenceFileRR) is determined on a file-by-file basis by CombineShimRecordReader, by calling a static method of Sqoop's ExportJobBase.
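The layering described above can be sketched without any Hadoop dependencies. This is a simplified, hedged illustration of the shell-reader/child-reader delegation pattern only -- RecordReader, LineReader, and CombineReader here are stand-in names modeling each "file" as a list of strings, not the real Hadoop classes or signatures:

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Stand-in for a record reader interface (NOT the Hadoop API).
interface RecordReader {
    boolean next();      // advance to the next record, false when exhausted
    String getValue();   // current record
}

// Child reader that understands one file's records (in the spirit of LineRecordReader).
class LineReader implements RecordReader {
    private final Iterator<String> lines;
    private String current;
    LineReader(List<String> fileLines) { this.lines = fileLines.iterator(); }
    public boolean next() {
        if (!lines.hasNext()) return false;
        current = lines.next();
        return true;
    }
    public String getValue() { return current; }
}

// Shell reader in the spirit of CombineFileRecordReader: it owns no parsing
// logic itself; it opens a child reader per file and forwards every call,
// moving to the next file when the current child is exhausted.
class CombineReader implements RecordReader {
    private final Iterator<List<String>> files;
    private final Function<List<String>, RecordReader> childFactory;
    private RecordReader child;
    CombineReader(List<List<String>> splits,
                  Function<List<String>, RecordReader> factory) {
        this.files = splits.iterator();
        this.childFactory = factory;
    }
    public boolean next() {
        while (true) {
            if (child != null && child.next()) return true;
            if (!files.hasNext()) return false;
            child = childFactory.apply(files.next()); // open next file's child RR
        }
    }
    public String getValue() { return child.getValue(); }
}
```

Passing the shell class itself as the child factory would just recurse without ever parsing anything, which is the essence of why the original code above doesn't work.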
You can take a look at the source of these classes here:

* http://github.com/cloudera/sqoop/blob/master/src/shims/common/org/apache/hadoop/sqoop/mapreduce/ExportInputFormat.java
* http://github.com/cloudera/sqoop/blob/master/src/shims/common/org/apache/hadoop/sqoop/mapreduce/CombineShimRecordReader.java
* http://github.com/cloudera/sqoop/blob/master/src/java/org/apache/hadoop/sqoop/mapreduce/ExportJobBase.java

(apologies for the lengthy URLs; you could also just download the whole project's source at http://github.com/cloudera/sqoop) :)

Cheers,
- Aaron

On Thu, May 6, 2010 at 7:32 AM, Zhenyu Zhong zhongresea...@gmail.com wrote:

Hi, I tried to use CombineFileInputFormat in 0.20.2. It seems I need to extend it because it is an abstract class. However, I need to implement the getRecordReader method in the extended class. May I ask how to implement this getRecordReader method? I tried to do something like this:

public RecordReader getRecordReader(InputSplit genericSplit, JobConf job, Reporter reporter) throws IOException {
    // TODO Auto-generated method stub
    reporter.setStatus(genericSplit.toString());
    return new CombineFileRecordReader(job, (CombineFileSplit) genericSplit, reporter, CombineFileRecordReader.class);
}

It doesn't seem to be working. I would appreciate it if someone could shed some light on this. thanks zhenyu
Speakers and Schedule for Berlin Buzzwords 2010 - Search, Store and Scale 7th/8th 2010
Hi folks,

we proudly present the Berlin Buzzwords talks and presentations. As promised, there are tracks specific to the three tags: search, store and scale. We have a fantastic mixture of developers and users of the open source software projects that make scaling data processing possible today. There is Steve Loughran, Aaron Kimball and Stefan Groschupf from the Apache Hadoop community. We have Grant Ingersoll, Robert Muir and the Generics Policeman Uwe Schindler from the Lucene community. For those interested in NoSQL databases there is Mathias Stearn from MongoDB, Jan Lehnardt from CouchDB and Eric Evans, the guy who coined the term NoSQL one year ago.

We have just published the initial version of the schedule here: http://berlinbuzzwords.de/content/schedule-published

It seems like we are having a fantastic set of talks and speakers for Buzzwords. Visit us at http://berlinbuzzwords.de and register for the conference - looking forward to seeing you in Berlin this summer! If you like the event, please tell your friends - help spread the word on Berlin Buzzwords.

Thanks to Jan Lehnardt, Simon Willnauer and newthinking communications for co-organising the event.

Isabel
Fully distribute TextInputFormat...
Hi folks :) I have one big file. I read it with FileInputFormat, but this generates only one task, and of course this doesn't get distributed across the cluster nodes. Should I use another InputFormat class, or do I have a bug in my implementation? The desired behavior is one task per line. Thanks. -- http://www.neko-consulting.com Ego sum quis ego servo Je suis ce que je protège I am what I protect
Re: Fully distribute TextInputFormat...
What's the format of this file? gzip cannot be split.

On Mon, May 10, 2010 at 5:21 AM, Pierre ANCELOT pierre...@gmail.com wrote: Hi folks :) I have one big file... I read it with FileInputFormat, this generates only one task and of course, this doesn't get distributed across the cluster nodes. Should I use an other Input class or do I have a bug in my implementation? The desired behavior is one task per line. Thanks. -- http://www.neko-consulting.com Ego sum quis ego servo Je suis ce que je protège I am what I protect

-- Best Regards Jeff Zhang
Re: Fully distribute TextInputFormat...
Simple and pure raw ASCII text. One line == one treatment to do.

On Mon, May 10, 2010 at 2:52 PM, Jeff Zhang zjf...@gmail.com wrote: What's the format of this file? gzip cannot be split.

-- http://www.neko-consulting.com Ego sum quis ego servo Je suis ce que je protège I am what I protect
Re: Fully distribute TextInputFormat...
The idea is that I want to share the lines of the file equally between the nodes...

On Mon, May 10, 2010 at 3:05 PM, Pierre ANCELOT pierre...@gmail.com wrote: Simple and pure raw ASCII text. One line == one treatment to do.

-- http://www.neko-consulting.com Ego sum quis ego servo Je suis ce que je protège I am what I protect
Re: Fully distribute TextInputFormat...
NLineInputFormat seems a good fit for your need.

On Mon, May 10, 2010 at 6:05 AM, Pierre ANCELOT pierre...@gmail.com wrote: Simple and pure raw ASCII text. One line == one treatment to do.
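Wiring up NLineInputFormat in the old mapred API would look roughly like this. This is a sketch only: the classes and the property name are taken from Hadoop 0.20's org.apache.hadoop.mapred package, and MyJob plus the input path are hypothetical placeholders.

```java
// Old-API (org.apache.hadoop.mapred) job setup fragment -- adjust class
// names and paths to your own job; MyJob and /input/bigfile.txt are
// hypothetical placeholders.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

JobConf conf = new JobConf(MyJob.class);
conf.setInputFormat(NLineInputFormat.class);
// N lines per split; N = 1 gives the one-task-per-line behavior asked for
conf.setInt("mapred.line.input.format.linespermap", 1);
FileInputFormat.setInputPaths(conf, new Path("/input/bigfile.txt"));
```

Note that with very large files, one task per line means a very large number of tasks, so in practice a larger N per split is usually chosen.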
This list's spam filter
Hi, I've been trying all morning to post a Hadoop question to this list but can't get through the spam filter. At a loss. Does anyone have any ideas what may trigger it? What can I do to not have it tag me? Thanks, / Oscar
Re: This list's spam filter
Try sending plaintext email instead of rich text - the spam scoring for HTML email is overly aggressive on the Apache listservs.

-Todd

On Mon, May 10, 2010 at 11:14 AM, Oscar Gothberg oscar.gothb...@gmail.com wrote: Hi, I've been trying all morning to post a Hadoop question to this list but can't get through the spam filter. At a loss. Does anyone have any ideas what may trigger it? What can I do to not have it tag me? Thanks, / Oscar

-- Todd Lipcon Software Engineer, Cloudera
job executions fail with NotReplicatedYetException
Hi, I keep having jobs fail at the very end, with 100% complete map, 100% complete reduce, due to NotReplicatedYetException w.r.t. the _temporary subdirectory of the job output directory. It doesn't happen 100% of the time, so it's not trivially reproducible, but it happens enough (10-20% of runs) to make it a real pain. Any ideas, has anyone seen something similar? Part of the stack trace:

NotReplicatedYetException: Not replicated yet: /test/out/dayperiod=14731/_temporary/_attempt_201005052338_0194_r_01_0/part-1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1253)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
    at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
    ...

Thanks, / Oscar
Re: This list's spam filter
Thanks a lot! Didn't even notice that Gmail would default to HTML format. / Oscar On Mon, May 10, 2010 at 11:15 AM, Todd Lipcon t...@cloudera.com wrote: Try sending plaintext email instead of rich - the spam scoring for HTML email is overly agressive on the apache listservs. -Todd On Mon, May 10, 2010 at 11:14 AM, Oscar Gothberg oscar.gothb...@gmail.com wrote: Hi, I've been trying all morning to post a Hadoop question to this list but can't get through the spam filter. At a loss. Does anyone have any ideas what may trigger it? What can I do to not have it tag me? Thanks, / Oscar -- Todd Lipcon Software Engineer, Cloudera
Re: Fully distribute TextInputFormat...
If you're curious, I found out this morning that NLineInputFormat has not been ported to the new mapreduce API yet (it might be in trunk). So using NLineInputFormat forces you into the older mapred API.

Edward

On Mon, May 10, 2010 at 12:35 PM, Ted Yu yuzhih...@gmail.com wrote: NLineInputFormat seems a fit for your need.
Any plans to provide 0.20.x patch for HADOOP-4584 - Slow generation of Block Report at DataNode causes delay of sending heartbeat to NameNode
Hello Everyone, Is there a patch available for HADOOP-4584 that can be used on 0.20.2? The link https://issues.apache.org/jira/browse/HADOOP-4584 seems to indicate that a patch is available for the 0.21 version, but this version is not released yet. Block reports are taking several minutes on our cluster, and this causes timeout conditions and lots of retries. Thanks for your help, Jon
Best Way Repartitioning a 100-150TB Table
Hi Everyone, thanks for your time. What's the best way to repartition one table into 3 partitions using a replication factor of 3? We have anywhere between 100-150 TB in this table. I would like to avoid copying the data over. Any suggestions? From what I understand, Hadoop/Hive is file based; is there any chance to define the new table with the new partitions and then perform a mv instead of a cp on the data? Am I dreaming? Thanks, Matt
Questions about SequenceFiles
My team and I were working with sequence files and were using the LuceneDocumentWrapper. But when I try to get the value, I get a NoSuchMethodException from ReflectionUtils, which is caused because it tries to call a default constructor that doesn't exist for that class. So my question is whether there is documentation on, or limitations to, the types of objects that can be used with a SequenceFile, beyond implementing the Writable interface? I want to know if maybe I am reading from the file in the wrong way. Ananth T Sarathy
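The failure mode here can be illustrated without Hadoop at all. The sketch below mimics only the relevant behavior (reflectively instantiating a value class via its no-arg constructor, roughly what ReflectionUtils.newInstance does internally); the class names are made up for illustration:

```java
import java.lang.reflect.Constructor;

// A class with only a parameterized constructor -- reflective no-arg
// instantiation fails on this, the symptom described in the question.
class NoDefaultCtor {
    private final String payload;
    NoDefaultCtor(String payload) { this.payload = payload; }
}

// A class with an explicit no-arg constructor -- reflective instantiation works.
class WithDefaultCtor {
    String payload;
    WithDefaultCtor() { }
}

class ReflectionDemo {
    // Mimics the relevant step: look up the no-arg constructor and invoke it.
    // Throws NoSuchMethodException when the class lacks a no-arg constructor.
    static Object instantiate(Class<?> cls) throws Exception {
        Constructor<?> ctor = cls.getDeclaredConstructor();
        ctor.setAccessible(true);
        return ctor.newInstance();
    }
}
```

So in addition to implementing Writable, a key or value class used with a SequenceFile needs a no-arg constructor so the reader can instantiate it before calling readFields; wrapping a third-party class that lacks one (as here) requires a subclass or wrapper that adds it.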
Re: Best Way Repartitioning a 100-150TB Table
Matias,

Hive partitions map to subdirectories in HDFS. You can do a 'mv' if you're lucky enough to have each partition in a distinct HDFS file that could be moved to the right partition subdirectory. Otherwise, you can run a MapReduce job to collate your data into separate files per partition. You can use MultipleOutputs to do this for you. Hive trunk also supports dynamic partitions, which would allow you to do this with a Hive statement instead of writing Java MR code.

Regards,
- Patrick

On Mon, May 10, 2010 at 7:55 PM, Matias Silva msi...@specificmedia.com wrote: Hi Everyone, thanks for your time. What's the best way to repartition one table into 3 partitions using a replication factor of 3? We have anywhere between 100-150 TB in this table. I would like to avoid copying data over. Any suggestions? From what I understand that Hadoop/Hive is file based and is there any chance to define the new table with the new partitions and then perform a mv instead of cp on the data? Am I dreaming? Thanks, Matt
Re: Fully distribute TextInputFormat...
Actually, would you ever have a case where no splitting is needed? Just curious. It seems that you would either use LZO or not use any compression at all.

H

----- Original Message -----
From: Alex Baranov alex.barano...@gmail.com
To: common-user@hadoop.apache.org
Sent: Mon, May 10, 2010 4:27:11 PM
Subject: Re: Fully distribute TextInputFormat...

If I'm not mistaken, LZO compression is the better fit when splitting is needed, not gzip.

Alex Baranau http://sematext.com
Re: Fully distribute TextInputFormat...
I meant splitting a very huge file to distribute it over multiple map tasks.

Alex. http://sematext.com

On Tue, May 11, 2010 at 6:13 AM, himanshu chandola himanshu_cool...@yahoo.com wrote: Actually would you have a case when no splitting is needed. Just curious. It seems that you would use LZO or not use any compression at all.