Re: help on CombineFileInputFormat

2010-05-10 Thread Aaron Kimball
Zhenyu,

It's a bit complicated and involves some layers of indirection.
CombineFileRecordReader is a sort of shell RecordReader that delegates the
actual work of reading records to a child record reader: the class named in
the final constructor parameter. Passing CombineFileRecordReader again as
its own child RR doesn't tell it to do anything useful. You must give it
the name of another RecordReader class that actually understands how to
parse your particular records.

Unfortunately, TextInputFormat's LineRecordReader and
SequenceFileInputFormat's SequenceFileRecordReader both require the
InputSplit to be a FileSplit, so you can't use them directly.
(CombineFileInputFormat passes a CombineFileSplit to the
CombineFileRecordReader, which then hands it along to the child RR that you
specify.)

In Sqoop I got around this by creating (another!) indirection class called
CombineShimRecordReader.

The export functionality of Sqoop uses CombineFileInputFormat to allow the
user to specify the number of map tasks; it then organizes a set of input
files into that many tasks. This instantiates a CombineFileRecordReader
configured to forward its InputSplit to CombineShimRecordReader.
CombineShimRecordReader then translates the CombineFileSplit into a regular
FileSplit and forwards that to LineRecordReader (for text) or
SequenceFileRecordReader (for SequenceFiles). The grandchild (LineRR or
SequenceFileRR) is determined on a file-by-file basis by
CombineShimRecordReader, by calling a static method of Sqoop's
ExportJobBase.
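
To give the flavor of the approach, here is a minimal sketch of such a shim
against the old mapred API in 0.20.x (illustrative, not Sqoop's actual
source; the class name MyShimRecordReader is made up). Note that
CombineFileRecordReader constructs its child reader reflectively, so the
child must expose a constructor with the exact signature
(CombineFileSplit, Configuration, Reporter, Integer), where the Integer is
the index of the file within the combined split:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

public class MyShimRecordReader implements RecordReader<LongWritable, Text> {

  private final LineRecordReader delegate;

  // Signature required by CombineFileRecordReader's reflective construction.
  public MyShimRecordReader(CombineFileSplit split, Configuration conf,
      Reporter reporter, Integer idx) throws IOException {
    // Translate the idx-th chunk of the CombineFileSplit into the plain
    // FileSplit that LineRecordReader expects.
    FileSplit fileSplit = new FileSplit(split.getPath(idx),
        split.getOffset(idx), split.getLength(idx), split.getLocations());
    delegate = new LineRecordReader(conf, fileSplit);
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    return delegate.next(key, value);
  }

  public LongWritable createKey() { return delegate.createKey(); }

  public Text createValue() { return delegate.createValue(); }

  public long getPos() throws IOException { return delegate.getPos(); }

  public float getProgress() throws IOException {
    return delegate.getProgress();
  }

  public void close() throws IOException { delegate.close(); }
}

Your CombineFileInputFormat subclass can then return, for example:

  return new CombineFileRecordReader(job, (CombineFileSplit) genericSplit,
      reporter, MyShimRecordReader.class);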

You can take a look at the source of these classes here:
* http://github.com/cloudera/sqoop/blob/master/src/shims/common/org/apache/hadoop/sqoop/mapreduce/ExportInputFormat.java
* http://github.com/cloudera/sqoop/blob/master/src/shims/common/org/apache/hadoop/sqoop/mapreduce/CombineShimRecordReader.java
* http://github.com/cloudera/sqoop/blob/master/src/java/org/apache/hadoop/sqoop/mapreduce/ExportJobBase.java

(apologies for the lengthy URLs; you could also just download the whole
project's source at http://github.com/cloudera/sqoop) :)

Cheers,
- Aaron


On Thu, May 6, 2010 at 7:32 AM, Zhenyu Zhong zhongresea...@gmail.com wrote:

 Hi,

 I tried to use CombineFileInputFormat in 0.20.2. It seems I need to extend
 it because it is an abstract class.
 However, I need to implement the getRecordReader method in the extended
 class.

 May I ask how to implement this getRecordReader method?

 I tried to do something like this:

 public RecordReader getRecordReader(InputSplit genericSplit, JobConf job,
     Reporter reporter) throws IOException {
   // TODO Auto-generated method stub
   reporter.setStatus(genericSplit.toString());
   return new CombineFileRecordReader(job, (CombineFileSplit) genericSplit,
       reporter, CombineFileRecordReader.class);
 }

 It doesn't seem to be working. I would really appreciate it if someone
 could shed some light on this.

 thanks
 zhenyu



Speakers and Schedule for Berlin Buzzwords 2010 - Search, Store and Scale - June 7th/8th 2010

2010-05-10 Thread Isabel Drost
Hi folks,

we proudly present the Berlin Buzzwords talks and presentations.
As promised, there are tracks specific to the three tags: search, store
and scale. We have a fantastic mixture of developers and users of the open
source software projects that make scaling data processing possible today.

There are Steve Loughran, Aaron Kimball and Stefan Groschupf from the
Apache Hadoop community. We have Grant Ingersoll, Robert Muir and the
Generics Policeman Uwe Schindler from the Lucene community.

For those interested in NoSQL databases there is Mathias Stearn from
MongoDB, Jan Lehnardt from CouchDB and Eric Evans, the guy who coined
the term NoSQL one year ago.

We have just published the initial version of the schedule here:

http://berlinbuzzwords.de/content/schedule-published

It looks like we have a fantastic set of talks and speakers for
Buzzwords.

Visit us at http://berlinbuzzwords.de and register for the conference -
looking forward to seeing you in Berlin this summer!

If you like the event, please tell your friends - help spread the word
on Berlin Buzzwords.

Thanks to Jan Lehnardt, Simon Willnauer and newthinking communications
for co-organising the event.


Isabel


Fully distribute TextInputFormat...

2010-05-10 Thread Pierre ANCELOT
Hi folks :)
I have one big file... I read it with FileInputFormat, and this generates
only one task, which of course doesn't get distributed across the cluster
nodes.
Should I use another input class, or do I have a bug in my implementation?

The desired behavior is one task per line.

Thanks.



-- 
http://www.neko-consulting.com
Ego sum quis ego servo
Je suis ce que je protège
I am what I protect


Re: Fully distribute TextInputFormat...

2010-05-10 Thread Jeff Zhang
What's the format of this file? gzip can't be split.



On Mon, May 10, 2010 at 5:21 AM, Pierre ANCELOT pierre...@gmail.com wrote:
 Hi folks :)
 I have one big file... I read it with FileInputFormat, and this generates
 only one task, which of course doesn't get distributed across the cluster
 nodes.
 Should I use another input class, or do I have a bug in my implementation?

 The desired behavior is one task per line.

 Thanks.



 --
 http://www.neko-consulting.com
 Ego sum quis ego servo
 Je suis ce que je protège
 I am what I protect




-- 
Best Regards

Jeff Zhang


Re: Fully distribute TextInputFormat...

2010-05-10 Thread Pierre ANCELOT
Simple, pure raw ASCII text. One line == one treatment to perform.



On Mon, May 10, 2010 at 2:52 PM, Jeff Zhang zjf...@gmail.com wrote:

 What's the format of this file? gzip can't be split.



 On Mon, May 10, 2010 at 5:21 AM, Pierre ANCELOT pierre...@gmail.com
 wrote:
  Hi folks :)
  I have one big file... I read it with FileInputFormat, and this generates
  only one task, which of course doesn't get distributed across the cluster
  nodes.
  Should I use another input class, or do I have a bug in my implementation?
 
  The desired behavior is one task per line.
 
  Thanks.
 
 
 
  --
  http://www.neko-consulting.com
  Ego sum quis ego servo
  Je suis ce que je protège
  I am what I protect
 



 --
 Best Regards

 Jeff Zhang




-- 
http://www.neko-consulting.com
Ego sum quis ego servo
Je suis ce que je protège
I am what I protect


Re: Fully distribute TextInputFormat...

2010-05-10 Thread Pierre ANCELOT
The idea is that I want to share the lines of the file equally among the nodes...



On Mon, May 10, 2010 at 3:05 PM, Pierre ANCELOT pierre...@gmail.com wrote:

 Simple, pure raw ASCII text. One line == one treatment to perform.




 On Mon, May 10, 2010 at 2:52 PM, Jeff Zhang zjf...@gmail.com wrote:

 What's the format of this file? gzip can't be split.



 On Mon, May 10, 2010 at 5:21 AM, Pierre ANCELOT pierre...@gmail.com
 wrote:
  Hi folks :)
  I have one big file... I read it with FileInputFormat, and this generates
  only one task, which of course doesn't get distributed across the cluster
  nodes.
  Should I use another input class, or do I have a bug in my implementation?
 
  The desired behavior is one task per line.
 
  Thanks.
 
 
 
  --
  http://www.neko-consulting.com
  Ego sum quis ego servo
  Je suis ce que je protège
  I am what I protect
 



 --
 Best Regards

 Jeff Zhang




 --
 http://www.neko-consulting.com
 Ego sum quis ego servo
 Je suis ce que je protège
 I am what I protect




-- 
http://www.neko-consulting.com
Ego sum quis ego servo
Je suis ce que je protège
I am what I protect


Re: Fully distribute TextInputFormat...

2010-05-10 Thread Ted Yu
NLineInputFormat seems like a fit for your need.
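With the old mapred API the setup looks roughly like this (a sketch; MyJob
and the paths are placeholders, and the property name is the 0.20-era key):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

JobConf conf = new JobConf(MyJob.class);
conf.setInputFormat(NLineInputFormat.class);
// One input line per map task; raise this if per-task overhead dominates.
conf.setInt("mapred.line.input.format.linespermap", 1);
FileInputFormat.setInputPaths(conf, new Path("/in/bigfile.txt"));
FileOutputFormat.setOutputPath(conf, new Path("/out"));
JobClient.runJob(conf);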
On Mon, May 10, 2010 at 6:05 AM, Pierre ANCELOT pierre...@gmail.com wrote:

 Simple, pure raw ASCII text. One line == one treatment to perform.



 On Mon, May 10, 2010 at 2:52 PM, Jeff Zhang zjf...@gmail.com wrote:

  What's the format of this file? gzip can't be split.
 
 
 
  On Mon, May 10, 2010 at 5:21 AM, Pierre ANCELOT pierre...@gmail.com
  wrote:
   Hi folks :)
   I have one big file... I read it with FileInputFormat, and this generates
   only one task, which of course doesn't get distributed across the cluster
   nodes.
   Should I use another input class, or do I have a bug in my implementation?
  
   The desired behavior is one task per line.
  
   Thanks.
  
  
  
   --
   http://www.neko-consulting.com
   Ego sum quis ego servo
   Je suis ce que je protège
   I am what I protect
  
 
 
 
  --
  Best Regards
 
  Jeff Zhang
 



 --
 http://www.neko-consulting.com
 Ego sum quis ego servo
 Je suis ce que je protège
 I am what I protect



This list's spam filter

2010-05-10 Thread Oscar Gothberg
Hi,

I've been trying all morning to post a Hadoop question to this list but
can't get through the spam filter. At a loss.

Does anyone have any idea what might trigger it? What can I do to keep it
from tagging me?

Thanks,
/ Oscar


Re: This list's spam filter

2010-05-10 Thread Todd Lipcon
Try sending plain-text email instead of rich text - the spam scoring for
HTML email is overly aggressive on the Apache listservs.

-Todd

On Mon, May 10, 2010 at 11:14 AM, Oscar Gothberg
oscar.gothb...@gmail.com wrote:
 Hi,

 I've been trying all morning to post a Hadoop question to this list but
 can't get through the spam filter. At a loss.

 Does anyone have any idea what might trigger it? What can I do to keep it
 from tagging me?

 Thanks,
 / Oscar




-- 
Todd Lipcon
Software Engineer, Cloudera


job executions fail with NotReplicatedYetException

2010-05-10 Thread Oscar Gothberg
Hi,

I keep having jobs fail at the very end, with the map 100% complete and
the reduce 100% complete, due to a NotReplicatedYetException on the
_temporary subdirectory of the job output directory.

It doesn't happen every time, so it's not trivially reproducible, but it
happens often enough (10-20% of runs) to make it a real pain.

Any ideas? Has anyone seen something similar? Part of the stack trace:

NotReplicatedYetException: Not replicated yet:/test/out/dayperiod=14731/_temporary/_attempt_201005052338_0194_r_01_0/part-1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1253)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
...

Thanks,
/ Oscar


Re: This list's spam filter

2010-05-10 Thread Oscar Gothberg
Thanks a lot! Didn't even notice that Gmail would default to HTML format.

/ Oscar

On Mon, May 10, 2010 at 11:15 AM, Todd Lipcon t...@cloudera.com wrote:
 Try sending plain-text email instead of rich text - the spam scoring for
 HTML email is overly aggressive on the Apache listservs.

 -Todd

 On Mon, May 10, 2010 at 11:14 AM, Oscar Gothberg
 oscar.gothb...@gmail.com wrote:
 Hi,

 I've been trying all morning to post a Hadoop question to this list but
 can't get through the spam filter. At a loss.

 Does anyone have any idea what might trigger it? What can I do to keep it
 from tagging me?

 Thanks,
 / Oscar




 --
 Todd Lipcon
 Software Engineer, Cloudera



Re: Fully distribute TextInputFormat...

2010-05-10 Thread Edward Capriolo
If you're curious, I found out this morning that NLineInputFormat has not
been ported to the current new mapreduce API yet (it might be in trunk), so
using NLineInputFormat forces you into the older mapred API.

Edward

On Mon, May 10, 2010 at 12:35 PM, Ted Yu yuzhih...@gmail.com wrote:

 NLineInputFormat seems like a fit for your need.
 On Mon, May 10, 2010 at 6:05 AM, Pierre ANCELOT pierre...@gmail.com
 wrote:

  Simple, pure raw ASCII text. One line == one treatment to perform.
 
 
 
  On Mon, May 10, 2010 at 2:52 PM, Jeff Zhang zjf...@gmail.com wrote:
 
   What's the format of this file? gzip can't be split.
  
  
  
   On Mon, May 10, 2010 at 5:21 AM, Pierre ANCELOT pierre...@gmail.com
   wrote:
Hi folks :)
I have one big file... I read it with FileInputFormat, and this generates
only one task, which of course doesn't get distributed across the cluster
nodes.
Should I use another input class, or do I have a bug in my implementation?
   
The desired behavior is one task per line.
   
Thanks.
   
   
   
--
http://www.neko-consulting.com
Ego sum quis ego servo
Je suis ce que je protège
I am what I protect
   
  
  
  
   --
   Best Regards
  
   Jeff Zhang
  
 
 
 
  --
  http://www.neko-consulting.com
  Ego sum quis ego servo
  Je suis ce que je protège
  I am what I protect
 



Any plans to provide 0.20.x patch for HADOOP-4584 - Slow generation of Block Report at DataNode causes delay of sending heartbeat to NameNode

2010-05-10 Thread Jon Graham
Hello Everyone,

Is there a patch available for HADOOP-4584 that can be used on 0.20.2?

Link https://issues.apache.org/jira/browse/HADOOP-4584 seems to indicate
that a patch is available for version 0.21, but that version has not been
released yet.

Block reports are taking several minutes on our cluster, and this causes
timeouts and lots of retries.

Thanks for your help,
Jon


Best Way Repartitioning a 100-150TB Table

2010-05-10 Thread Matias Silva
Hi Everyone, thanks for your time. What's the best way to repartition one
table into 3 partitions using a replication factor of 3? We have anywhere
between 100-150 TB in this table. I would like to avoid copying data over.
Any suggestions?

From what I understand, Hadoop/Hive is file based; is there any chance to
define the new table with the new partitions and then perform a mv instead
of a cp on the data? Am I dreaming?

Thanks,
Matt


Questions about SequenceFiles

2010-05-10 Thread Ananth Sarathy
My team and I were working with sequence files and were using the
LuceneDocumentWrapper. But when I try to read the value, I get a
NoSuchMethodException from ReflectionUtils, because it tries to call a
default constructor that doesn't exist for that class.

So my question is whether there is documentation on, or limitations to, the
types of objects that can be used with a SequenceFile beyond implementing
the Writable interface. I want to know if maybe I am trying to read from
the file in the wrong way.
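
For reference, the kind of read pattern that triggers this looks roughly
like the sketch below (the path is illustrative). ReflectionUtils
instantiates the key and value classes recorded in the file header, and
that is where the missing no-arg constructor bites:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Reader reader =
    new SequenceFile.Reader(fs, new Path("/data/example.seq"), conf);
// Both calls below fail if the stored key/value classes lack a default
// (no-arg) constructor, since they are instantiated reflectively.
Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
while (reader.next(key, val)) {
  // process key / val here
}
reader.close();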
Ananth T Sarathy


Re: Best Way Repartitioning a 100-150TB Table

2010-05-10 Thread Patrick Angeles
Matias,

Hive partitions map to subdirectories in HDFS. You can do a 'mv' if you're
lucky enough to have each partition in a distinct HDFS file that could be
moved to the right partition subdirectory. Otherwise, you can run a
MapReduce job to collate your data into separate files per partition. You
can use MultipleOutputs to do this for you. Hive trunk also supports
dynamic partitions, which would allow you to do this with a Hive statement
instead of writing Java MR code.
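
As a rough illustration of the collate-per-partition idea, here is a sketch
using MultipleTextOutputFormat from the old mapred API (a close relative of
MultipleOutputs; the class name and key layout below are made up for the
example). It routes each reduce output record into a file under a
subdirectory named after its partition value, assuming the reduce key
carries that value:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Reduce output: key = partition value (e.g. "dt=2010-05-01"), value = row.
public class PartitionOutputFormat
    extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value,
      String leafName) {
    // Writes to <output dir>/<partition value>/<leafName>, giving one
    // subdirectory per partition value for the new table to pick up.
    return key.toString() + "/" + leafName;
  }
}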

Regards,

- Patrick

On Mon, May 10, 2010 at 7:55 PM, Matias Silva msi...@specificmedia.com wrote:

 Hi Everyone, thanks for your time. What's the best way to repartition one
 table into 3 partitions using a replication factor of 3? We have anywhere
 between 100-150 TB in this table. I would like to avoid copying data over.
 Any suggestions?

 From what I understand, Hadoop/Hive is file based; is there any chance to
 define the new table with the new partitions and then perform a mv instead
 of a cp on the data? Am I dreaming?

 Thanks,
 Matt


Re: Fully distribute TextInputFormat...

2010-05-10 Thread himanshu chandola
Actually, would you ever have a case where no splitting is needed? Just
curious.

It seems that you would either use LZO or no compression at all.

H

- Original Message 
From: Alex Baranov alex.barano...@gmail.com
To: common-user@hadoop.apache.org
Sent: Mon, May 10, 2010 4:27:11 PM
Subject: Re: Fully distribute TextInputFormat...

If I'm not mistaken, LZO compression is better suited when splitting is
needed, not gzip.

Alex Baranau

http://sematext.com

On Mon, May 10, 2010 at 3:52 PM, Jeff Zhang zjf...@gmail.com wrote:

 What's the format of this file? gzip can't be split.



 On Mon, May 10, 2010 at 5:21 AM, Pierre ANCELOT pierre...@gmail.com
 wrote:
  Hi folks :)
  I have one big file... I read it with FileInputFormat, and this generates
  only one task, which of course doesn't get distributed across the cluster
  nodes.
  Should I use another input class, or do I have a bug in my implementation?
 
  The desired behavior is one task per line.
 
  Thanks.
 
 
 
  --
  http://www.neko-consulting.com
  Ego sum quis ego servo
  Je suis ce que je protège
  I am what I protect
 



 --
 Best Regards

 Jeff Zhang


Re: Fully distribute TextInputFormat...

2010-05-10 Thread Alex Baranov
I meant splitting a very large file to distribute it over multiple map tasks.

Alex.

http://sematext.com

On Tue, May 11, 2010 at 6:13 AM, himanshu chandola 
himanshu_cool...@yahoo.com wrote:

 Actually, would you ever have a case where no splitting is needed? Just
 curious.

 It seems that you would either use LZO or no compression at all.

 H

 - Original Message 
 From: Alex Baranov alex.barano...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Mon, May 10, 2010 4:27:11 PM
 Subject: Re: Fully distribute TextInputFormat...

 If I'm not mistaken, LZO compression is better suited when splitting is
 needed, not gzip.

 Alex Baranau

 http://sematext.com

 On Mon, May 10, 2010 at 3:52 PM, Jeff Zhang zjf...@gmail.com wrote:

   What's the format of this file? gzip can't be split.
 
 
 
  On Mon, May 10, 2010 at 5:21 AM, Pierre ANCELOT pierre...@gmail.com
  wrote:
   Hi folks :)
    I have one big file... I read it with FileInputFormat, and this generates
    only one task, which of course doesn't get distributed across the cluster
    nodes.
    Should I use another input class, or do I have a bug in my implementation?
  
   The desired behavior is one task per line.
  
   Thanks.
  
  
  
   --
   http://www.neko-consulting.com
   Ego sum quis ego servo
   Je suis ce que je protège
   I am what I protect
  
 
 
 
  --
  Best Regards
 
  Jeff Zhang