Re: IdentityReducer is called instead of my own

2011-02-02 Thread Christian Kunz
I don't know of a transition guide, but I found a tutorial based on the new api:

http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html

I did not use it, but it might be useful.

BTW, you could use the

@Override

annotation when overriding any method to let the compiler detect issues like 
yours.

-Christian
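
For reference, a minimal sketch of a new-API reducer with @Override in place (the Text key/value types and the method body are assumptions, not the actual InvIndexReducer): if the method signature ever drifts from Reducer.reduce(), the annotation turns the silent mismatch into a compile error.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Concatenate all values for the key; the body is illustrative only.
        StringBuilder merged = new StringBuilder();
        for (Text value : values) {
            merged.append(value.toString()).append(' ');
        }
        context.write(key, new Text(merged.toString().trim()));
    }
}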

On Feb 2, 2011, at 2:06 PM, Marco Didonna wrote:

> On 02/02/2011 10:23 PM, Christian Kunz wrote:
>> Without seeing the source code of the reduce method of  the InvIndexReduce 
>> class
>> my best guess would be that the signature is incorrect. I saw this happen 
>> when migrating from old to new api:
>> 
>> protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
>> 
>> (Iterable, not Iterator as in the old api)
>> 
>> -Christian
>> 
> 
> My God you're right...are you a sensitive or what :) :)? I am pretty new
> to Hadoop and I have studied Tom White's book, which is based on the
> old API (too bad)...I find the migration to the new API quite
> difficult...Is there a transition guide?
> 
> Thank you very very much...I was going to call an exorcist since there
> was no warning, no error whatsoever...
> 
> Marco
> 



Re: IdentityReducer is called instead of my own

2011-02-02 Thread Marco Didonna
On 02/02/2011 10:23 PM, Christian Kunz wrote:
> Without seeing the source code of the reduce method of  the InvIndexReduce 
> class
> my best guess would be that the signature is incorrect. I saw this happen 
> when migrating from old to new api:
> 
> protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
> 
> (Iterable, not Iterator as in the old api)
> 
> -Christian
> 

My God you're right...are you a sensitive or what :) :)? I am pretty new
to Hadoop and I have studied Tom White's book, which is based on the
old API (too bad)...I find the migration to the new API quite
difficult...Is there a transition guide?

Thank you very very much...I was going to call an exorcist since there
was no warning, no error whatsoever...

Marco



Re: IdentityReducer is called instead of my own

2011-02-02 Thread James Seigel
Share code from your mapper?

Check to see if there are any errors on the job tracker reports that might 
indicate the inability to find the class.

James.

On 2011-02-02, at 2:23 PM, Christian Kunz wrote:

> Without seeing the source code of the reduce method of  the InvIndexReduce 
> class
> my best guess would be that the signature is incorrect. I saw this happen 
> when migrating from old to new api:
> 
> protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
> 
> (Iterable, not Iterator as in the old api)
> 
> -Christian
> 
> 
> On Feb 2, 2011, at 12:22 PM, Marco Didonna wrote:
> 
>> Hello everybody,
>> I am experiencing a weird issue: I have written a small hadoop program
>> and I am launching it using this https://gist.github.com/808297
>> JobDriver. Strangely InvIndexReducer is never invoked and the default
>> reducer kicks in. I really cannot understand where the problem could be:
>> as you can see I am using the new version of the api, (hadoop >= 0.20).
>> 
>> Any help is appreciated
>> 
>> MD
>> 
> 



Re: IdentityReducer is called instead of my own

2011-02-02 Thread Christian Kunz
Without seeing the source code of the reduce method of  the InvIndexReduce class
my best guess would be that the signature is incorrect. I saw this happen when 
migrating from old to new api:

protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)

(Iterable, not Iterator as in the old api)

-Christian
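
For contrast, a sketch of the kind of mismatch described above (class and type names are assumptions): a reduce method declared with the old-style Iterator parameter compiles inside a new-API Reducer subclass, but it is an overload rather than an override, so the framework keeps calling the inherited reduce, which simply passes every key/value pair through unchanged.

import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BrokenReducer extends Reducer<Text, Text, Text, Text> {

    // Iterator instead of Iterable: this never overrides Reducer.reduce(),
    // so Hadoop keeps using the default implementation, which just emits
    // the input key/value pairs unchanged.
    protected void reduce(Text key, Iterator<Text> values, Context context) {
        // never invoked by the framework
    }
}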


On Feb 2, 2011, at 12:22 PM, Marco Didonna wrote:

> Hello everybody,
> I am experiencing a weird issue: I have written a small hadoop program
> and I am launching it using this https://gist.github.com/808297
> JobDriver. Strangely InvIndexReducer is never invoked and the default
> reducer kicks in. I really cannot understand where the problem could be:
> as you can see I am using the new version of the api, (hadoop >= 0.20).
> 
> Any help is appreciated
> 
> MD
> 



the overhead of HDFS

2011-02-02 Thread Da Zheng

Hello,

I have been using Hadoop on a cluster with AMD Opteron 2212 processors 
clocked at 2 GHz and also on a cluster with Atom N330 processors clocked 
at 1.6 GHz. Both are dual-core. I always use HDFS for storing input and 
output data, and I observe high CPU consumption caused by HDFS in both 
clusters. In the AMD cluster, the bottleneck is the disk. I use TestDFSIO 
to test the performance. The write throughput to HDFS is about 50 MB/s 
when the replication factor is 1 and each node runs one mapper, but the 
CPU consumption is about 50% for the DataNode and about 40% for the 
TestDFSIO mapper. When I test the Atom cluster, the bottleneck is the 
CPU. I used the same settings and got similar write throughput, but the 
CPU consumption is close to 100% for the DataNode and the mapper. Could 
anyone tell me what the CPU usage is in your cluster?


Thanks,
Da


Re: file:/// has no authority

2011-02-02 Thread danoomistmatiste

I managed to fix this issue.   It had to do with permissions on the default
directory.  

danoomistmatiste wrote:
> 
> Hi,  I have setup a Hadoop cluster as per the instructions for CDH3.  
> When I try to start the datanode on the slave, I get this error,
> 
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> java.lang.IllegalArgumentException: Invalid URI for NameNode address
> (check fs.defaultFS): file:/// has no authority.
> 
> I have set up the right parameters in core-site.xml,
> where <namenode> is the IP address where the namenode is running:
> 
> <configuration>
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://<namenode>:54310</value>
>   </property>
> </configuration>
> 

-- 
View this message in context: 
http://old.nabble.com/file%3Ahas-no-authority-tp30813534p30830359.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: custom writable classes

2011-02-02 Thread David Sinclair
You can easily make a custom Writable by delegating to the existing writables.
For example, if your writable is just a bunch of strings, keep Text fields in
your class and call their read/write methods from your own, like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

class MyWritable implements Writable {
    // Text already knows how to serialize itself, so just delegate to it.
    private Text fieldA = new Text();
    private Text fieldB = new Text();

    // getters/setters omitted

    public void write(DataOutput dataOutput) throws IOException {
        fieldA.write(dataOutput);
        fieldB.write(dataOutput);
    }

    public void readFields(DataInput dataInput) throws IOException {
        fieldA.readFields(dataInput);
        fieldB.readFields(dataInput);
    }
}

dave

On Wed, Feb 2, 2011 at 3:34 PM, Adeel Qureshi wrote:

> huh, this is interesting .. obviously I am not thinking about this whole thing
> right ..
>
> so in your mapper you parse the line into tokens and set the appropriate
> values on your writable by constructor or setters .. and let hadoop do all
> the serialization and deserialization .. and you tell hadoop how to do that
> by the read and write methods .. okay that makes more sense .. one last
> thing i still dont understand is what is the proper implementation of read
> and write methods .. if i have a bunch of strings in my writable then what
> should be the read method implementation ..
>
> I really appreciate the help from all you guys ..
>
> On Wed, Feb 2, 2011 at 12:52 PM, David Sinclair <
> dsincl...@chariotsolutions.com> wrote:
>
> > So create your writable as normal, and hadoop takes care of the
> > serialization/deserialization between mappers and reducers.
> >
> > For example, MyWritable is the same as you had previously, then in your
> > mapper output that writable
> >
> > class MyMapper extends Mapper<LongWritable, Text, LongWritable, MyWritable>
> > {
> >
> >private MyWritable writable =new MyWritable();
> >
> >protected void map(LongWritable key, Text value, Context context)
> throws
> > IOException, InterruptedException {
> >// parse text
> >writable.setCounter(parseddata);
> >writable.setTimestamp(parseddata);
> >
> >// don't know what your key is
> >context.write(key, writable);
> >}
> > }
> >
> > and make sure you set the key/value output
> >
> > job.setMapOutputKeyClass(LongWritable .class);
> > job.setMapOutputValueClass(MyWritable.class);
> >
> > dave
> >
> >
> > On Wed, Feb 2, 2011 at 1:39 PM, Adeel Qureshi  > >wrote:
> >
> > > i m reading text data and outputting text data so yeah its all text ..
> > the
> > > reason why i wanted to use custom writable classes is not for the
> mapper
> > > purposes .. you are right .. the easiest thing for is to receive the
> > > LongWritable and Text input in the mapper ... parse the text .. and
> deal
> > > with it .. but where I am having trouble is in passing the parsed
> > > information to the reducer .. right now I am putting a bunch of things
> as
> > > text and sending the same LongWritable and Text output to reducer but
> my
> > > text includes a bunch of things e.g. several fields separated by a
> > > delimiter
> > > .. this is the part that I am trying to improve .. instead of sending a
> > > bunch of delimited text I want to send an actual object to my reducer
> > >
> > > On Wed, Feb 2, 2011 at 12:33 PM, David Sinclair <
> > > dsincl...@chariotsolutions.com> wrote:
> > >
> > > > Are you storing your data as text or binary?
> > > >
> > > > If you are storing as text, your mapper is going to get Keys of
> > > > type LongWritable and values of type Text. Inside your mapper you
> would
> > > > parse out the strings and wouldn't be using your custom writable;
> that
> > is
> > > > unless you wanted your mapper/reducer to produce these.
> > > >
> > > > If you are storing as Binary, e.g. SequenceFiles, you use
> > > > the SequenceFileInputFormat and the sequence file reader will create
> > the
> > > > writables according to the mapper.
> > > >
> > > > dave
> > > >
> > > > On Wed, Feb 2, 2011 at 1:16 PM, Adeel Qureshi <
> adeelmahm...@gmail.com
> > > > >wrote:
> > > >
> > > > > okay so then the main question is how do I get the input line .. so
> > > that
> > > > I
> > > > > could parse it .. I am assuming it will then be passed to me in via
> > > data
> > > > > input stream ..
> > > > >
> > > > > So in my readFields function .. I am assuming I will get the whole
> > line
> > > > ..
> > > > > then I can parse it out and set my params .. something like this
> > > > >
> > > > > readFields(){
> > > > >  String line = in.readLine(); read the whole line
> > > > >
> > > > >  //now apply the regular expression to parse it out
> > > > >  data = pattern.group(1);
> > > > >  time = pattern.group(2);
> > > > >  user = pattern.group(3);
> > > > > }
> > > > >
> > > > > Is that right ???
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Feb 2, 2011 at 12:11 PM, Vijay  wrote:
> > > > >
> > > > > > Hadoop is not going to parse the line for you. Your mapper will
> > take
> > > > the
> > > > > > line, parse it and then turn it into your Writable so the next
> > phase
> > > > ca

Re: custom writable classes

2011-02-02 Thread Adeel Qureshi
huh, this is interesting .. obviously I am not thinking about this whole thing
right ..

so in your mapper you parse the line into tokens and set the appropriate
values on your writable by constructor or setters .. and let hadoop do all
the serialization and deserialization .. and you tell hadoop how to do that
by the read and write methods .. okay that makes more sense .. one last
thing i still dont understand is what is the proper implementation of read
and write methods .. if i have a bunch of strings in my writable then what
should be the read method implementation ..

I really appreciate the help from all you guys ..

On Wed, Feb 2, 2011 at 12:52 PM, David Sinclair <
dsincl...@chariotsolutions.com> wrote:

> So create your writable as normal, and hadoop takes care of the
> serialization/deserialization between mappers and reducers.
>
> For example, MyWritable is the same as you had previously, then in your
> mapper output that writable
>
> class MyMapper extends Mapper<LongWritable, Text, LongWritable, MyWritable>
> {
>
>private MyWritable writable =new MyWritable();
>
>protected void map(LongWritable key, Text value, Context context) throws
> IOException, InterruptedException {
>// parse text
>writable.setCounter(parseddata);
>writable.setTimestamp(parseddata);
>
>// don't know what your key is
>context.write(key, writable);
>}
> }
>
> and make sure you set the key/value output
>
> job.setMapOutputKeyClass(LongWritable .class);
> job.setMapOutputValueClass(MyWritable.class);
>
> dave
>
>
> On Wed, Feb 2, 2011 at 1:39 PM, Adeel Qureshi  >wrote:
>
> > i m reading text data and outputting text data so yeah its all text ..
> the
> > reason why i wanted to use custom writable classes is not for the mapper
> > purposes .. you are right .. the easiest thing for is to receive the
> > LongWritable and Text input in the mapper ... parse the text .. and deal
> > with it .. but where I am having trouble is in passing the parsed
> > information to the reducer .. right now I am putting a bunch of things as
> > text and sending the same LongWritable and Text output to reducer but my
> > text includes a bunch of things e.g. several fields separated by a
> > delimiter
> > .. this is the part that I am trying to improve .. instead of sending a
> > bunch of delimited text I want to send an actual object to my reducer
> >
> > On Wed, Feb 2, 2011 at 12:33 PM, David Sinclair <
> > dsincl...@chariotsolutions.com> wrote:
> >
> > > Are you storing your data as text or binary?
> > >
> > > If you are storing as text, your mapper is going to get Keys of
> > > type LongWritable and values of type Text. Inside your mapper you would
> > > parse out the strings and wouldn't be using your custom writable; that
> is
> > > unless you wanted your mapper/reducer to produce these.
> > >
> > > If you are storing as Binary, e.g. SequenceFiles, you use
> > > the SequenceFileInputFormat and the sequence file reader will create
> the
> > > writables according to the mapper.
> > >
> > > dave
> > >
> > > On Wed, Feb 2, 2011 at 1:16 PM, Adeel Qureshi  > > >wrote:
> > >
> > > > okay so then the main question is how do I get the input line .. so
> > that
> > > I
> > > > could parse it .. I am assuming it will then be passed to me in via
> > data
> > > > input stream ..
> > > >
> > > > So in my readFields function .. I am assuming I will get the whole
> line
> > > ..
> > > > then I can parse it out and set my params .. something like this
> > > >
> > > > readFields(){
> > > >  String line = in.readLine(); read the whole line
> > > >
> > > >  //now apply the regular expression to parse it out
> > > >  data = pattern.group(1);
> > > >  time = pattern.group(2);
> > > >  user = pattern.group(3);
> > > > }
> > > >
> > > > Is that right ???
> > > >
> > > >
> > > >
> > > > On Wed, Feb 2, 2011 at 12:11 PM, Vijay  wrote:
> > > >
> > > > > Hadoop is not going to parse the line for you. Your mapper will
> take
> > > the
> > > > > line, parse it and then turn it into your Writable so the next
> phase
> > > can
> > > > > just work with your object.
> > > > >
> > > > > Thanks,
> > > > > Vijay
> > > > > On Feb 2, 2011 9:51 AM, "Adeel Qureshi" 
> > > wrote:
> > > > > > thanks for your reply .. so lets say my input files are formatted
> > > like
> > > > > this
> > > > > >
> > > > > > each line looks like this
> > > > > > DATE TIME SERVER USER URL QUERY PORT ...
> > > > > >
> > > > > > so to read this I would create a writable mapper
> > > > > >
> > > > > > public class MyMapper implements Writable {
> > > > > > Date date
> > > > > > long time
> > > > > > String server
> > > > > > String user
> > > > > > String url
> > > > > > String query
> > > > > > int port
> > > > > >
> > > > > > readFields(){
> > > > > > date = readDate(in); //not concerned with the actual date reading
> > > > > function
> > > > > > time = readLong(in);
> > > > > > server = readText(in);
> > > > > > .
> > > > > > }
> > > > > > }
> > > > > >
> > > > > > but I still dont understand how is h

IdentityReducer is called instead of my own

2011-02-02 Thread Marco Didonna
Hello everybody,
I am experiencing a weird issue: I have written a small hadoop program
and I am launching it using this https://gist.github.com/808297
JobDriver. Strangely InvIndexReducer is never invoked and the default
reducer kicks in. I really cannot understand where the problem could be:
as you can see I am using the new version of the api, (hadoop >= 0.20).

Any help is appreciated

MD



Re: MRUnit and Herriot

2011-02-02 Thread Konstantin Boudnik
(Moving to common-user where this belongs)

Herriot is a system test framework which runs against a real physical
cluster deployed with a specially crafted build of Hadoop. That
instrumented build provides extra APIs not available in Hadoop
otherwise. These APIs are created to facilitate cluster software
testability. Herriot isn't limited to MR; it also covers (although to
a somewhat lesser extent) the HDFS side of Hadoop.

MRUnit is for MR job "unit" testing, as in making sure that your MR job
is OK and/or allowing you to debug it locally before deploying at scale.

So, long story short - they are very different ;) Herriot can do
intricate fault injection and can work closely with a deployed cluster
(say, controlling Hadoop nodes and daemons); MRUnit is focused on
testing MR jobs.

Hope it helps.
--
  Take care,
Konstantin (Cos) Boudnik
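
For a rough picture of the MRUnit side, a local mapper test looks roughly like the sketch below. This assumes MRUnit's fluent MapDriver API (withMapper/withInput/withOutput/runTest) and its new-API package; exact class locations have varied between releases, and WordCountMapper is a hypothetical mapper under test.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

    @Test
    public void emitsOneCountPerWord() {
        // Runs a single mapper in-process; no cluster, HDFS or daemons involved.
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
                new MapDriver<LongWritable, Text, Text, IntWritable>();
        driver.withMapper(new WordCountMapper())   // hypothetical mapper under test
              .withInput(new LongWritable(0), new Text("hadoop hadoop"))
              .withOutput(new Text("hadoop"), new IntWritable(1))
              .withOutput(new Text("hadoop"), new IntWritable(1))
              .runTest();
    }
}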


On Wed, Feb 2, 2011 at 05:44, Edson Ramiro  wrote:
> Hi all,
>
> Plz, could you explain me the difference between MRUnit and Herriot?
>
> I've read the documentation of both and they seem very similar to me.
>
> Is Herriot an evolution of MRUnit?
>
> What can Herriot do that MRUnit can't?
>
> Thanks in Advance
>
> --
> Edson Ramiro Lucas Filho
> {skype, twitter, gtalk}: erlfilho
> http://www.inf.ufpr.br/erlf07/
>


Re: custom writable classes

2011-02-02 Thread David Sinclair
So create your writable as normal, and hadoop takes care of the
serialization/deserialization between mappers and reducers.

For example, MyWritable is the same as you had previously, then in your
mapper output that writable

class MyMapper extends Mapper<LongWritable, Text, LongWritable, MyWritable>
{

private MyWritable writable =new MyWritable();

protected void map(LongWritable key, Text value, Context context) throws
IOException, InterruptedException {
// parse text
writable.setCounter(parseddata);
writable.setTimestamp(parseddata);

// don't know what your key is
context.write(key, writable);
}
}

and make sure you set the key/value output

job.setMapOutputKeyClass(LongWritable .class);
job.setMapOutputValueClass(MyWritable.class);

dave
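
On the reducer side, a sketch under the same assumptions (getCounter() is a hypothetical getter on MyWritable, and the job would also need job.setReducerClass(...)): the values then arrive as real objects rather than delimited text.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyWritableReducer
        extends Reducer<LongWritable, MyWritable, LongWritable, Text> {

    @Override
    protected void reduce(LongWritable key, Iterable<MyWritable> values, Context context)
            throws IOException, InterruptedException {
        long total = 0;
        for (MyWritable value : values) {
            // The fields were populated in the mapper; no string splitting here.
            total += value.getCounter();   // getCounter() is a hypothetical getter
        }
        context.write(key, new Text(Long.toString(total)));
    }
}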


On Wed, Feb 2, 2011 at 1:39 PM, Adeel Qureshi wrote:

> i m reading text data and outputting text data so yeah its all text .. the
> reason why i wanted to use custom writable classes is not for the mapper
> purposes .. you are right .. the easiest thing for is to receive the
> LongWritable and Text input in the mapper ... parse the text .. and deal
> with it .. but where I am having trouble is in passing the parsed
> information to the reducer .. right now I am putting a bunch of things as
> text and sending the same LongWritable and Text output to reducer but my
> text includes a bunch of things e.g. several fields separated by a
> delimiter
> .. this is the part that I am trying to improve .. instead of sending a
> bunch of delimited text I want to send an actual object to my reducer
>
> On Wed, Feb 2, 2011 at 12:33 PM, David Sinclair <
> dsincl...@chariotsolutions.com> wrote:
>
> > Are you storing your data as text or binary?
> >
> > If you are storing as text, your mapper is going to get Keys of
> > type LongWritable and values of type Text. Inside your mapper you would
> > parse out the strings and wouldn't be using your custom writable; that is
> > unless you wanted your mapper/reducer to produce these.
> >
> > If you are storing as Binary, e.g. SequenceFiles, you use
> > the SequenceFileInputFormat and the sequence file reader will create the
> > writables according to the mapper.
> >
> > dave
> >
> > On Wed, Feb 2, 2011 at 1:16 PM, Adeel Qureshi  > >wrote:
> >
> > > okay so then the main question is how do I get the input line .. so
> that
> > I
> > > could parse it .. I am assuming it will then be passed to me in via
> data
> > > input stream ..
> > >
> > > So in my readFields function .. I am assuming I will get the whole line
> > ..
> > > then I can parse it out and set my params .. something like this
> > >
> > > readFields(){
> > >  String line = in.readLine(); read the whole line
> > >
> > >  //now apply the regular expression to parse it out
> > >  data = pattern.group(1);
> > >  time = pattern.group(2);
> > >  user = pattern.group(3);
> > > }
> > >
> > > Is that right ???
> > >
> > >
> > >
> > > On Wed, Feb 2, 2011 at 12:11 PM, Vijay  wrote:
> > >
> > > > Hadoop is not going to parse the line for you. Your mapper will take
> > the
> > > > line, parse it and then turn it into your Writable so the next phase
> > can
> > > > just work with your object.
> > > >
> > > > Thanks,
> > > > Vijay
> > > > On Feb 2, 2011 9:51 AM, "Adeel Qureshi" 
> > wrote:
> > > > > thanks for your reply .. so lets say my input files are formatted
> > like
> > > > this
> > > > >
> > > > > each line looks like this
> > > > > DATE TIME SERVER USER URL QUERY PORT ...
> > > > >
> > > > > so to read this I would create a writable mapper
> > > > >
> > > > > public class MyMapper implements Writable {
> > > > > Date date
> > > > > long time
> > > > > String server
> > > > > String user
> > > > > String url
> > > > > String query
> > > > > int port
> > > > >
> > > > > readFields(){
> > > > > date = readDate(in); //not concerned with the actual date reading
> > > > function
> > > > > time = readLong(in);
> > > > > server = readText(in);
> > > > > .
> > > > > }
> > > > > }
> > > > >
> > > > > but I still dont understand how is hadoop gonna know to parse my
> line
> > > > into
> > > > > these tokens .. instead of map be using the whole line as one token
> > > > >
> > > > >
> > > > > On Wed, Feb 2, 2011 at 11:42 AM, Harsh J 
> > > wrote:
> > > > >
> > > > >> See it this way:
> > > > >>
> > > > >> readFields(...) provides a DataInput stream that reads bytes from
> a
> > > > >> binary stream, and write(...) provides a DataOutput stream that
> > writes
> > > > >> bytes to a binary stream.
> > > > >>
> > > > >> Now your data-structure may be a complex one, perhaps an array of
> > > > >> items or a mapping of some, or just a set of different types of
> > > > >> objects. All you need to do is to think about how would you
> > > > >> _serialize_ your data structure into a binary stream, so that you
> > may
> > > > >> _de-serialize_ it back from the same stream when required.
> > > > >>
> > > > >> About what goes where, I think looking up the definition of
> > > > >> 'serialization' wil

Re: custom writable classes

2011-02-02 Thread Adeel Qureshi
i m reading text data and outputting text data so yeah its all text .. the
reason why i wanted to use custom writable classes is not for the mapper
purposes .. you are right .. the easiest thing for is to receive the
LongWritable and Text input in the mapper ... parse the text .. and deal
with it .. but where I am having trouble is in passing the parsed
information to the reducer .. right now I am putting a bunch of things as
text and sending the same LongWritable and Text output to reducer but my
text includes a bunch of things e.g. several fields separated by a delimiter
.. this is the part that I am trying to improve .. instead of sending a
bunch of delimited text I want to send an actual object to my reducer

On Wed, Feb 2, 2011 at 12:33 PM, David Sinclair <
dsincl...@chariotsolutions.com> wrote:

> Are you storing your data as text or binary?
>
> If you are storing as text, your mapper is going to get Keys of
> type LongWritable and values of type Text. Inside your mapper you would
> parse out the strings and wouldn't be using your custom writable; that is
> unless you wanted your mapper/reducer to produce these.
>
> If you are storing as Binary, e.g. SequenceFiles, you use
> the SequenceFileInputFormat and the sequence file reader will create the
> writables according to the mapper.
>
> dave
>
> On Wed, Feb 2, 2011 at 1:16 PM, Adeel Qureshi  >wrote:
>
> > okay so then the main question is how do I get the input line .. so that
> I
> > could parse it .. I am assuming it will then be passed to me in via data
> > input stream ..
> >
> > So in my readFields function .. I am assuming I will get the whole line
> ..
> > then I can parse it out and set my params .. something like this
> >
> > readFields(){
> >  String line = in.readLine(); read the whole line
> >
> >  //now apply the regular expression to parse it out
> >  data = pattern.group(1);
> >  time = pattern.group(2);
> >  user = pattern.group(3);
> > }
> >
> > Is that right ???
> >
> >
> >
> > On Wed, Feb 2, 2011 at 12:11 PM, Vijay  wrote:
> >
> > > Hadoop is not going to parse the line for you. Your mapper will take
> the
> > > line, parse it and then turn it into your Writable so the next phase
> can
> > > just work with your object.
> > >
> > > Thanks,
> > > Vijay
> > > On Feb 2, 2011 9:51 AM, "Adeel Qureshi" 
> wrote:
> > > > thanks for your reply .. so lets say my input files are formatted
> like
> > > this
> > > >
> > > > each line looks like this
> > > > DATE TIME SERVER USER URL QUERY PORT ...
> > > >
> > > > so to read this I would create a writable mapper
> > > >
> > > > public class MyMapper implements Writable {
> > > > Date date
> > > > long time
> > > > String server
> > > > String user
> > > > String url
> > > > String query
> > > > int port
> > > >
> > > > readFields(){
> > > > date = readDate(in); //not concerned with the actual date reading
> > > function
> > > > time = readLong(in);
> > > > server = readText(in);
> > > > .
> > > > }
> > > > }
> > > >
> > > > but I still dont understand how is hadoop gonna know to parse my line
> > > into
> > > > these tokens .. instead of map be using the whole line as one token
> > > >
> > > >
> > > > On Wed, Feb 2, 2011 at 11:42 AM, Harsh J 
> > wrote:
> > > >
> > > >> See it this way:
> > > >>
> > > >> readFields(...) provides a DataInput stream that reads bytes from a
> > > >> binary stream, and write(...) provides a DataOutput stream that
> writes
> > > >> bytes to a binary stream.
> > > >>
> > > >> Now your data-structure may be a complex one, perhaps an array of
> > > >> items or a mapping of some, or just a set of different types of
> > > >> objects. All you need to do is to think about how would you
> > > >> _serialize_ your data structure into a binary stream, so that you
> may
> > > >> _de-serialize_ it back from the same stream when required.
> > > >>
> > > >> About what goes where, I think looking up the definition of
> > > >> 'serialization' will help. It is all in the ordering. If you wrote A
> > > >> before B, you read A before B - simple as that.
> > > >>
> > > >> This, or you could use a neat serialization library like Apache Avro
> > > >> (http://avro.apache.org) and solve it in a simpler way with a
> schema.
> > > >> I'd recommend learning/using Avro for all
> > > >> serialization/de-serialization needs. Especially for Hadoop
> use-cases.
> > > >>
> > > >> On Wed, Feb 2, 2011 at 10:51 PM, Adeel Qureshi <
> > adeelmahm...@gmail.com>
> > > >> wrote:
> > > >> > I have been trying to understand how to write a simple custom
> > writable
> > > >> class
> > > >> > and I find the documentation available very vague and unclear
> about
> > > >> certain
> > > >> > things. okay so here is the sample writable implementation in
> > javadoc
> > > of
> > > >> > Writable interface
> > > >> >
> > > >> > public class MyWritable implements Writable {
> > > >> > // Some data
> > > >> > private int counter;
> > > >> > private long timestamp;
> > > >> >
> > > >> > *public void write(DataOutput o

Re: custom writable classes

2011-02-02 Thread David Sinclair
Are you storing your data as text or binary?

If you are storing as text, your mapper is going to get Keys of
type LongWritable and values of type Text. Inside your mapper you would
parse out the strings and wouldn't be using your custom writable; that is
unless you wanted your mapper/reducer to produce these.

If you are storing as Binary, e.g. SequenceFiles, you use
the SequenceFileInputFormat and the sequence file reader will create the
writables according to the mapper.

dave
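
For the binary case, a rough driver sketch (the job name and paths are placeholders; only standard new-API classes are used) showing where SequenceFileInputFormat is plugged in:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SequenceInputDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "sequence-file input example"); // 0.20-era constructor

        // With SequenceFile input the reader re-creates the stored writables,
        // so the mapper can be declared over the custom value type directly
        // instead of over (LongWritable, Text) lines.
        job.setInputFormatClass(SequenceFileInputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // placeholder paths
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}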

On Wed, Feb 2, 2011 at 1:16 PM, Adeel Qureshi wrote:

> okay so then the main question is how do I get the input line .. so that I
> could parse it .. I am assuming it will then be passed to me in via data
> input stream ..
>
> So in my readFields function .. I am assuming I will get the whole line ..
> then I can parse it out and set my params .. something like this
>
> readFields(){
>  String line = in.readLine(); read the whole line
>
>  //now apply the regular expression to parse it out
>  data = pattern.group(1);
>  time = pattern.group(2);
>  user = pattern.group(3);
> }
>
> Is that right ???
>
>
>
> On Wed, Feb 2, 2011 at 12:11 PM, Vijay  wrote:
>
> > Hadoop is not going to parse the line for you. Your mapper will take the
> > line, parse it and then turn it into your Writable so the next phase can
> > just work with your object.
> >
> > Thanks,
> > Vijay
> > On Feb 2, 2011 9:51 AM, "Adeel Qureshi"  wrote:
> > > thanks for your reply .. so lets say my input files are formatted like
> > this
> > >
> > > each line looks like this
> > > DATE TIME SERVER USER URL QUERY PORT ...
> > >
> > > so to read this I would create a writable mapper
> > >
> > > public class MyMapper implements Writable {
> > > Date date
> > > long time
> > > String server
> > > String user
> > > String url
> > > String query
> > > int port
> > >
> > > readFields(){
> > > date = readDate(in); //not concerned with the actual date reading
> > function
> > > time = readLong(in);
> > > server = readText(in);
> > > .
> > > }
> > > }
> > >
> > > but I still dont understand how is hadoop gonna know to parse my line
> > into
> > > these tokens .. instead of map be using the whole line as one token
> > >
> > >
> > > On Wed, Feb 2, 2011 at 11:42 AM, Harsh J 
> wrote:
> > >
> > >> See it this way:
> > >>
> > >> readFields(...) provides a DataInput stream that reads bytes from a
> > >> binary stream, and write(...) provides a DataOutput stream that writes
> > >> bytes to a binary stream.
> > >>
> > >> Now your data-structure may be a complex one, perhaps an array of
> > >> items or a mapping of some, or just a set of different types of
> > >> objects. All you need to do is to think about how would you
> > >> _serialize_ your data structure into a binary stream, so that you may
> > >> _de-serialize_ it back from the same stream when required.
> > >>
> > >> About what goes where, I think looking up the definition of
> > >> 'serialization' will help. It is all in the ordering. If you wrote A
> > >> before B, you read A before B - simple as that.
> > >>
> > >> This, or you could use a neat serialization library like Apache Avro
> > >> (http://avro.apache.org) and solve it in a simpler way with a schema.
> > >> I'd recommend learning/using Avro for all
> > >> serialization/de-serialization needs. Especially for Hadoop use-cases.
> > >>
> > >> On Wed, Feb 2, 2011 at 10:51 PM, Adeel Qureshi <
> adeelmahm...@gmail.com>
> > >> wrote:
> > >> > I have been trying to understand how to write a simple custom
> writable
> > >> class
> > >> > and I find the documentation available very vague and unclear about
> > >> certain
> > >> > things. okay so here is the sample writable implementation in
> javadoc
> > of
> > >> > Writable interface
> > >> >
> > >> > public class MyWritable implements Writable {
> > >> > // Some data
> > >> > private int counter;
> > >> > private long timestamp;
> > >> >
> > >> > *public void write(DataOutput out) throws IOException {
> > >> > out.writeInt(counter);
> > >> > out.writeLong(timestamp);
> > >> > }*
> > >> >
> > >> > * public void readFields(DataInput in) throws IOException {
> > >> > counter = in.readInt();
> > >> > timestamp = in.readLong();
> > >> > }*
> > >> >
> > >> > public static MyWritable read(DataInput in) throws IOException {
> > >> > MyWritable w = new MyWritable();
> > >> > w.readFields(in);
> > >> > return w;
> > >> > }
> > >> > }
> > >> >
> > >> > so in readFields function we are simply saying read an int from the
> > >> > datainput and put that in counter .. and then read a long and put
> that
> > in
> > >> > timestamp variable .. what doesnt makes sense to me is what is the
> > format
> > >> of
> > >> > DataInput here .. what if there are multiple ints and multiple longs
> > ..
> > >> how
> > >> > is the correct int gonna go in counter .. what if the data I am
> > reading
> > >> in
> > >> > my mapper is a string line .. and I am using regular expression to
> > parse
> > >> the
> > >> > tokens .. how do I specify which field goes

Re: custom writable classes

2011-02-02 Thread Adeel Qureshi
okay so then the main question is how do I get the input line .. so that I
could parse it .. I am assuming it will then be passed to me in via data
input stream ..

So in my readFields function .. I am assuming I will get the whole line ..
then I can parse it out and set my params .. something like this

readFields(){
 String line = in.readLine(); read the whole line

 //now apply the regular expression to parse it out
 data = pattern.group(1);
 time = pattern.group(2);
 user = pattern.group(3);
}

Is that right ???



On Wed, Feb 2, 2011 at 12:11 PM, Vijay  wrote:

> Hadoop is not going to parse the line for you. Your mapper will take the
> line, parse it and then turn it into your Writable so the next phase can
> just work with your object.
>
> Thanks,
> Vijay
> On Feb 2, 2011 9:51 AM, "Adeel Qureshi"  wrote:
> > thanks for your reply .. so lets say my input files are formatted like
> this
> >
> > each line looks like this
> > DATE TIME SERVER USER URL QUERY PORT ...
> >
> > so to read this I would create a writable mapper
> >
> > public class MyMapper implements Writable {
> > Date date
> > long time
> > String server
> > String user
> > String url
> > String query
> > int port
> >
> > readFields(){
> > date = readDate(in); //not concerned with the actual date reading
> function
> > time = readLong(in);
> > server = readText(in);
> > .
> > }
> > }
> >
> > but I still dont understand how is hadoop gonna know to parse my line
> into
> > these tokens .. instead of map be using the whole line as one token
> >
> >
> > On Wed, Feb 2, 2011 at 11:42 AM, Harsh J  wrote:
> >
> >> See it this way:
> >>
> >> readFields(...) provides a DataInput stream that reads bytes from a
> >> binary stream, and write(...) provides a DataOutput stream that writes
> >> bytes to a binary stream.
> >>
> >> Now your data-structure may be a complex one, perhaps an array of
> >> items or a mapping of some, or just a set of different types of
> >> objects. All you need to do is to think about how would you
> >> _serialize_ your data structure into a binary stream, so that you may
> >> _de-serialize_ it back from the same stream when required.
> >>
> >> About what goes where, I think looking up the definition of
> >> 'serialization' will help. It is all in the ordering. If you wrote A
> >> before B, you read A before B - simple as that.
> >>
> >> This, or you could use a neat serialization library like Apache Avro
> >> (http://avro.apache.org) and solve it in a simpler way with a schema.
> >> I'd recommend learning/using Avro for all
> >> serialization/de-serialization needs. Especially for Hadoop use-cases.
> >>
> >> On Wed, Feb 2, 2011 at 10:51 PM, Adeel Qureshi 
> >> wrote:
> >> > I have been trying to understand how to write a simple custom writable
> >> class
> >> > and I find the documentation available very vague and unclear about
> >> certain
> >> > things. okay so here is the sample writable implementation in javadoc
> of
> >> > Writable interface
> >> >
> >> > public class MyWritable implements Writable {
> >> > // Some data
> >> > private int counter;
> >> > private long timestamp;
> >> >
> >> > *public void write(DataOutput out) throws IOException {
> >> > out.writeInt(counter);
> >> > out.writeLong(timestamp);
> >> > }*
> >> >
> >> > * public void readFields(DataInput in) throws IOException {
> >> > counter = in.readInt();
> >> > timestamp = in.readLong();
> >> > }*
> >> >
> >> > public static MyWritable read(DataInput in) throws IOException {
> >> > MyWritable w = new MyWritable();
> >> > w.readFields(in);
> >> > return w;
> >> > }
> >> > }
> >> >
> >> > so in readFields function we are simply saying read an int from the
> >> > datainput and put that in counter .. and then read a long and put that
> in
> >> > timestamp variable .. what doesnt makes sense to me is what is the
> format
> >> of
> >> > DataInput here .. what if there are multiple ints and multiple longs
> ..
> >> how
> >> > is the correct int gonna go in counter .. what if the data I am
> reading
> >> in
> >> > my mapper is a string line .. and I am using regular expression to
> parse
> >> the
> >> > tokens .. how do I specify which field goes where .. simply saying
> >> readInt
> >> > or readText .. how does that gets connected to the right stuff ..
> >> >
> >> > so in my case like I said I am reading from iis log files where my
> mapper
> >> > input is a log line which contains usual log information like data,
> time,
> >> > user, server, url, qry, responseTme etc .. I want to parse these into
> an
> >> > object that can be passed to reducer instead of dumping all that
> >> information
> >> > as text ..
> >> >
> >> > I would appreciate any help.
> >> > Thanks
> >> > Adeel
> >> >
> >>
> >>
> >>
> >> --
> >> Harsh J
> >> www.harshj.com
> >>
>


Re: custom writable classes

2011-02-02 Thread Vijay
Hadoop is not going to parse the line for you. Your mapper will take the
line, parse it and then turn it into your Writable so the next phase can
just work with your object.

Thanks,
Vijay
On Feb 2, 2011 9:51 AM, "Adeel Qureshi"  wrote:
> thanks for your reply .. so lets say my input files are formatted like
this
>
> each line looks like this
> DATE TIME SERVER USER URL QUERY PORT ...
>
> so to read this I would create a writable mapper
>
> public class MyMapper implements Writable {
> Date date
> long time
> String server
> String user
> String url
> String query
> int port
>
> readFields(){
> date = readDate(in); //not concerned with the actual date reading function
> time = readLong(in);
> server = readText(in);
> .
> }
> }
>
> but I still dont understand how is hadoop gonna know to parse my line into
> these tokens .. instead of map be using the whole line as one token
>
>
> On Wed, Feb 2, 2011 at 11:42 AM, Harsh J  wrote:
>
>> See it this way:
>>
>> readFields(...) provides a DataInput stream that reads bytes from a
>> binary stream, and write(...) provides a DataOutput stream that writes
>> bytes to a binary stream.
>>
>> Now your data-structure may be a complex one, perhaps an array of
>> items or a mapping of some, or just a set of different types of
>> objects. All you need to do is to think about how would you
>> _serialize_ your data structure into a binary stream, so that you may
>> _de-serialize_ it back from the same stream when required.
>>
>> About what goes where, I think looking up the definition of
>> 'serialization' will help. It is all in the ordering. If you wrote A
>> before B, you read A before B - simple as that.
>>
>> This, or you could use a neat serialization library like Apache Avro
>> (http://avro.apache.org) and solve it in a simpler way with a schema.
>> I'd recommend learning/using Avro for all
>> serialization/de-serialization needs. Especially for Hadoop use-cases.
>>
>> On Wed, Feb 2, 2011 at 10:51 PM, Adeel Qureshi 
>> wrote:
>> > I have been trying to understand how to write a simple custom writable
>> class
>> > and I find the documentation available very vague and unclear about
>> certain
>> > things. okay so here is the sample writable implementation in javadoc
of
>> > Writable interface
>> >
>> > public class MyWritable implements Writable {
>> > // Some data
>> > private int counter;
>> > private long timestamp;
>> >
>> > *public void write(DataOutput out) throws IOException {
>> > out.writeInt(counter);
>> > out.writeLong(timestamp);
>> > }*
>> >
>> > * public void readFields(DataInput in) throws IOException {
>> > counter = in.readInt();
>> > timestamp = in.readLong();
>> > }*
>> >
>> > public static MyWritable read(DataInput in) throws IOException {
>> > MyWritable w = new MyWritable();
>> > w.readFields(in);
>> > return w;
>> > }
>> > }
>> >
>> > so in readFields function we are simply saying read an int from the
>> > datainput and put that in counter .. and then read a long and put that
in
>> > timestamp variable .. what doesnt makes sense to me is what is the
format
>> of
>> > DataInput here .. what if there are multiple ints and multiple longs ..
>> how
>> > is the correct int gonna go in counter .. what if the data I am reading
>> in
>> > my mapper is a string line .. and I am using regular expression to
parse
>> the
>> > tokens .. how do I specify which field goes where .. simply saying
>> readInt
>> > or readText .. how does that gets connected to the right stuff ..
>> >
>> > so in my case like I said I am reading from iis log files where my
mapper
>> > input is a log line which contains usual log information like data,
time,
>> > user, server, url, qry, responseTme etc .. I want to parse these into
an
>> > object that can be passed to reducer instead of dumping all that
>> information
>> > as text ..
>> >
>> > I would appreciate any help.
>> > Thanks
>> > Adeel
>> >
>>
>>
>>
>> --
>> Harsh J
>> www.harshj.com
>>


Re: custom writable classes

2011-02-02 Thread Harsh J
Hadoop isn't going to magically parse your Text line into anything.
You'd have to tokenize it yourself and use the tokens to create your
custom writable within your map call (A constructor, or a set of
setter methods). The "Writable" is for serialization and
de-serialization of itself only.
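
Concretely, a minimal sketch of that tokenize-and-populate step. LogEntryWritable and its setters are hypothetical stand-ins for the custom writable discussed in this thread (one possible write/readFields sketch for it appears under the Harsh J reply further down); the field positions follow the DATE TIME SERVER USER URL QUERY PORT example.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogLineMapper extends Mapper<LongWritable, Text, Text, LogEntryWritable> {

    private final LogEntryWritable entry = new LogEntryWritable(); // hypothetical writable
    private final Text user = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Hadoop hands the mapper the raw line; tokenizing it is the mapper's job.
        String[] fields = line.toString().split(" ");
        if (fields.length < 7) {
            return; // skip lines that don't match DATE TIME SERVER USER URL QUERY PORT
        }
        user.set(fields[3]);
        entry.setServer(fields[2]);   // hypothetical setters on LogEntryWritable
        entry.setUrl(fields[4]);
        entry.setQuery(fields[5]);
        context.write(user, entry);   // the reducer receives the object, not delimited text
    }
}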

On Wed, Feb 2, 2011 at 11:20 PM, Adeel Qureshi  wrote:
> thanks for your reply .. so lets say my input files are formatted like this
>
> each line looks like this
> DATE TIME SERVER USER URL QUERY PORT ...
>
> so to read this I would create a writable mapper
>
> public class MyMapper implements Writable {
>  Date date
>  long time
>  String server
>  String user
>  String url
>  String query
>  int port
>
>  readFields(){
>  date = readDate(in); //not concerned with the actual date reading function
>  time = readLong(in);
>  server = readText(in);
>  .
>  }
> }
>
> but I still dont understand how is hadoop gonna know to parse my line into
> these tokens .. instead of map be using the whole line as one token
>
>
> On Wed, Feb 2, 2011 at 11:42 AM, Harsh J  wrote:
>
>> See it this way:
>>
>> readFields(...) provides a DataInput stream that reads bytes from a
>> binary stream, and write(...) provides a DataOutput stream that writes
>> bytes to a binary stream.
>>
>> Now your data-structure may be a complex one, perhaps an array of
>> items or a mapping of some, or just a set of different types of
>> objects. All you need to do is to think about how would you
>> _serialize_ your data structure into a binary stream, so that you may
>> _de-serialize_ it back from the same stream when required.
>>
>> About what goes where, I think looking up the definition of
>> 'serialization' will help. It is all in the ordering. If you wrote A
>> before B, you read A before B - simple as that.
>>
>> This, or you could use a neat serialization library like Apache Avro
>> (http://avro.apache.org) and solve it in a simpler way with a schema.
>> I'd recommend learning/using Avro for all
>> serialization/de-serialization needs. Especially for Hadoop use-cases.
>>
>> On Wed, Feb 2, 2011 at 10:51 PM, Adeel Qureshi 
>> wrote:
>> > I have been trying to understand how to write a simple custom writable
>> class
>> > and I find the documentation available very vague and unclear about
>> certain
>> > things. okay so here is the sample writable implementation in javadoc of
>> > Writable interface
>> >
>> > public class MyWritable implements Writable {
>> >       // Some data
>> >       private int counter;
>> >       private long timestamp;
>> >
>> >       *public void write(DataOutput out) throws IOException {
>> >         out.writeInt(counter);
>> >         out.writeLong(timestamp);
>> >       }*
>> >
>> >      * public void readFields(DataInput in) throws IOException {
>> >         counter = in.readInt();
>> >         timestamp = in.readLong();
>> >       }*
>> >
>> >       public static MyWritable read(DataInput in) throws IOException {
>> >         MyWritable w = new MyWritable();
>> >         w.readFields(in);
>> >         return w;
>> >       }
>> >     }
>> >
>> > so in readFields function we are simply saying read an int from the
>> > datainput and put that in counter .. and then read a long and put that in
>> > timestamp variable .. what doesnt makes sense to me is what is the format
>> of
>> > DataInput here .. what if there are multiple ints and multiple longs ..
>> how
>> > is the correct int gonna go in counter .. what if the data I am reading
>> in
>> > my mapper is a string line .. and I am using regular expression to parse
>> the
>> > tokens .. how do I specify which field goes where .. simply saying
>> readInt
>> > or readText .. how does that gets connected to the right stuff ..
>> >
>> > so in my case like I said I am reading from iis log files where my mapper
>> > input is a log line which contains usual log information like data, time,
>> > user, server, url, qry, responseTme etc .. I want to parse these into an
>> > object that can be passed to reducer instead of dumping all that
>> information
>> > as text ..
>> >
>> > I would appreciate any help.
>> > Thanks
>> > Adeel
>> >
>>
>>
>>
>> --
>> Harsh J
>> www.harshj.com
>>
>



-- 
Harsh J
www.harshj.com


Re: custom writable classes

2011-02-02 Thread Adeel Qureshi
thanks for your reply .. so lets say my input files are formatted like this

each line looks like this
DATE TIME SERVER USER URL QUERY PORT ...

so to read this I would create a writable mapper

public class MyMapper implements Writable {
 Date date
 long time
 String server
 String user
 String url
 String query
 int port

 readFields(){
  date = readDate(in); //not concerned with the actual date reading function
  time = readLong(in);
  server = readText(in);
  .
 }
}

but I still dont understand how is hadoop gonna know to parse my line into
these tokens .. instead of map be using the whole line as one token


On Wed, Feb 2, 2011 at 11:42 AM, Harsh J  wrote:

> See it this way:
>
> readFields(...) provides a DataInput stream that reads bytes from a
> binary stream, and write(...) provides a DataOutput stream that writes
> bytes to a binary stream.
>
> Now your data-structure may be a complex one, perhaps an array of
> items or a mapping of some, or just a set of different types of
> objects. All you need to do is to think about how would you
> _serialize_ your data structure into a binary stream, so that you may
> _de-serialize_ it back from the same stream when required.
>
> About what goes where, I think looking up the definition of
> 'serialization' will help. It is all in the ordering. If you wrote A
> before B, you read A before B - simple as that.
>
> This, or you could use a neat serialization library like Apache Avro
> (http://avro.apache.org) and solve it in a simpler way with a schema.
> I'd recommend learning/using Avro for all
> serialization/de-serialization needs. Especially for Hadoop use-cases.
>
> On Wed, Feb 2, 2011 at 10:51 PM, Adeel Qureshi 
> wrote:
> > I have been trying to understand how to write a simple custom writable
> class
> > and I find the documentation available very vague and unclear about
> certain
> > things. okay so here is the sample writable implementation in javadoc of
> > Writable interface
> >
> > public class MyWritable implements Writable {
> >   // Some data
> >   private int counter;
> >   private long timestamp;
> >
> >   *public void write(DataOutput out) throws IOException {
> > out.writeInt(counter);
> > out.writeLong(timestamp);
> >   }*
> >
> >  * public void readFields(DataInput in) throws IOException {
> > counter = in.readInt();
> > timestamp = in.readLong();
> >   }*
> >
> >   public static MyWritable read(DataInput in) throws IOException {
> > MyWritable w = new MyWritable();
> > w.readFields(in);
> > return w;
> >   }
> > }
> >
> > so in readFields function we are simply saying read an int from the
> > datainput and put that in counter .. and then read a long and put that in
> > timestamp variable .. what doesnt makes sense to me is what is the format
> of
> > DataInput here .. what if there are multiple ints and multiple longs ..
> how
> > is the correct int gonna go in counter .. what if the data I am reading
> in
> > my mapper is a string line .. and I am using regular expression to parse
> the
> > tokens .. how do I specify which field goes where .. simply saying
> readInt
> > or readText .. how does that gets connected to the right stuff ..
> >
> > so in my case like I said I am reading from iis log files where my mapper
> > input is a log line which contains usual log information like data, time,
> > user, server, url, qry, responseTme etc .. I want to parse these into an
> > object that can be passed to reducer instead of dumping all that
> information
> > as text ..
> >
> > I would appreciate any help.
> > Thanks
> > Adeel
> >
>
>
>
> --
> Harsh J
> www.harshj.com
>


Re: custom writable classes

2011-02-02 Thread Harsh J
See it this way:

readFields(...) provides a DataInput stream that reads bytes from a
binary stream, and write(...) provides a DataOutput stream that writes
bytes to a binary stream.

Now your data-structure may be a complex one, perhaps an array of
items or a mapping of some, or just a set of different types of
objects. All you need to do is to think about how would you
_serialize_ your data structure into a binary stream, so that you may
_de-serialize_ it back from the same stream when required.

About what goes where, I think looking up the definition of
'serialization' will help. It is all in the ordering. If you wrote A
before B, you read A before B - simple as that.

This, or you could use a neat serialization library like Apache Avro
(http://avro.apache.org) and solve it in a simpler way with a schema.
I'd recommend learning/using Avro for all
serialization/de-serialization needs. Especially for Hadoop use-cases.
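
To make the ordering rule concrete, here is a minimal sketch of a hypothetical LogEntryWritable (the class and field names are assumptions, matching the log-line example in this thread): the fields are written in a fixed order and read back in exactly that order, and a variable-length "array of items" would simply be preceded by its size.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class LogEntryWritable implements Writable {

    private String server = "";
    private String url = "";
    private String query = "";

    public void setServer(String server) { this.server = server; }
    public void setUrl(String url)       { this.url = url; }
    public void setQuery(String query)   { this.query = query; }

    public void write(DataOutput out) throws IOException {
        // The order chosen here defines the binary layout.
        out.writeUTF(server);
        out.writeUTF(url);
        out.writeUTF(query);
        // A variable-length field (the "array of items" case) would be
        // handled the same way: write its size first, then each element.
    }

    public void readFields(DataInput in) throws IOException {
        // Read back in exactly the order write() used.
        server = in.readUTF();
        url = in.readUTF();
        query = in.readUTF();
    }
}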

On Wed, Feb 2, 2011 at 10:51 PM, Adeel Qureshi  wrote:
> I have been trying to understand how to write a simple custom writable class
> and I find the documentation available very vague and unclear about certain
> things. okay so here is the sample writable implementation in javadoc of
> Writable interface
>
> public class MyWritable implements Writable {
>       // Some data
>       private int counter;
>       private long timestamp;
>
>       *public void write(DataOutput out) throws IOException {
>         out.writeInt(counter);
>         out.writeLong(timestamp);
>       }*
>
>      * public void readFields(DataInput in) throws IOException {
>         counter = in.readInt();
>         timestamp = in.readLong();
>       }*
>
>       public static MyWritable read(DataInput in) throws IOException {
>         MyWritable w = new MyWritable();
>         w.readFields(in);
>         return w;
>       }
>     }
>
> so in readFields function we are simply saying read an int from the
> datainput and put that in counter .. and then read a long and put that in
> timestamp variable .. what doesnt makes sense to me is what is the format of
> DataInput here .. what if there are multiple ints and multiple longs .. how
> is the correct int gonna go in counter .. what if the data I am reading in
> my mapper is a string line .. and I am using regular expression to parse the
> tokens .. how do I specify which field goes where .. simply saying readInt
> or readText .. how does that gets connected to the right stuff ..
>
> so in my case like I said I am reading from iis log files where my mapper
> input is a log line which contains usual log information like data, time,
> user, server, url, qry, responseTme etc .. I want to parse these into an
> object that can be passed to reducer instead of dumping all that information
> as text ..
>
> I would appreciate any help.
> Thanks
> Adeel
>



-- 
Harsh J
www.harshj.com


custom writable classes

2011-02-02 Thread adeelmahmood

I have been trying to understand how to write a simple custom writable class
and I find the documentation available very vague and unclear about certain
things. okay so here is the sample writable implementation in javadoc of
Writable interface

public class MyWritable implements Writable {
   // Some data
   private int counter;
   private long timestamp;
  
   public void write(DataOutput out) throws IOException {
 out.writeInt(counter);
 out.writeLong(timestamp);
   }
  
   public void readFields(DataInput in) throws IOException {
 counter = in.readInt();
 timestamp = in.readLong();
   }
  
   public static MyWritable read(DataInput in) throws IOException {
 MyWritable w = new MyWritable();
 w.readFields(in);
 return w;
   }
 }

so in readFields function we are simply saying read an int from the
datainput and put that in counter .. and then read a long and put that in
timestamp variable .. what doesnt makes sense to me is what is the format of
DataInput here .. what if there are multiple ints and multiple longs .. how
is the correct int gonna go in counter .. what if the data I am reading in
my mapper is a string line .. and I am using regular expression to parse the
tokens .. how do I specify which field goes where .. simply saying readInt
or readText .. how does that gets connected to the right stuff ..

so in my case like I said I am reading from iis log files where my mapper
input is a log line which contains usual log information like data, time,
user, server, url, qry, responseTme etc .. I want to parse these into an
object that can be passed to reducer instead of dumping all that information
as text ..

I would appreciate any help.
Thanks
-- 
View this message in context: 
http://old.nabble.com/custom-writable-classes-tp30828079p30828079.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



custom writable classes

2011-02-02 Thread Adeel Qureshi
I have been trying to understand how to write a simple custom writable class
and I find the documentation available very vague and unclear about certain
things. okay so here is the sample writable implementation in javadoc of
Writable interface

public class MyWritable implements Writable {
   // Some data
   private int counter;
   private long timestamp;

   *public void write(DataOutput out) throws IOException {
 out.writeInt(counter);
 out.writeLong(timestamp);
   }*

  * public void readFields(DataInput in) throws IOException {
 counter = in.readInt();
 timestamp = in.readLong();
   }*

   public static MyWritable read(DataInput in) throws IOException {
 MyWritable w = new MyWritable();
 w.readFields(in);
 return w;
   }
 }

so in readFields function we are simply saying read an int from the
datainput and put that in counter .. and then read a long and put that in
timestamp variable .. what doesnt makes sense to me is what is the format of
DataInput here .. what if there are multiple ints and multiple longs .. how
is the correct int gonna go in counter .. what if the data I am reading in
my mapper is a string line .. and I am using regular expression to parse the
tokens .. how do I specify which field goes where .. simply saying readInt
or readText .. how does that gets connected to the right stuff ..

so in my case like I said I am reading from iis log files where my mapper
input is a log line which contains usual log information like data, time,
user, server, url, qry, responseTme etc .. I want to parse these into an
object that can be passed to reducer instead of dumping all that information
as text ..

I would appreciate any help.
Thanks
Adeel


Re: MRUnit and Herriot

2011-02-02 Thread Owen O'Malley
Please keep user questions off of the general list and use the user lists instead.
This is defined here.

MRUnit is for testing user's MapReduce applications. Herriot is for testing
the framework in the presence of failures.

-- Owen

On Wed, Feb 2, 2011 at 5:44 AM, Edson Ramiro  wrote:

> Hi all,
>
> Plz, could you explain me the difference between MRUnit and Herriot?
>
> I've read the documentation of both and they seem very similar to me.
>
> Is Herriot an evolution of MRUnit?
>
> What can Herriot do that MRUnit can't?
>
> Thanks in Advance
>
> --
> Edson Ramiro Lucas Filho
> {skype, twitter, gtalk}: erlfilho
> http://www.inf.ufpr.br/erlf07/
>


Best way to Merge small XML files

2011-02-02 Thread Shuja Rehman
Hi Folks,

I have hundreds of small XML files coming in each hour. The size varies
from 5 MB to 15 MB. As Hadoop does not work well with small files, I want
to merge them. What is the best option to merge these XML files?



-- 
Regards
Shuja-ur-Rehman Baig



Strange byte [] size conflict

2011-02-02 Thread Matthew John
Hi all,

I have a BytesWritable key that comes to the mapper.

If I call key.getLength(), it returns 32.

Then I tried creating a new byte[] array, initializing its size to 32
(byte[] keybytes = new byte[32];),

and I tried assigning: keybytes = key.getBytes();

Now keybytes.length (which should return 32) is returning 48!

I don't understand why this is happening! Please help me with this.

Thanks,
Matthew
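
The behaviour described above is consistent with BytesWritable keeping a padded backing buffer: getBytes() returns the whole internal array, whose capacity can grow beyond the valid length (48 for 32 valid bytes matches a grow-by-half policy), while getLength() reports how many leading bytes are actually valid. A minimal sketch of copying just the valid prefix:

import java.util.Arrays;
import org.apache.hadoop.io.BytesWritable;

public class BytesWritableLengthDemo {

    public static void main(String[] args) {
        BytesWritable key = new BytesWritable();
        key.set(new byte[32], 0, 32);               // 32 valid bytes

        byte[] backing = key.getBytes();            // possibly padded internal buffer
        byte[] keybytes = Arrays.copyOf(backing, key.getLength()); // exactly 32 bytes

        System.out.println("getLength()         = " + key.getLength());
        System.out.println("getBytes().length   = " + backing.length);
        System.out.println("copied array length = " + keybytes.length);
    }
}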