NullPointerException when trying to write mapper output

2013-10-24 Thread Marcelo Elias Del Valle
I am using Hadoop 1.0.3 on Amazon EMR. I have a map/reduce job configured
like this:

private static final String TEMP_PATH_PREFIX =
        System.getProperty("java.io.tmpdir") + "/dmp_processor_tmp";
...
private Job setupProcessorJobS3() throws IOException, DataGrinderException {
    String inputFiles = System.getProperty(DGConfig.INPUT_FILES);
    Job processorJob = new Job(getConf(), PROCESSOR_JOBNAME);
    processorJob.setJarByClass(DgRunner.class);
    processorJob.setMapperClass(EntityMapperS3.class);
    processorJob.setReducerClass(SelectorReducer.class);
    processorJob.setOutputKeyClass(Text.class);
    processorJob.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(processorJob, new Path(TEMP_PATH_PREFIX));
    processorJob.setOutputFormatClass(TextOutputFormat.class);
    processorJob.setInputFormatClass(NLineInputFormat.class);
    FileInputFormat.setInputPaths(processorJob, inputFiles);
    NLineInputFormat.setNumLinesPerSplit(processorJob, 1);
    return processorJob;
}
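
For reference, a driver for a job configured like this would typically look
roughly like the sketch below (assuming the usual Tool pattern; the real
launch code isn't included here):

// Hypothetical driver sketch: build the job above, submit it, and block
// until it finishes. Assumes the enclosing class extends Configured and
// implements Tool, which is what getConf() above suggests.
public int run(String[] args) throws Exception {
    Job processorJob = setupProcessorJobS3();
    return processorJob.waitForCompletion(true) ? 0 : 1;
}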

In my mapper class, I have:

private Text outkey = new Text();
private Text outvalue = new Text();
...
outkey.set(entity.getEntityId().toString());
outvalue.set(input.getId().toString());
printLog("context write");
context.write(outkey, outvalue);
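
For context, a mapper built around that snippet would look roughly like the
sketch below (the class name, input types and the values being written are
placeholders; the real EntityMapperS3 works on parsed entity/input objects):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch only: reuses the same two Text instances across calls,
// as in the snippet above, and writes one key/value pair per input record.
public class EntityMapperSketch extends Mapper<LongWritable, Text, Text, Text> {

    private final Text outkey = new Text();
    private final Text outvalue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // In the real job these come from entity.getEntityId() and input.getId().
        outkey.set(value.toString());
        outvalue.set(key.toString());
        context.write(outkey, outvalue);
    }
}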

This last line (`context.write(outkey, outvalue);`) causes the exception
below. Of course, neither `outkey` nor `outvalue` is null.

2013-10-24 05:48:48,422 INFO com.s1mbi0se.grinder.core.mapred.EntityMapperCassandra (main): Current Thread: Thread[main,5,main]Current timestamp: 1382593728422 context write
2013-10-24 05:48:48,422 ERROR com.s1mbi0se.grinder.core.mapred.EntityMapperCassandra (main): Error on entitymapper for input: 03a07858-4196-46dd-8a2c-23dd824d6e6e
java.lang.NullPointerException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1293)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1210)
    at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
    at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:264)
    at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:244)
    at org.apache.hadoop.io.Text.write(Text.java:281)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1077)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:698)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at com.s1mbi0se.grinder.core.mapred.EntityMapper.map(EntityMapper.java:78)
    at com.s1mbi0se.grinder.core.mapred.EntityMapperS3.map(EntityMapperS3.java:34)
    at com.s1mbi0se.grinder.core.mapred.EntityMapperS3.map(EntityMapperS3.java:14)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
2013-10-24 05:48:48,422 INFO com.s1mbi0se.grinder.core.mapred.EntityMapperS3 (main): Current Thread: Thread[main,5,main]Current timestamp: 1382593728422 Entity Mapper end

The first records in each task are processed fine. At some point during the
task, I start getting this exception over and over, and after that it doesn't
process a single record anymore for that task.

I tried to set `TEMP_PATH_PREFIX` to `"s3://mybucket/dmp_processor_tmp"`,
but the same thing happened.

Any idea why this is happening? What could be preventing Hadoop from writing
its output?


If we Open Source our platform, would it be interesting to you?

2013-02-20 Thread Marcelo Elias Del Valle
Hello All,

I’m sending this email because I think it may be of interest to Hadoop
users, as this project makes heavy use of the Hadoop platform.

We are strongly considering open-sourcing our DMP (Data Management Platform)
if it proves to be technically interesting to other developers and companies.

More details: http://www.s1mbi0se.com/s1mbi0se_DMP.html

All comments, questions and criticism are happening at HN:
http://news.ycombinator.com/item?id=5251780

Please feel free to send questions, comments and criticism... We will try to
reply to them all.

Regards,
Marcelo


Re: number of mapper tasks

2013-01-29 Thread Marcelo Elias Del Valle
Hello,

I have been able to make this work. I don't know why, but when the
input file is zipped (read as an input stream) it creates only 1 mapper.
However, when it's not zipped, it creates more mappers (running 3 instances
it created 4 mappers, and running 5 instances it created 8 mappers).
I really would like to know why this happens, and even with this number
of mappers, I would like to know why more aren't created. I was
reading part of the book "Hadoop: The Definitive Guide" (
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats)
which says:

"The JobClient calls the getSplits() method, passing the desired number of
map tasks as the numSplits argument. This number is treated as a hint, as
InputFormat implementations are free to return a different number of splits
to the number specified in numSplits. Having calculated the splits, the
client sends them to the jobtracker, which uses their storage locations to
schedule map tasks to process them on the tasktrackers. ..."
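
That passage describes the older mapred API; in the new mapreduce API there is
no numSplits hint at all, and the number of map tasks is simply the number of
InputSplits the InputFormat returns from getSplits(). A minimal sketch just to
illustrate where the split count comes from (the class name is made up):

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Made-up example class: logs how many splits (and therefore map tasks)
// the framework will get from this input format.
public class LoggingTextInputFormat extends TextInputFormat {
    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = super.getSplits(job);
        // One map task is scheduled per element of this list.
        System.out.println("getSplits() returned " + splits.size() + " splits");
        return splits;
    }
}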

I am not sure how to get more info.

Would you recommend trying to find the answer in the book, or should I read
the Hadoop source code directly?

Best regards,
Marcelo.


2013/1/29 Marcelo Elias Del Valle 

> I implemented my custom input format. Here is how I used it:
>
> https://github.com/mvallebr/CSVInputFormat/blob/master/src/test/java/org/apache/hadoop/mapreduce/lib/input/test/CSVTestRunner.java
>
> As you can see, I do:
> importerJob.setInputFormatClass(CSVNLineInputFormat.class);
>
> And here is the Input format and the linereader:
>
> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
>
> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVLineRecordReader.java
>
> In this input format, I completely ignore these other parameters and get
> the splits by the number of lines. The amount of lines per map can be
> controlled by the same parameter used in NLineInputFormat:
>
> public static final String LINES_PER_MAP =
> "mapreduce.input.lineinputformat.linespermap";
> However, it has really no effect on the number of maps.
>
>
>
> 2013/1/29 Vinod Kumar Vavilapalli 
>
>>
>> Regarding your original question, you can use the min and max split
>> settings to control the number of maps:
>> http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html.
>>  See #setMinInputSplitSize and #setMaxInputSplitSize. Or
>> use mapred.min.split.size directly.
>>
>> W.r.t your custom inputformat, are you sure your job is using this
>> InputFormat and not the default one?
>>
>>  HTH,
>> +Vinod Kumar Vavilapalli
>> Hortonworks Inc.
>> http://hortonworks.com/
>>
>> On Jan 28, 2013, at 12:56 PM, Marcelo Elias Del Valle wrote:
>>
>> Just to complement the last question, I have implemented the getSplits
>> method in my input format:
>>
>> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
>>
>> However, it still doesn't create more than 2 map tasks. Is there
>> something I could do about it to assure more map tasks are created?
>>
>> Thanks
>> Marcelo.
>>
>>
>> 2013/1/28 Marcelo Elias Del Valle 
>>
>>> Sorry for asking too many questions, but the answers are really
>>> helping.
>>>
>>>
>>> 2013/1/28 Harsh J 
>>>
>>>> This seems CPU-oriented. You probably want the NLineInputFormat? See
>>>>
>>>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>>>> .
>>>> This should let you spawn more maps as well, based on your N factor.
>>>>
>>>
>>> Indeed, CPU is my bottleneck. That's why I want more things in parallel.
>>> Actually, I wrote my own InputFormat, to be able to process multiline
>>> CSVs: https://github.com/mvallebr/CSVInputFormat
>>> I could change it to read several lines at a time, but would this alone
>>> allow more tasks running in parallel?
>>>
>>>
>>>> Not really - "Slots" are capacities, rather than split factors
>>>> themselves. You can have N slots always available, but your job has to
>>>> supply as many map tasks (based on its input/needs/etc.) to use them
>>>> up.
>>>>
>>>
>>> But how can I do that (supply map tasks) in my job? changing its code?
>>> hadoop config?

Re: number of mapper tasks

2013-01-29 Thread Marcelo Elias Del Valle
I implemented my custom input format. Here is how I used it:
https://github.com/mvallebr/CSVInputFormat/blob/master/src/test/java/org/apache/hadoop/mapreduce/lib/input/test/CSVTestRunner.java

As you can see, I do:
importerJob.setInputFormatClass(CSVNLineInputFormat.class);

And here are the input format and the line reader:
https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVLineRecordReader.java

In this input format, I completely ignore these other parameters and get
the splits by the number of lines. The number of lines per map can be
controlled by the same parameter used in NLineInputFormat:

public static final String LINES_PER_MAP =
"mapreduce.input.lineinputformat.linespermap";
However, it really has no effect on the number of maps.
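
For reference, setting that value on a job looks roughly like the sketch below
(the job name and the value 100 are just placeholders):

// Sketch only: both forms set the same property that NLineInputFormat (and
// CSVNLineInputFormat) read to decide how many lines go into each split.
Job job = new Job(new Configuration(), "csv-import");
job.setInputFormatClass(CSVNLineInputFormat.class);

// Option 1: set the property directly.
job.getConfiguration().setInt(
        "mapreduce.input.lineinputformat.linespermap", 100);

// Option 2: the NLineInputFormat helper writes the same property.
NLineInputFormat.setNumLinesPerSplit(job, 100);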



2013/1/29 Vinod Kumar Vavilapalli 

>
> Regarding your original question, you can use the min and max split
> settings to control the number of maps:
> http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html.
>  See #setMinInputSplitSize and #setMaxInputSplitSize. Or
> use mapred.min.split.size directly.
>
> W.r.t your custom inputformat, are you sure your job is using this
> InputFormat and not the default one?
>
> HTH,
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/
>
> On Jan 28, 2013, at 12:56 PM, Marcelo Elias Del Valle wrote:
>
> Just to complement the last question, I have implemented the getSplits
> method in my input format:
>
> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
>
> However, it still doesn't create more than 2 map tasks. Is there something
> I could do about it to assure more map tasks are created?
>
> Thanks
> Marcelo.
>
>
> 2013/1/28 Marcelo Elias Del Valle 
>
>> Sorry for asking too many questions, but the answers are really helping.
>>
>>
>> 2013/1/28 Harsh J 
>>
>>> This seems CPU-oriented. You probably want the NLineInputFormat? See
>>>
>>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>>> .
>>> This should let you spawn more maps as well, based on your N factor.
>>>
>>
>> Indeed, CPU is my bottleneck. That's why I want more things in parallel.
>> Actually, I wrote my own InputFormat, to be able to process multiline
>> CSVs: https://github.com/mvallebr/CSVInputFormat
>> I could change it to read several lines at a time, but would this alone
>> allow more tasks running in parallel?
>>
>>
>>> Not really - "Slots" are capacities, rather than split factors
>>> themselves. You can have N slots always available, but your job has to
>>> supply as many map tasks (based on its input/needs/etc.) to use them
>>> up.
>>>
>>
>> But how can I do that (supply map tasks) in my job? changing its code?
>> hadoop config?
>>
>>
>>> Unless your job sets the number of reducers to 0 manually, 1 default
>>> reducer is always run that waits to see if it has any outputs from
>>> maps. If it does not receive any outputs after maps have all
>>> completed, it dies out with behavior equivalent to a NOP.
>>>
>> Ok, I did job.setNumReduceTasks(0); , guess this will solve this part,
>> thanks!
>>
>>
>> --
>> Marcelo Elias Del Valle
>> http://mvalle.com - @mvallebr
>>
>
>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>
>
>


-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: number of mapper tasks

2013-01-28 Thread Marcelo Elias Del Valle
Just to complement the last question, I have implemented the getSplits
method in my input format:
https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java

However, it still doesn't create more than 2 map tasks. Is there something
I could do about it to ensure more map tasks are created?

Thanks
Marcelo.


2013/1/28 Marcelo Elias Del Valle 

> Sorry for asking too many questions, but the answers are really helping.
>
>
> 2013/1/28 Harsh J 
>
>> This seems CPU-oriented. You probably want the NLineInputFormat? See
>>
>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>> .
>> This should let you spawn more maps as well, based on your N factor.
>>
>
> Indeed, CPU is my bottleneck. That's why I want more things in parallel.
> Actually, I wrote my own InputFormat, to be able to process multiline
> CSVs: https://github.com/mvallebr/CSVInputFormat
> I could change it to read several lines at a time, but would this alone
> allow more tasks running in parallel?
>
>
>> Not really - "Slots" are capacities, rather than split factors
>> themselves. You can have N slots always available, but your job has to
>> supply as many map tasks (based on its input/needs/etc.) to use them
>> up.
>>
>
> But how can I do that (supply map tasks) in my job? changing its code?
> hadoop config?
>
>
>> Unless your job sets the number of reducers to 0 manually, 1 default
>> reducer is always run that waits to see if it has any outputs from
>> maps. If it does not receive any outputs after maps have all
>> completed, it dies out with behavior equivalent to a NOP.
>>
> Ok, I did job.setNumReduceTasks(0); , guess this will solve this part,
> thanks!
>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>



-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: number of mapper tasks

2013-01-28 Thread Marcelo Elias Del Valle
Sorry for asking too many questions, but the answers are really helping.


2013/1/28 Harsh J 

> This seems CPU-oriented. You probably want the NLineInputFormat? See
>
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
> .
> This should let you spawn more maps as well, based on your N factor.
>

Indeed, CPU is my bottleneck. That's why I want more things running in
parallel. Actually, I wrote my own InputFormat to be able to process
multi-line CSVs: https://github.com/mvallebr/CSVInputFormat
I could change it to read several lines at a time, but would this alone
allow more tasks to run in parallel?


> Not really - "Slots" are capacities, rather than split factors
> themselves. You can have N slots always available, but your job has to
> supply as many map tasks (based on its input/needs/etc.) to use them
> up.
>

But how can I do that (supply map tasks) in my job? changing its code?
hadoop config?


> Unless your job sets the number of reducers to 0 manually, 1 default
> reducer is always run that waits to see if it has any outputs from
> maps. If it does not receive any outputs after maps have all
> completed, it dies out with behavior equivalent to a NOP.
>
OK, I did job.setNumReduceTasks(0); I guess this will solve this part,
thanks!
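
For the record, the map-only setup looks roughly like the sketch below (class
and path names are placeholders):

// Sketch of a map-only job: with zero reducers, map output goes straight
// through the output format and no reduce phase is scheduled at all.
Job job = new Job(new Configuration(), "csv-import");
job.setJarByClass(MyDriver.class);        // placeholder driver class
job.setMapperClass(MyMapper.class);       // placeholder mapper class
job.setNumReduceTasks(0);                 // map-only
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(job, new Path("input"));    // placeholder path
FileOutputFormat.setOutputPath(job, new Path("output"));  // placeholder path
job.waitForCompletion(true);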

-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: number of mapper tasks

2013-01-28 Thread Marcelo Elias Del Valle
Hello Harsh,

First of all, thanks for the answer!


2013/1/28 Harsh J 
>
> So depending on your implementation of the job here, you may or may
> not see it act in effect. Hope this helps.
>

Is there anything I can do in my job, my code or in my InputFormat so that
Hadoop would choose to run more mappers? My text file has 10 million lines,
and each mapper task processes 1 line at a time, very quickly. I would like
to have 40 threads or even more processing those lines in parallel.


> > When I run my job with just 1 instance, I see it only creates 1
> mapper.
> > When I run my job with 5 instances (1 master and 4 cores), I can see
> only 2
> > mapper slots are used and 6 stay open.
>
> Perhaps the job itself launched with 2 total map tasks? You can check
> this on the JobTracker UI or whatever EMR offers as a job viewer.
>

I am trying to figure this out. Here is what I have from EMR:
http://mvalle.com/downloads/hadoop_monitor.png
I will try to get their support to help me understand this, but I didn't
understand what you said about the job being launched with 2 total map
tasks... if I have 8 slots, shouldn't all of them always be filled?


>
> This is a typical waiting reduce task log, what are you asking here
> specifically?
>

I have no reduce tasks. My map does the job without putting anything in the
output. Is this happening because the reduce tasks receive nothing as input?

-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr