Re: A couple of Questions on InputFormat

Harsh J Mon, 23 Sep 2013 04:32:38 -0700

Hi,

(I'm assuming 1.0~ MR here)

On Sun, Sep 22, 2013 at 1:00 AM, Steve Lewis <lordjoe2...@gmail.com> wrote:
> Classes implementing InputFormat implement
>  public List<InputSplit> getSplits(JobContext job), which returns a List of
> InputSplits. For FileInputFormat the splits have a Path, a start and an end.
>
> 1) When is this method called, on which JVM on which machine, and is it
> called only once?

Called only on the client, i.e. in your "hadoop jar" JVM, and only once.

> 2) Does the number of map tasks correspond to the number of splits returned by
> getSplits?

Yes, number of split objects == number of mappers.
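
To make the one-split-per-mapper relationship concrete, here is a minimal
plain-Java sketch with no Hadoop dependency. The class SplitCountSketch, the
Range type and the splits() helper are hypothetical names for illustration
only; the real FileInputFormat also takes block size and min/max split
settings into account, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitCountSketch {

    // Hypothetical stand-in for a file split: the byte range [start, start + length).
    static final class Range {
        final long start;
        final long length;

        Range(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    // Cut a file of fileLength bytes into consecutive ranges of at most splitSize bytes.
    static List<Range> splits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Range> out = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            out.add(new Range(start, Math.min(splitSize, fileLength - start)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Each element of the list would be handed to exactly one map task.
        List<Range> s = splits(250L, 100L);
        System.out.println(s.size() + " splits, so " + s.size() + " map tasks");
    }
}
```

With a 250-byte file and a 100-byte split size this yields three ranges
(100, 100 and 50 bytes), hence three map tasks.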

> 3) InputFormat implements a method
>  RecordReader<K,V> createRecordReader(InputSplit split,TaskAttemptContext
> context ). Is this executed within the JVM of the Mapper on the slave
> machine, and does the RecordReader run within that JVM?

RecordReaders are not created in the client-side JVM. RecordReaders
are created in the Map task JVMs, and run inside them.

> 4) The default RecordReaders read a file from the start position to the end
> position emitting values in the order read. With such a reader, assume it is
> reading lines of text, is it reasonable to assume that the values the mapper
> received are in the same order they were found in a file? Would it, for
> example, be possible for WordCount to see a word that was hyphen-
> ated at the end of one line and append the first word of the next line it
> sees (ignoring the case where the word is at the end of a split)

If you speak of the LineRecordReader, each map() call simply receives
one line, i.e. the text up to the next \n, and within a single split the
lines arrive in file order. It is not language-aware, so it does not
understand the meaning of hyphens, etc.

You can implement a custom reader to do this, however. There should be
no problems so long as your logic ensures that no record is read twice,
or dropped, across multiple maps.
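
A minimal sketch of that boundary logic, in plain Java over an in-memory
String rather than a real RecordReader. HyphenJoinSketch, readSplit() and
the helpers are hypothetical names, and positions are char indices of
ASCII text, not file byte offsets. The assumed rule is that a record
belongs to the split in which its first line starts: a reader skips a
leading partial line and any leading continuation lines, and may read
past its own end to finish a hyphenated record.

```java
import java.util.ArrayList;
import java.util.List;

public class HyphenJoinSketch {

    // Index just past the next '\n' at or after 'from', or text.length() if none.
    static int nextLineStart(String text, int from) {
        int nl = text.indexOf('\n', from);
        return nl < 0 ? text.length() : nl + 1;
    }

    // True when the line starting at 'pos' follows a line that ended with a hyphen.
    static boolean isContinuation(String text, int pos) {
        return pos >= 2 && text.charAt(pos - 1) == '\n' && text.charAt(pos - 2) == '-';
    }

    // Records whose first line starts inside [start, end), with hyphenated lines joined.
    static List<String> readSplit(String text, int start, int end) {
        // Back up one char so a line starting exactly at 'start' is kept;
        // a partial first line is left to the previous split's reader.
        int pos = (start == 0) ? 0 : nextLineStart(text, start - 1);
        // Continuation lines were already consumed by the previous split's reader.
        while (pos < text.length() && isContinuation(text, pos)) {
            pos = nextLineStart(text, pos);
        }
        List<String> records = new ArrayList<>();
        while (pos < end && pos < text.length()) {
            StringBuilder record = new StringBuilder();
            boolean more = true;
            while (more && pos < text.length()) {
                int next = nextLineStart(text, pos);
                int lineEnd = (text.charAt(next - 1) == '\n') ? next - 1 : next;
                String line = text.substring(pos, lineEnd);
                pos = next;
                // Keep joining while the line ends in '-' and another line follows,
                // even if that next line lies beyond 'end'.
                more = line.endsWith("-") && pos < text.length();
                record.append(more ? line.substring(0, line.length() - 1) : line);
            }
            records.add(record.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "a hyphen-\nated word\nnext line\n";
        System.out.println(readSplit(text, 0, 12));
        System.out.println(readSplit(text, 12, text.length()));
    }
}
```

Wherever the boundary between the two splits falls, the first reader
emits the joined record and the second one starts at "next line", so
nothing is read twice and nothing is lost.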

-- 
Harsh J
