Hi,

I realize this thread is old, but I need to raise the question
again now that I am working more with dfs. Can a line be broken
across adjacent blocks in dfs? Can lines be preserved at the block level?

-Kevin



On Wed, Jul 16, 2008 at 4:57 PM, Chris Douglas <[EMAIL PROTECTED]> wrote:
> InputFormats don't have a concept of "blocks"; each FileSplit contains a
> list of locations that advise the framework where it should prefer to
> schedule the map, i.e. on the node that contains most of the data. (In
> practice, IIRC this is the location of the first byte of the block,
> which may not actually contain the bulk of the data.) For LineRecordReader,
> this means that it will open a stream, seek to its start position, read
> (opening up a connection to the node that contains that block, with luck a
> local read) to the first record delimiter, then return lines as Text records
> to the map until the position at the beginning of a read is past the end
> of that split. In other words, the end of split A and the start of split B
> will likely fall in the middle of a record, so A will emit that record and
> B will start from the end of that record.
>
> I think it's fair to say that blocks and records are orthogonal abstractions
> to HDFS and map/reduce. -C
>
> On Jul 15, 2008, at 5:07 PM, Kevin wrote:
>
>> Hi,
>>
>> I am trying to parse text input with line-based information in a mapper,
>> and this has become an issue. I wonder whether lines are preserved or
>> broken when a file is cut into blocks by dfs. Also, it looks like,
>> although TextInputFormat breaks a file into line records, the
>> InputSplit passed to InputFormat may not preserve lines. If this is
>> the case, is it possible to restore the lines for mapper input, or do I
>> have to drop broken lines? Thank you.
>>
>> Best,
>> -Kevin
>
>
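
For anyone finding this thread later, the split-boundary rule Chris
describes can be illustrated with a small model. This is only a sketch in
Python, not Hadoop's LineRecordReader: the names make_splits and read_split
are made up for the illustration, and it assumes "\n" delimiters and an
in-memory byte string standing in for the DFS stream.

```python
import io
from typing import List, Tuple


def make_splits(length: int, size: int) -> List[Tuple[int, int]]:
    """Cut [0, length) into consecutive (start, end) ranges of at most `size` bytes."""
    return [(s, min(s + size, length)) for s in range(0, length, size)]


def read_split(data: bytes, start: int, end: int) -> List[bytes]:
    """Return the lines that the split [start, end) is responsible for."""
    stream = io.BytesIO(data)
    stream.seek(start)
    if start != 0:
        # Read to the first record delimiter and discard the result:
        # the previous split emits whatever line covers `start`.
        stream.readline()
    lines = []
    # Start a new read only while the position is not past the end of the
    # split, so the last line returned may run beyond `end`.
    while stream.tell() <= end:
        line = stream.readline()
        if not line:  # end of data
            break
        lines.append(line.rstrip(b"\n"))
    return lines


data = b"alpha\nbravo\ncharlie\ndelta\n"
for start, end in make_splits(len(data), 10):
    print((start, end), read_split(data, start, end))
```

With 10-byte splits, the first split returns alpha and bravo even though
bravo crosses byte 10, the second returns charlie and delta, and the third
returns nothing. Each line reaches exactly one reader, whole, wherever the
byte boundaries fall.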
