Hi, I guess this thread is old, but I need to raise the question again now that I am working more with DFS. Can a line be broken across adjacent blocks in DFS? Can lines be preserved at the block level?
-Kevin

On Wed, Jul 16, 2008 at 4:57 PM, Chris Douglas <[EMAIL PROTECTED]> wrote:
> InputFormats don't have a concept of "blocks"; each FileSplit contains a
> list of locations that advise the framework where it should prefer to
> schedule the map (i.e. on the node that contains most of the data (in
> practice, IIRC this is the location of the first byte of the block,
> which may not actually contain the bulk of the data)). For LineRecordReader,
> this means that it will open a stream, seek to its start position, read
> (opening up a connection to the node that contains that block, with luck a
> local read) to the first record delimiter, then return lines as Text records
> to the map until the end of that split precedes the start offset at the
> beginning of a read (i.e. the end of split A and the start of split B will
> likely be in the middle of a record, so A will emit that record and B will
> start from the end of that record).
>
> I think it's fair to say that blocks and records are orthogonal abstractions
> to HDFS and map/reduce. -C
>
> On Jul 15, 2008, at 5:07 PM, Kevin wrote:
>
>> Hi,
>>
>> I was trying to parse text input with line-based information in mapper
>> and this problem becomes an issue. I wonder if lines are preserved or
>> broken when a file is cut into blocks by dfs. Also, it looks that
>> although TextInputFormat breaks files into line records, the
>> InputSplit passed to InputFormat may not preserve lines. If this is
>> the case, is it possible to restore the lines for mapper input, or I
>> have to drop broken lines? Thank you.
>>
>> Best,
>> -Kevin
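
To make the LineRecordReader behaviour described in the quoted reply concrete, here is a minimal Python sketch of the idea. It is not Hadoop code: the function names `split_ranges` and `read_split` are made up for illustration, the file is just an in-memory byte string, and the assumed rule is that a line belongs to the split containing its first byte. Under those assumptions, a line that straddles a split boundary is emitted whole by the earlier split and skipped by the later one.

```python
import io


def split_ranges(size, split_size):
    """Cut [0, size) into consecutive (start, end) byte ranges (hypothetical helper)."""
    return [(s, min(s + split_size, size)) for s in range(0, size, split_size)]


def read_split(data, start, end):
    """Return the lines owned by the byte range [start, end) of data.

    Assumption for this sketch: a line is owned by the split that
    contains its first byte, so a reader may read past `end` to finish
    the last line it owns.
    """
    stream = io.BytesIO(data)
    pos = start
    if start != 0:
        # Back up one byte and discard everything through the next newline.
        # If the split starts mid-line, this drops the partial line (the
        # previous split emits it). If the split starts exactly at a line
        # start, this drops only the preceding "\n".
        stream.seek(start - 1)
        pos = start - 1 + len(stream.readline())
    lines = []
    while pos < end:
        line = stream.readline()  # may run past `end` to complete the line
        if not line:
            break  # end of data
        pos += len(line)
        lines.append(line.rstrip(b"\n"))
    return lines


if __name__ == "__main__":
    data = b"alpha\nbravo charlie\ndelta\necho\n"
    # Split boundaries at every 8 bytes fall in the middle of lines.
    for start, end in split_ranges(len(data), 8):
        print((start, end), read_split(data, start, end))
```

Running this shows that some splits emit a line extending beyond their own end offset and some emit nothing at all, yet every line appears exactly once across all splits, so nothing has to be dropped or stitched back together in the mapper.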