Thanks, Jay, it really helps.

2014-02-28 10:32 GMT+08:00 Jay Vyas <jayunit...@gmail.com>:
> -- method 1 --
>
> You could, I think, just extend FileInputFormat and override isSplitable
> to return false. Then each file won't be broken up into separate splits,
> and is processed as a whole by one mapper. This is probably the easiest
> thing to do, but if you have huge files, it won't perform very well.
>
> -- method 2 --
>
> You can use Harsh's suggestion (thanks for that idea, I didn't know it).
>
> 1) In the setup method of a mapper, you can get the file path using
>
> ((FileSplit) context.getInputSplit()).getPath();
>
> 2) Then, in the mapper's "setup" method, you should be able to open a file
> input stream and call "seek(0)" to read the file header, as Harsh says.
>
> 3) When you process the header in the setup method, you can store the
> results in a field of the mapper, and the map method can read from that
> field and proceed.
>
>
> On Thu, Feb 27, 2014 at 9:09 PM, Fengyun RAO <raofeng...@gmail.com> wrote:
>
>> thanks, Harsh.
>>
>> could you specify more detail, or give some links or an example where I
>> can start?
>>
>>
>> 2014-02-27 22:17 GMT+08:00 Harsh J <ha...@cloudera.com>:
>>
>>> A mapper's record reader implementation need not be restricted to
>>> strictly only the input split boundary. It is a loose relationship -
>>> you can always seek(0), read the lines you need to prepare, then
>>> seek(offset) and continue reading.
>>>
>>> Apache Avro (http://avro.apache.org) has a similar format - header
>>> contains the schema a reader needs to work.
>>>
>>> On Thu, Feb 27, 2014 at 1:59 AM, Fengyun RAO <raofeng...@gmail.com>
>>> wrote:
>>> > Below is a fake sample of a Microsoft IIS log:
>>> > #Software: Microsoft Internet Information Services 7.5
>>> > #Version: 1.0
>>> > #Date: 2013-07-04 20:00:00
>>> > #Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
>>> > 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 someuserAgent 200 0 0 390
>>> > 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 3.3.3.3 someuserAgent 200 0 0 390
>>> > 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 4.4.4.4 someuserAgent 200 0 0 390
>>> > ...
>>> >
>>> > The first four lines describe the file format, which is required to
>>> > parse each log line. This means the log file can NOT simply be split;
>>> > otherwise the second split would lose the "file format" information.
>>> >
>>> > How could each mapper get the first few lines in the file?
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com
>
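
For anyone finding this thread later, here is a minimal sketch of the "seek(0), read the header, then seek(offset)" idea from method 2. It is plain java.io, not Hadoop code: RandomAccessFile stands in for the stream you would get from FileSystem.open(path) in a real job, and the names HeaderAwareSplitReader, readFieldNames and readSplit are made up for illustration. It assumes an ASCII log whose header lines start with '#', as in the sample above, and it follows the usual line-reader convention that a split with a non-zero start skips its first (possibly partial) line and reads one line past its end.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: a stand-in for what a mapper's setup()/record reader could do.
class HeaderAwareSplitReader {

    /** Seeks to offset 0 and returns the column names listed after "#Fields:". */
    static List<String> readFieldNames(RandomAccessFile in) throws IOException {
        in.seek(0);
        String line;
        // Only the leading '#' lines are scanned; stop at the first data line.
        while ((line = in.readLine()) != null && line.startsWith("#")) {
            if (line.startsWith("#Fields:")) {
                String names = line.substring("#Fields:".length()).trim();
                return Arrays.asList(names.split("\\s+"));
            }
        }
        throw new IOException("no #Fields line found in header");
    }

    /** Parses the records that belong to the split [start, start + length]. */
    static List<Map<String, String>> readSplit(String path, long start, long length)
            throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            // Step 1: every split reads the header from the start of the file.
            List<String> fields = readFieldNames(in);

            // Step 2: jump to this split's own offset.
            long end = start + length;
            in.seek(start);
            if (start != 0) {
                in.readLine(); // skip the first line; the previous split reads it
            }

            // Step 3: read lines that begin at or before 'end'.
            List<Map<String, String>> records = new ArrayList<>();
            while (in.getFilePointer() <= end) {
                String line = in.readLine();
                if (line == null) {
                    break; // end of file
                }
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // header or blank line, not a record
                }
                String[] values = line.split(" ");
                Map<String, String> record = new LinkedHashMap<>();
                for (int i = 0; i < fields.size() && i < values.length; i++) {
                    record.put(fields.get(i), values[i]);
                }
                records.add(record);
            }
            return records;
        }
    }

    public static void main(String[] args) throws IOException {
        // Tiny made-up log, cut into two "splits" at an arbitrary byte offset.
        Path log = Files.createTempFile("iis-sample", ".log");
        String content = String.join("\n",
                "#Version: 1.0",
                "#Fields: date time c-ip",
                "2013-07-04 20:00:00 2.2.2.2",
                "2013-07-04 20:00:00 3.3.3.3",
                "");
        Files.write(log, content.getBytes(StandardCharsets.US_ASCII));
        long size = Files.size(log);
        long mid = size / 2;
        System.out.println(readSplit(log.toString(), 0, mid));
        System.out.println(readSplit(log.toString(), mid, size - mid));
        Files.delete(log);
    }
}
```

In a real mapper the header would be parsed once in setup() and kept in a field, and the framework's record reader would handle the split boundaries; the sketch only shows that a second split can still recover the field names by seeking back to the start of the file. RandomAccessFile.readLine is unbuffered, so this is for illustration rather than performance.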