OK, so taking your position in the file as a byte offset and dividing it by a known fixed record length (which is what you meant by "structured") will give you a line number.
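To make the fixed-length case concrete, here is a minimal sketch of the arithmetic. It assumes every record occupies exactly the same number of bytes, line terminator included; the function name and the 0-based numbering are illustrative choices, not anything Hadoop provides.

```python
def line_number_from_offset(offset: int, record_length: int) -> int:
    """Return the 0-based line number of the record starting at byte `offset`.

    Assumption: every record is exactly `record_length` bytes long,
    including its line terminator.
    """
    if record_length <= 0:
        raise ValueError("record_length must be positive")
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if offset % record_length != 0:
        # The offset is not on a record boundary, so the fixed-length
        # assumption does not hold for this input.
        raise ValueError("offset does not fall on a record boundary")
    return offset // record_length


if __name__ == "__main__":
    # Records of 20 bytes each: byte offset 60 is the start of the fourth record.
    print(line_number_from_offset(60, 20))
```

With 20-byte records, offset 60 gives line 3 when counting from zero. As soon as record lengths vary, this division no longer works.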
But let's look at the question from a more practical and wider perspective. In most applications where you have a single record per line, you will not have a fixed-length record format, so you really don't have a good way to calculate your line number from your position in the file.

Let's also look at how important a line number is in practical use. Sort of like row_id in a partitioned table, the line number loses its meaning. If the line number had specific meaning and the application ended its records with a '\n' (or CR LF), then an alternative would be to add a field to each record that contained the line number.

HTH

-Mike

PS. Wouldn't you call a record in XML structured? Yet of an unknown length? ;-)

(Sorry, I haven't had my first cup of coffee yet. :-) )

> From: am...@yahoo-inc.com
> To: common-user@hadoop.apache.org
> Date: Tue, 6 Apr 2010 12:14:56 +0530
> Subject: Re: Get Line Number from InputFormat
>
> Hi,
> If your records are structured / of equal size, then getting the line number
> is straightforward.
> If not, you'll need to construct your own sequence of numbers; someone's been
> kind enough to publish how on his blog:
>
> http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-parallel-program.html
>
> Amogh
>
>
> On 4/5/10 7:59 PM, "Michael Segel" <michael_se...@hotmail.com> wrote:
>
> > Date: Mon, 5 Apr 2010 14:57:09 +0100
> > From: lamfeeli...@gmail.com
> > To: common-user@hadoop.apache.org
> > Subject: Get Line Number from InputFormat
> >
> > Dear all,
> > TextInputFormat sends the <Offset, Line> into the Mapper; however, the
> > offset is sometimes meaningless and confusing. Is it possible to have an
> > InputFormat which outputs <Line NO., line> into the mapper?
> >
> > Thanks a lot.
> >
> > Song
>
> Song,
>
> I'm not sure what you want is realistic or even worthwhile.
>
> You have a file and it's split into chunks of 64MB (default) or something
> larger based on your cloud settings.
> You have a map job that starts from a specific point in the file, but that
> does not mean that it's starting at a specific line, or that Hadoop will know
> which line in the file. (Your records are not always going to be based on the
> end of a line, or one line per record.)
>
> Does that make sense?
> Offset has more meaning than an arbitrary Line NO.
>
> -Mike
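
For the alternative mentioned at the top of this reply (adding a field that contains the line number), a rough sketch might look like the following. It assumes a single sequential pass over the file before it is loaded, one record per line, and a tab as the field separator; the function name `add_line_numbers` is hypothetical.

```python
from typing import Iterable, Iterator


def add_line_numbers(lines: Iterable[str], sep: str = "\t") -> Iterator[str]:
    """Prefix each record with its 1-based line number as an extra field.

    Assumption: `lines` yields one record per line, in file order, and
    `sep` does not clash with the record's own field separator.
    """
    for number, line in enumerate(lines, start=1):
        # Strip the terminator ('\n' or CR LF) so the caller decides
        # how to terminate the rewritten record.
        text = line.rstrip("\r\n")
        yield f"{number}{sep}{text}"


if __name__ == "__main__":
    sample = ["first record\n", "second record\r\n"]
    for record in add_line_numbers(sample):
        print(record)
```

Because the number travels inside the record, it stays correct no matter how the file is later split across map tasks.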