I had that problem/question some time ago, too.

The quick fix is to just write the line number into each line of the input itself. Go for it.

However, for another distributed processing system we worked out a solution that did the following: read each partition, count its lines, broadcast a map "partition -> lineCount", then re-read the data and attach the line numbers. This is basically how the distributed zipWithIndex works, which is available in Flink too.
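To make the two passes concrete, here is a minimal sketch in plain Python (no Flink involved). The partitions are modelled as in-memory lists, and the function names count_lines, start_offsets and attach_line_numbers are made up for this illustration. It assumes both passes see the same partitions in the same order.

```python
from itertools import accumulate


def count_lines(partitions):
    """Pass 1: count the lines in each partition."""
    return [len(p) for p in partitions]


def start_offsets(counts):
    """Global index of the first line of each partition.

    This plays the role of the broadcast "partition -> lineCount" map:
    partition k starts at the sum of the counts of partitions 0..k-1.
    """
    return [0] + list(accumulate(counts))[:-1]


def attach_line_numbers(partitions):
    """Pass 2: re-read each partition and attach global line numbers.

    Assumption: the partitions have the same boundaries and the same
    internal order as in pass 1; otherwise the numbers are wrong.
    """
    offsets = start_offsets(count_lines(partitions))
    return [
        [(offset + i, line) for i, line in enumerate(p)]
        for p, offset in zip(partitions, offsets)
    ]


# Hypothetical input: a 5-row matrix split into three partitions.
partitions = [["1 2 3", "4 5 6"], ["7 8 9"], ["10 11 12", "13 14 15"]]
numbered = attach_line_numbers(partitions)
```

Each partition only needs the counts of the partitions before it, so the second pass can again run in parallel.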

But:

That only works if both mapPartitions passes read the data in the same order and if both use partitions with the same boundaries. I don't know whether you can get that guarantee in Flink without a range-partition and sortPartition on the byte offset. Doing that would work (I think), but it would add significant overhead that can be avoided completely by adding the line numbers to the lines in the first place.
I think it's just not worth it.

On 4 February 2016 00:56:43 CET, Fabian Hueske <fhue...@gmail.com> wrote:

   Hi Anastasiia,

   this is difficult because the input is usually read in parallel,
   i.e., an input file is split into several blocks which are
   independently read and processed by different threads (possibly on
   different machines). So it is difficult to have a sequential row
   number.

   If all rows have the same length (number of bytes), you could
   compute the row number from the byte offset. If this is not given,
   you can only read the input sequentially.
   Flink does not provide InputFormats for this. So you would need to
   implement a custom InputFormat.

   You can also keep track of the number of elements that you processed
   in a Mapper, but this is probably not what you are looking for.

   Best,
   Fabian

   2016-02-04 0:37 GMT+01:00 Анастасія Баша <nastja.ba...@mail.ru
   <mailto:nastja.ba...@mail.ru>>:

       Is there a way to get the current line number (or generally the
       number of element currently being processed) inside a mapper?
       The example is a matrix you read line by line from the file
       and need both the row and the column numbers. Column number is
       easy to get, but how to know the row number?
       Thanks a lot in advance,
       Anastasiia

