I would agree with Eugene. A simple application that does this is probably
what you're looking for.
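
For the simple route, a minimal sketch (assuming the records are lines of
text; transformRecord() is just a placeholder for whatever processing you
actually need):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SequentialProcessor {

  // Placeholder for your per-record transformations.
  static String transformRecord(String line) {
    return line.trim();
  }

  public static void main(String[] args) throws Exception {
    try (BufferedReader in =
             Files.newBufferedReader(Paths.get("testfile"), StandardCharsets.UTF_8);
         BufferedWriter out =
             Files.newBufferedWriter(Paths.get("testfile.out"), StandardCharsets.UTF_8)) {
      String line;
      // Reading and writing sequentially preserves the original record order.
      while ((line = in.readLine()) != null) {
        out.write(transformRecord(line));
        out.newLine();
      }
    }
  }
}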

There are ways to make this work with parallel processing systems, but it's
quite a hassle and only worthwhile if your computation is very expensive and
you want the additional computational power of multiple CPU cores. For
example, in a parallel processing system you could read the records from the
file and remember the file offset / line number of each record. You could
then group them under a single key, use the sorting extension to sort by the
file offset / line number, and write all the sorted records out to a single
file. Note that this will likely be a lot slower than a simple program.
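
To make that concrete, here is a rough sketch of what such a pipeline could
look like with the Java SDK's sorter extension (the file paths, the
zero-padded line-number key and the placeholder transform are just
assumptions to illustrate the idea, not a drop-in solution):

import java.io.BufferedReader;
import java.nio.channels.Channels;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sorter.BufferedExternalSorter;
import org.apache.beam.sdk.extensions.sorter.SortValues;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class OrderedSingleFilePipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply(FileIO.match().filepattern("/path/to/testfile"))
     .apply(FileIO.readMatches())
     // Read the file inside a DoFn so every record keeps its line number.
     .apply("ReadWithLineNumbers",
         ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, KV<String, String>>>() {
           @ProcessElement
           public void process(@Element FileIO.ReadableFile file,
               OutputReceiver<KV<String, KV<String, String>>> out) throws Exception {
             try (BufferedReader reader =
                 new BufferedReader(Channels.newReader(file.open(), "UTF-8"))) {
               long lineNumber = 0;
               String line;
               while ((line = reader.readLine()) != null) {
                 // Single primary key ("all") plus a zero-padded line number as the
                 // secondary key, so byte-wise sorting matches numeric order.
                 out.output(KV.of("all",
                     KV.of(String.format("%015d", lineNumber++), line)));
               }
             }
           }
         }))
     // Your real ParDo transformations would go here; this placeholder just
     // uppercases the value while keeping both keys attached.
     .apply("Transform", MapElements.into(
             TypeDescriptors.kvs(TypeDescriptors.strings(),
                 TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings())))
         .via((KV<String, KV<String, String>> kv) -> KV.of(kv.getKey(),
             KV.of(kv.getValue().getKey(), kv.getValue().getValue().toUpperCase()))))
     .apply(GroupByKey.<String, KV<String, String>>create())
     // Sort each group by the encoded bytes of the secondary key (the line number).
     .apply(SortValues.<String, String, String>create(BufferedExternalSorter.options()))
     // Drop the keys and emit the values in sorted order.
     .apply("DropKeys",
         ParDo.of(new DoFn<KV<String, Iterable<KV<String, String>>>, String>() {
           @ProcessElement
           public void process(@Element KV<String, Iterable<KV<String, String>>> group,
               OutputReceiver<String> out) {
             for (KV<String, String> record : group.getValue()) {
               out.output(record.getValue());
             }
           }
         }))
     .apply(TextIO.write().to("/path/to/output").withoutSharding());

    p.run().waitUntilFinish();
  }
}

Because everything ends up under a single key, the grouping, sorting and
final write all happen on one worker, which is why this tends to be slower
than the plain sequential program above.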

On Tue, Aug 21, 2018 at 8:02 AM Eugene Kirpichov <[email protected]>
wrote:

> It sounds like you want to sequentially read a file, sequentially process
> the records, and sequentially write them. The best way to do this is likely
> without using Beam: just write some Java or Python code using standard file
> APIs (use Beam's FileSystem APIs if you need to access data on a non-local
> filesystem).
>
> On Tue, Aug 21, 2018 at 7:11 AM [email protected] <[email protected]>
> wrote:
>
>> Hi
>>
>> I have to process a big file and call several ParDo's to do some
>> transformations.  Records in the file don't have any unique key.
>>
>> Let's say file 'testfile' has 1 million records.
>>
>> After processing, I want to generate only one output file, the same as my
>> input 'testfile', and I also have a requirement to write those 1 million
>> records in the same order (after applying some ParDo's).
>>
>> What is the best way to do it?
>>
>> Thanks
>> Aniruddh
>>
