+1 to what Ankur said. Beam is still a good fit for your use case, since it looks like you can parallelize across individual files even though you cannot parallelize within a single file. What you can do with Beam is limited, though, by your exact ordering requirement: Beam does not guarantee the ordering of elements across steps, so you'll have to read each file, process its records, and write the output in a single fused step (you cannot have GBKs between these steps, for example). To read the file names you can use FileIO.readMatches(), followed by a step that performs the rest of the work.
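For example, here is an untested sketch of what that single fused step could look like. The file pattern, the output path scheme, and transformRecord() are placeholders for your own logic, not part of the Beam API:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class OrderedFileCopy {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.apply(FileIO.match().filepattern("/path/to/input/*"))  // placeholder pattern
     .apply(FileIO.readMatches())
     .apply(ParDo.of(new DoFn<FileIO.ReadableFile, Void>() {
       @ProcessElement
       public void processElement(ProcessContext c) throws Exception {
         FileIO.ReadableFile file = c.element();
         // Derive the output path from the input file name (placeholder scheme).
         ResourceId out = FileSystems.matchNewResource(
             "/path/to/output/" + file.getMetadata().resourceId().getFilename(),
             /* isDirectory */ false);
         // Read, transform, and write inside one DoFn so that record order
         // is preserved end to end for each file.
         try (BufferedReader reader = new BufferedReader(Channels.newReader(
                  file.open(), StandardCharsets.UTF_8.name()));
              BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                  Channels.newOutputStream(FileSystems.create(out, "text/plain")),
                  StandardCharsets.UTF_8))) {
           String line;
           while ((line = reader.readLine()) != null) {
             writer.write(transformRecord(line));  // your per-record logic
             writer.newLine();
           }
         }
       }
     }));
    p.run().waitUntilFinish();
  }

  // Placeholder for the actual per-record transformation (e.g. DLP calls).
  private static String transformRecord(String line) {
    return line;
  }
}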
Thanks,
Cham

On Tue, Aug 21, 2018 at 6:48 PM Ankur Goenka <[email protected]> wrote:

> In case of multiple files, you can use Dataflow to parallelize processing
> of individual files. However, as mentioned earlier, records within a
> single file are not worth parallelizing in this case.
>
> Your pipeline can start with a fixed set of file names followed by a
> GroupBy (to shuffle the file names), and then you should process the
> complete file in your ParDo based on the file name that you get as an
> element.
> You should still write the output directly (using the Beam FileSystems
> API), as write ordering is not guaranteed.
>
> On Tue, Aug 21, 2018 at 6:07 PM [email protected] <[email protected]>
> wrote:
>
>> On 2018/08/21 16:20:13, Lukasz Cwik <[email protected]> wrote:
>> > I would agree with Eugene. A simple application that does this is
>> > probably what you're looking for.
>> >
>> > There are ways to make this work with parallel processing systems, but
>> > it's quite a hassle and only worthwhile if your computation is very
>> > expensive and you want the additional computational power of multiple
>> > CPU cores. For example, in a parallel processing system you could read
>> > the records from the file and remember the file offset / line number
>> > of each record. You could then group them under a single key, use the
>> > sorting extension to sort by the file offset / line number, and then
>> > write all the sorted records out to a single file. Note that this will
>> > likely be a lot slower than a simple program.
>> >
>> > On Tue, Aug 21, 2018 at 8:02 AM Eugene Kirpichov <[email protected]>
>> > wrote:
>> >
>> > > It sounds like you want to sequentially read a file, sequentially
>> > > process the records, and sequentially write them. The best way to do
>> > > this is likely without using Beam: just write some Java or Python
>> > > code using standard file APIs (use Beam's FileSystem APIs if you
>> > > need to access data on a non-local filesystem).
>> > >
>> > > On Tue, Aug 21, 2018 at 7:11 AM [email protected] <[email protected]>
>> > > wrote:
>> > >
>> > >> Hi
>> > >>
>> > >> I have to process a big file and call several ParDos to do some
>> > >> transformations. Records in the file don't have any unique key.
>> > >>
>> > >> Let's say file 'testfile' has 1 million records.
>> > >>
>> > >> After processing, I want to generate only one output file, the same
>> > >> as my input 'testfile', and I also have a requirement to write
>> > >> those 1 million records in the same order (after applying some
>> > >> ParDos).
>> > >>
>> > >> What is the best way to do it?
>> > >>
>> > >> Thanks
>> > >> Aniruddh
>>
>> Thanks Eugene and Lukasz for the revert. One further query, please. The
>> use case is to process 50,000 files together multiple times a day, do
>> some processing on the records (which includes DLP calls), and then
>> generate the same set of 50,000 output files, each output file
>> containing the same number of records in the same order as its input
>> file (a replica of the input file's records, preserving order). Is this
>> an ideal use case for Dataflow? If yes, is there an example to look up
>> for the same scenario, and if not, what is the recommended way? Thanks
>> a lot
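P.S. If you ever do need to parallelize within a single file and restore the order afterwards, the sort-by-offset approach Lukasz described above would look roughly like the untested sketch below. It assumes you have already built a PCollection of (lineNumber, record) pairs while reading the file; zero-padding the line number is needed because the sorter extension orders by the encoded bytes of the secondary key, and padded decimal strings sort byte-wise in numeric order.

import org.apache.beam.sdk.extensions.sorter.BufferedExternalSorter;
import org.apache.beam.sdk.extensions.sorter.SortValues;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class SortByOffset {
  // Takes (lineNumber, record) pairs and returns a single-group collection
  // whose Iterable yields the records in their original file order.
  static PCollection<KV<String, Iterable<KV<String, String>>>> sortByLineNumber(
      PCollection<KV<Long, String>> numbered) {
    return numbered
        // Zero-pad the line number so lexicographic byte order matches
        // numeric order when SortValues compares encoded secondary keys.
        .apply(MapElements
            .into(TypeDescriptors.kvs(
                TypeDescriptors.strings(),
                TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings())))
            .via((KV<Long, String> kv) ->
                KV.of("all", KV.of(String.format("%019d", kv.getKey()), kv.getValue()))))
        // One shared key -> one group holding every record.
        .apply(GroupByKey.create())
        // Sort each group's values by the padded line number.
        .apply(SortValues.<String, String, String>create(BufferedExternalSorter.options()));
    // A downstream DoFn would then iterate the sorted Iterable and write the
    // records to a single output file via the FileSystems API.
  }
}

As Lukasz noted, expect this to be a lot slower than the simple fused approach for most workloads.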
