Hi Christopher,

Thanks for clarifying. In that case, can you preprocess the PCollection with a custom FlatMapElements that converts each Document into one or more smaller Documents, each small enough to be written to its own file? Then pair each resulting Document with a unique key and follow with FileIO.writeDynamic().by(the unique key).withNumShards(1) to produce exactly one file per Document.
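A minimal sketch of the splitting step, in plain Java with no Beam dependencies so it can be run on its own. The `Document` class, its `name`, `header`, and `definitions` fields, the `split` method, and the `name-index` key format are all hypothetical stand-ins, not part of Beam or of the connector discussed in this thread. The sketch assumes a Document can be cut between definitions, that every part must repeat the namespace and include lines, and that `name` is unique across Documents. It does not handle definitions that reference types which end up in a different part.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class ThriftDocumentSplitter {

    /** Hypothetical stand-in for the Document type discussed in the thread. */
    public static final class Document {
        public final String name;               // assumed unique per Document
        public final List<String> header;       // namespace and include lines
        public final List<String> definitions;  // structs, enums, services, ...

        public Document(String name, List<String> header, List<String> definitions) {
            this.name = name;
            this.header = header;
            this.definitions = definitions;
        }

        /** Renders header lines first, then definitions, one per line. */
        public String render() {
            List<String> lines = new ArrayList<>(header);
            lines.addAll(definitions);
            return String.join("\n", lines) + "\n";
        }
    }

    /**
     * Splits a Document into parts holding at most maxDefinitions definitions.
     * Every part repeats the full header. Each part is paired with a key of the
     * form "name-index", which is unique as long as Document names are unique.
     */
    public static List<Map.Entry<String, Document>> split(Document doc, int maxDefinitions) {
        if (maxDefinitions < 1) {
            throw new IllegalArgumentException("maxDefinitions must be at least 1");
        }
        List<Map.Entry<String, Document>> parts = new ArrayList<>();
        int total = doc.definitions.size();
        int start = 0;
        int partIndex = 0;
        // do-while so that a Document without definitions still yields one part.
        do {
            int end = Math.min(start + maxDefinitions, total);
            Document part = new Document(
                doc.name, doc.header, new ArrayList<>(doc.definitions.subList(start, end)));
            parts.add(new SimpleEntry<>(doc.name + "-" + partIndex, part));
            start = end;
            partIndex++;
        } while (start < total);
        return parts;
    }

    public static void main(String[] args) {
        Document doc = new Document(
            "shapes",
            Arrays.asList("namespace java demo"),
            Arrays.asList("struct A {}", "struct B {}", "struct C {}"));
        for (Map.Entry<String, Document> part : split(doc, 2)) {
            System.out.println("key=" + part.getKey());
            System.out.print(part.getValue().render());
        }
    }
}
```

In a pipeline, the function passed to FlatMapElements would call something like `split` and emit each key/part pair; the key is then what `.by(...)` groups on, and `withNumShards(1)` keeps each key in a single file.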

On Tue, Dec 3, 2019 at 7:55 AM Christopher Larsen <
christopher.lar...@quantiphi.com> wrote:

> Hi Eugene,
>
> Yes, I think you've got it correct. In our use case we need to write each
> Document in the PCollection to a separate file, as multiple Documents in a
> file will cause compilation errors and/or incorrect code to be generated by
> the Thrift compiler.
>
> Additionally, there are some Documents that are so large that we would want
> them to be split.
>
> On Mon, Dec 2, 2019 at 9:45 PM Eugene Kirpichov <j...@google.com> wrote:
>
>> Hi Christopher,
>>
>> So, you have a PCollection<Document>, and you're writing it to files.
>> FileIO.write/writeDynamic will write several Documents to each file;
>> however, in your use case some of the individual Documents are so large
>> that you instead want each of those large Documents to be split into
>> several files.
>>
>> Before we continue, could you confirm whether my understanding is correct?
>>
>> Thanks.
>>
>> On Mon, Dec 2, 2019 at 7:08 PM Christopher Larsen <
>> christopher.lar...@quantiphi.com> wrote:
>>
>>> Ideally each element (document) will be written to a .thrift file so
>>> that it can be compiled without further manipulation.
>>>
>>> But in the case of an extremely large file I think it would be nice to
>>> split it into smaller files. As far as splitting points go, I think it
>>> could be split at a point in the list of definitions. Thoughts?
>>>
>>> On Mon, Dec 2, 2019 at 4:02 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>> What do you mean by shard the output file? Can it be split at any byte
>>>> location, or only at specific points?
>>>>
>>>> On Mon, Dec 2, 2019 at 2:05 PM Christopher Larsen <
>>>> christopher.lar...@quantiphi.com> wrote:
>>>>
>>>>> Hi Reuven,
>>>>>
>>>>> We would like to write each element to one file but still allow the
>>>>> runner to shard the output file, which could yield more than one output
>>>>> file per element.
>>>>>
>>>>> On Mon, Dec 2, 2019 at 11:55 AM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>> I'm not sure I completely understand the question. Are you saying
>>>>>> that you want each element to write to only one file, guaranteeing
>>>>>> that two elements are never written to the same file?
>>>>>>
>>>>>> On Mon, Dec 2, 2019 at 11:53 AM Christopher Larsen <
>>>>>> christopher.lar...@quantiphi.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> TL;DR: can you extend FileIO.sink<T> to write one or more files per
>>>>>>> element instead of one or more elements per file?
>>>>>>>
>>>>>>> In working with Thrift files we have found that, since a .thrift file
>>>>>>> needs to be compiled to generate code, the order of the contents of
>>>>>>> the file is important (i.e., the namespace and include elements need
>>>>>>> to come before the definitions).
>>>>>>>
>>>>>>> The issue that we are facing is that by implementing
>>>>>>> FileIO.sink<Document> we cannot determine how many Document objects
>>>>>>> are written to a file, since this is determined by the runner. This
>>>>>>> can result in more than one Document being written to a file, which
>>>>>>> will cause compilation errors.
>>>>>>>
>>>>>>> We know that this can be controlled by writeDynamic, but since we
>>>>>>> believe the default behavior for the connector should be to output a
>>>>>>> Document to one or more files (depending on sharding), we were
>>>>>>> wondering how best to accomplish this.
>>>>>>>
>>>>>>> Best,
>>>>>>> Chris
>>>>>>>
>>>>>>> This message contains information that may be privileged or
>>>>>>> confidential and is the property of the Quantiphi Inc and/or its
>>>>>>> affiliates. It is intended only for the person to whom it is
>>>>>>> addressed. If you are not the intended recipient, any review,
>>>>>>> dissemination, distribution, copying, storage or other use of all or
>>>>>>> any portion of this message is strictly prohibited.
>>>>>>> If you received this message in error, please immediately notify
>>>>>>> the sender by reply e-mail and delete this message in its entirety.
>>>>>>>
>>>>>>
>>>>>
>>>>
>>> --
>>> Regards,
>>>
>>> Chris Larsen
>>> Data Engineer | Quantiphi Inc. | US and India
>>> http://www.quantiphi.com | Analytics is in our DNA
>>> USA: +1 760 504 8477
>>>
>>
>