Hi Christopher,

Thanks for clarifying. In that case, could you preprocess the PCollection
with a custom FlatMapElements that converts each Document into one or more
smaller Documents, each small enough to be written to its own file? Then
pair each resulting Document with a unique key and follow with
FileIO.writeDynamic().by(<the unique key>).withNumShards(1) to produce
one file per document.
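
To make the splitting step more concrete, here is a rough sketch in plain
Java (no Beam dependency) of the logic that could sit inside the
FlatMapElements. Everything in it is an assumption for illustration: the
Document class below is a hypothetical stand-in for your real type, the
"maximum definitions per file" threshold is made up, and keyFor is just one
possible naming scheme for the unique key. It repeats the namespace/include
header in every part so each part keeps the required ordering, but it does
not try to resolve definitions that reference each other across parts.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: splits a Thrift-like document at definition boundaries. */
final class ThriftDocumentSplitter {

    /** Hypothetical stand-in for the Document type discussed in this thread. */
    static final class Document {
        final List<String> headers;      // namespace and include lines
        final List<String> definitions;  // structs, enums, services, ...

        Document(List<String> headers, List<String> definitions) {
            this.headers = new ArrayList<>(headers);
            this.definitions = new ArrayList<>(definitions);
        }

        /** Renders headers first, then definitions, as a .thrift file needs. */
        String render() {
            return String.join("\n", headers) + "\n\n"
                + String.join("\n\n", definitions) + "\n";
        }
    }

    /**
     * Splits doc into parts holding at most maxDefinitionsPerFile definitions.
     * Every part carries a copy of the original headers.
     */
    static List<Document> split(Document doc, int maxDefinitionsPerFile) {
        if (maxDefinitionsPerFile < 1) {
            throw new IllegalArgumentException("maxDefinitionsPerFile must be >= 1");
        }
        List<Document> parts = new ArrayList<>();
        if (doc.definitions.isEmpty()) {
            parts.add(doc); // nothing to split, keep the document as one part
            return parts;
        }
        for (int start = 0; start < doc.definitions.size(); start += maxDefinitionsPerFile) {
            int end = Math.min(start + maxDefinitionsPerFile, doc.definitions.size());
            parts.add(new Document(doc.headers, doc.definitions.subList(start, end)));
        }
        return parts;
    }

    /** Hypothetical unique key for one part, usable as the dynamic destination. */
    static String keyFor(String documentId, int partIndex) {
        return documentId + "-part-" + partIndex;
    }
}
```

In the pipeline, split(...) would be the body of the FlatMapElements, and
the value from keyFor(...) would be what you pass to .by(...), so that each
part ends up as its own destination and therefore its own file.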

On Tue, Dec 3, 2019 at 7:55 AM Christopher Larsen <
christopher.lar...@quantiphi.com> wrote:

> Hi Eugene,
>
> Yes, I think you've got it right. In our use case we need to write each
> Document in the PCollection to a separate file, because multiple Documents
> in one file will cause compilation errors and/or incorrect code to be
> generated by the Thrift compiler.
>
> Additionally, some Documents are so large that we would want to split
> them.
>
> On Mon, Dec 2, 2019 at 9:45 PM Eugene Kirpichov <j...@google.com> wrote:
>
>> Hi Christopher,
>>
>> So, you have a PCollection<Document>, and you're writing it to files.
>> FileIO.write/writeDynamic will write several Documents to each file.
>> However, in your use case some of the individual Documents are so large
>> that you instead want each of those large Documents to be split into
>> several files.
>>
>> Before we continue, could you confirm whether my understanding is correct?
>>
>> Thanks.
>>
>> On Mon, Dec 2, 2019 at 7:08 PM Christopher Larsen <
>> christopher.lar...@quantiphi.com> wrote:
>>
>>> Ideally each element (Document) will be written to its own .thrift file
>>> so that it can be compiled without further manipulation.
>>>
>>> But in the case of an extremely large file, I think it would be nice to
>>> split it into smaller files. As for splitting points, I think it could be
>>> split at a boundary in the list of definitions. Thoughts?
>>>
>>> On Mon, Dec 2, 2019 at 4:02 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>> What do you mean by shard the output file? Can it be split at any byte
>>>> location, or only at specific points?
>>>>
>>>> On Mon, Dec 2, 2019 at 2:05 PM Christopher Larsen <
>>>> christopher.lar...@quantiphi.com> wrote:
>>>>
>>>>> Hi Reuven,
>>>>>
>>>>> We would like to write each element to one file but still allow the
>>>>> runner to shard the output file, which could yield more than one output
>>>>> file per element.
>>>>>
>>>>> On Mon, Dec 2, 2019 at 11:55 AM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>> I'm not sure I completely understand the question. Are you saying
>>>>>> that you want each element to be written to only one file, guaranteeing
>>>>>> that two elements are never written to the same file?
>>>>>>
>>>>>> On Mon, Dec 2, 2019 at 11:53 AM Christopher Larsen <
>>>>>> christopher.lar...@quantiphi.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> TL;DR: can you extend FileIO.Sink<T> to write one or more files per
>>>>>>> element instead of one or more elements per file?
>>>>>>>
>>>>>>> In working with Thrift files we have found that, since a .thrift file
>>>>>>> needs to be compiled to generate code, the order of the file's contents
>>>>>>> is important (i.e., the namespace and include elements need to come
>>>>>>> before any definitions).
>>>>>>>
>>>>>>> The issue we are facing is that when implementing
>>>>>>> FileIO.Sink<Document> we cannot determine how many Document objects are
>>>>>>> written to a file, since this is decided by the runner. This can
>>>>>>> result in more than one Document being written to a file, which will
>>>>>>> cause compilation errors.
>>>>>>>
>>>>>>> We know that this can be controlled by writeDynamic, but since we
>>>>>>> believe the default behavior for the connector should be to output a
>>>>>>> Document to one or more files (depending on sharding), we were
>>>>>>> wondering how best to accomplish this.
>>>>>>>
>>>>>>> Best,
>>>>>>> Chris
>>>>>>>
>>>>>>> This message contains information that may be privileged or
>>>>>>> confidential and is the property of the Quantiphi Inc and/or its
>>>>>>> affiliates. It is intended only for the person to whom it is
>>>>>>> addressed. If you are not the intended recipient, any review,
>>>>>>> dissemination, distribution, copying, storage or other use of all or
>>>>>>> any portion of this message is strictly prohibited. If you received
>>>>>>> this message in error, please immediately notify the sender by reply
>>>>>>> e-mail and delete this message in its entirety.
>>>>>>>
>>>>>>
>>>>>
>>>> --
>>> *Regards,*
>>>
>>> ___________________________________________
>>>
>>> *Chris Larsen*
>>>
>>> Data Engineer | Quantiphi Inc. | US and India
>>>
>>> http://www.quantiphi.com | Analytics is in our DNA
>>>
>>> USA: +1 760 504 8477
>>> ____________________________________________
>>>
>>>
>>>
>>
>
