Hm, I am not very familiar with POI, but if its readers are able to take
in an open file handle, you should be able to use FileIO.match()[0] to find
your files (local, or in GCS/S3/HDFS), and FileIO.readMatches()[1] to get
readable handles (FileIO.ReadableFile) for these files.

If the POI libraries require the files to be local on your machine, you may
need to use FileSystems.copy[2] to copy your files locally, and then
analyze them.
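
For the local-file case, here is a minimal sketch of one way to do the
staging, assuming you already hold a ReadableByteChannel for the file (which
is what FileIO.ReadableFile.open() returns). It uses plain java.nio rather
than FileSystems.copy, and the names LocalStaging and stageLocally are made
up for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Beam or POI.
final class LocalStaging {

    private LocalStaging() {}

    /**
     * Copies everything readable from the channel into a new temp file
     * and returns the path of that file. The channel is closed afterwards.
     */
    static Path stageLocally(ReadableByteChannel channel, String suffix) throws IOException {
        // createTempFile already creates an empty file, so the copy below
        // has to be allowed to overwrite it.
        Path tmp = Files.createTempFile("beam-staged-", suffix);
        try (InputStream in = Channels.newInputStream(channel)) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }
}
```

Inside a DoFn that receives FileIO.ReadableFile elements, you could call
stageLocally(file.open(), ".xlsx"), pass the resulting path to POI, and
delete the temp file once you are done with it.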

Let me know if those are some useful building blocks for your pipeline,
Best
-P.

[0]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileIO.html#match--

[1]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileIO.html#readMatches--

[2]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/FileSystems.html#copy-java.util.List-java.util.List-org.apache.beam.sdk.io.fs.MoveOptions...-


On Mon, Apr 15, 2019 at 6:20 PM Henrique Molina <[email protected]>
wrote:

> Hi Pablo,
> Thanks for your attention.
> I am so sorry, my mistake: I wrote "Cs extension" but I meant the .csv extension!
> An example like this: load-csv-file-from-google-cloud-storage
> <https://kontext.tech/docs/DataAndBusinessIntelligence/p/load-csv-file-from-google-cloud-storage-to-bigquery-using-dataflow>
>
> I was thinking of using Apache POI to read each row from the sheet and
> pass CellRow rows to the next ParDo,
> something like this:
> .apply("xlsxToMap", ParDo.of(new DoFn<CellRow, Map<String,String>>() {.....
>
> I don't know if it is more elegant...
>
> If you have any ideas, let me know. They will be welcome!!
>
>
> On Mon, Apr 15, 2019 at 6:01 PM Pablo Estrada <[email protected]> wrote:
>
>> Hello Henrique,
>>
>> I am not aware of existing Beam transforms specifically used for reading
>> in XLSX data. Can you share what you mean by "examples related with Cs
>> extension"?
>>
>> I am aware of some Python libraries for this sort of thing[1]. You could
>> use the FileIO transforms in the Python SDK to find each file, and then
>> write a DoFn that is able to read in data from these files. Check out this
>> unit test using FileIO to read CSV files[2].
>>
>> Let me know if that helps, or if I went in the wrong direction from what
>> you needed.
>> Best
>> -P.
>>
>> [1] https://openpyxl.readthedocs.io/en/stable/
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio_test.py#L128-L148
>>
>> On Mon, Apr 15, 2019 at 12:47 PM Henrique Molina <
>> [email protected]> wrote:
>>
>>> Hello
>>>
>>> I would like to use best practices from Apache Beam to read XLSX files.
>>> However, I found examples only related with Cs extension.
>>> Does anyone have a sample using ParDo to collect all columns and sheets
>>> from an Excel XLSX file?
>>> Afterwards I will load the data into Google BigQuery.
>>>
>>> Thanks & Regards
>>>
>>>
>>
