Hi Rafi,

I have a similar use case: I want to read Parquet files into a DataSet,
perform some transformations, and then, like you, write the result
partitioned by year, month and day.

I am stuck at the very first step: how to read and write Parquet files
using the hadoop-compatibility features.

Please help me with this, and also let me know if you find a solution for
writing the data in a partitioned way.

Thanks,
Anuj


On Thu, Oct 25, 2018 at 5:35 PM Andrey Zagrebin <and...@data-artisans.com>
wrote:

> Hi Rafi,
>
> At the moment I do not see any support for Parquet in the DataSet API
> except HadoopOutputFormat, as mentioned in the Stack Overflow question. I have
> cc’ed Fabian and Aljoscha; maybe they can provide more information.
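>
> In case it helps, a rough and untested sketch of writing Parquet via
> HadoopOutputFormat could look like the snippet below. The output path, the
> Avro schema string and the "result" DataSet are placeholders, and it assumes
> a parquet-avro version where AvroParquetOutputFormat is generic:
>
> // Uses org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat,
> // org.apache.parquet.avro.AvroParquetOutputFormat,
> // org.apache.hadoop.mapreduce.lib.output.FileOutputFormat and org.apache.hadoop.fs.Path.
> Job job = Job.getInstance();
>
> // Placeholder Avro schema describing the records to write.
> Schema schema = new Schema.Parser().parse(schemaJson);
> AvroParquetOutputFormat.setSchema(job, schema);
> FileOutputFormat.setOutputPath(job, new Path("hdfs:///path/to/output"));
>
> // Wrap the Hadoop output format so it can be used as a Flink OutputFormat.
> HadoopOutputFormat<Void, GenericRecord> parquetFormat =
>     new HadoopOutputFormat<>(new AvroParquetOutputFormat<GenericRecord>(), job);
>
> // "result" is assumed to be a DataSet<Tuple2<Void, GenericRecord>>.
> result.output(parquetFormat);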
>
> Best,
> Andrey
>
> On 25 Oct 2018, at 13:08, Rafi Aroch <rafi.ar...@gmail.com> wrote:
>
> Hi,
>
> I'm writing a batch job which reads Parquet, does some aggregations and
> writes back as Parquet files.
> I would like the output to be partitioned by year, month and day of the event
> time, similar to the functionality of the BucketingSink.
>
> I was able to achieve reading from and writing to Parquet by using the
> hadoop-compatibility features.
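>
> For reference, a simplified sketch of how the reading side can look with the
> Hadoop-compatibility wrappers and parquet-avro; the input path is a
> placeholder and GenericRecord is just one possible record type:
>
> // Uses org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat,
> // org.apache.parquet.avro.AvroParquetInputFormat,
> // org.apache.hadoop.mapreduce.lib.input.FileInputFormat and org.apache.hadoop.fs.Path.
> ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>
> Job job = Job.getInstance();
> HadoopInputFormat<Void, GenericRecord> parquetInput =
>     new HadoopInputFormat<>(new AvroParquetInputFormat<GenericRecord>(),
>         Void.class, GenericRecord.class, job);
> FileInputFormat.addInputPath(job, new Path("hdfs:///path/to/input"));
>
> // Each element is a Tuple2 with a null Void key and the Avro record as value.
> DataSet<Tuple2<Void, GenericRecord>> records = env.createInput(parquetInput);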
> However, I couldn't find a way to partition the data by year, month and day to
> create a folder hierarchy accordingly. Everything is written to a single directory.
>
> I did find an unanswered question about this issue:
> https://stackoverflow.com/questions/52204034/apache-flink-does-dataset-api-support-writing-output-to-individual-file-partit
>
> Can anyone suggest a way to achieve this? Maybe there's a way to integrate
> the BucketingSink with the DataSet API? Another solution?
>
> Rafi
>
>
>

-- 
Thanks & Regards,
Anuj Jain
Mob. : +91- 8588817877
Skype : anuj.jain07
