Do you want to read the data once or monitor a directory and process new files as they appear?
Reading from S3 with Flink's current MonitoringFileSource implementation does not work reliably due to S3's eventually consistent list operation (see FLINK-9940 [1]). Reading a directory also has issues, as it does not work with checkpointing enabled. These limitations could be worked around with custom source implementations.

Best,
Fabian

[1] https://issues.apache.org/jira/browse/FLINK-9940

2018-08-07 19:45 GMT+02:00 srimugunthan dhandapani <srimugunthan.dhandap...@gmail.com>:

> Thanks for the reply. I was mainly thinking of the use case of a streaming job.
> In the approach of porting to Flink's SQL API, is it possible to read Parquet data from S3 and register it as a table in Flink?
>
> On Tue, Aug 7, 2018 at 1:05 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Mugunthan,
>>
>> this depends on the type of your job. Is it a batch or a streaming job?
>> Some queries could be ported to Flink's SQL API as suggested by the link that Hequn posted. In that case, the query would be executed in Flink.
>>
>> Other options are to use a JDBC InputFormat, or to persist the result to files and ingest it with a monitoring file source.
>> These options would mean running the query in Hive/Presto and just ingesting the result (via JDBC or files).
>>
>> Which solution works best for you depends on the details.
>>
>> Best,
>> Fabian
>>
>> 2018-08-07 3:28 GMT+02:00 Hequn Cheng <chenghe...@gmail.com>:
>>
>>> Hi srimugunthan,
>>>
>>> I found a related link [1]. Hope it helps.
>>>
>>> [1] https://stackoverflow.com/questions/41683108/flink-1-1-3-interact-with-hive-2-1-0
>>>
>>> On Tue, Aug 7, 2018 at 2:35 AM, srimugunthan dhandapani <srimugunthan.dhandap...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>> I read the Flink documentation and came across the supported connectors:
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/connectors/index.html#bundled-connectors
>>>>
>>>> We have some data that resides in Hive/Presto that needs to be made available to the Flink job. The data in Hive or Presto is updated once a day or less often.
>>>>
>>>> Ideally we would connect to Hive or Presto, run the query, get back the results, and use them in a Flink job.
>>>> What are the options to achieve something like that?
>>>>
>>>> Thanks,
>>>> mugunthan
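
For reference, a minimal sketch of the directory-monitoring option discussed at the top of this thread, using the DataStream API's readFile with PROCESS_CONTINUOUSLY. The S3 path and poll interval are illustrative placeholders, and per FLINK-9940 this mode was unreliable on S3 at the time of this thread:

```java
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class MonitoredDirectoryJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        String path = "s3://my-bucket/exports/";  // hypothetical location
        TextInputFormat format = new TextInputFormat(new Path(path));

        // Re-scan the directory every 60 seconds and emit newly
        // appearing files as they are discovered.
        DataStream<String> lines = env.readFile(
                format, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 60_000L);

        lines.print();
        env.execute("monitor-directory");
    }
}
```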
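
And a minimal sketch of the JDBC-ingestion option Fabian mentions, assuming Flink's flink-jdbc module and the Presto JDBC driver are on the classpath; the connection URL, query, and column types are placeholders for illustration:

```java
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.jdbc.JDBCInputFormat;
import org.apache.flink.api.java.typeutils.RowTypeInfo;

public class PrestoIngestJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Describe the schema of the rows the query returns.
        RowTypeInfo rowType = new RowTypeInfo(
                BasicTypeInfo.STRING_TYPE_INFO,  // e.g. a key column
                BasicTypeInfo.LONG_TYPE_INFO);   // e.g. a value column

        // Run the query in Presto and ingest only its result set.
        JDBCInputFormat jdbcInput = JDBCInputFormat.buildJDBCInputFormat()
                .setDrivername("com.facebook.presto.jdbc.PrestoDriver")
                .setDBUrl("jdbc:presto://presto-host:8080/hive/default")
                .setQuery("SELECT key_col, value_col FROM my_table")
                .setRowTypeInfo(rowType)
                .finish();

        env.createInput(jdbcInput).print();
    }
}
```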