Hi François,

If I understand correctly, you can use SQL or the Table API to solve your
problem.
Since you want to project only a subset of the columns from the source, a
columnar storage format like Parquet/ORC would be efficient. Currently, an
ORC table source is supported in Flink; you can find more details here[1].
There are also many other table sources[2] to choose from. With a
TableSource, you can read the data, register it as a Table, and perform
table operations through SQL[3] or the Table API[4].
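
For example, a rough sketch in Java (untested; the path, ORC schema, table
name, and column names below are placeholders you would replace with your
own):

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.orc.OrcTableSource;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.BatchTableEnvironment;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

// build an ORC table source (path and schema are placeholders)
OrcTableSource orcSource = OrcTableSource.builder()
    .path("file:///path/to/data")
    .forOrcSchema("struct<col1:string,col2:int,col3:double>")
    .build();

// register it and project only the columns you need
tEnv.registerTableSource("MyOrcTable", orcSource);
Table projected = tEnv.sqlQuery("SELECT col1, col2 FROM MyOrcTable");

This way the projection is expressed declaratively and can be pushed down
to the columnar source, instead of iterating over the records yourself.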

To build a JSON string from several columns, you can write a user-defined
function[5].
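
A minimal sketch of such a scalar function (again untested; the field
names and column types are assumptions based on your example):

import org.apache.flink.table.functions.ScalarFunction;

public class ToJson extends ScalarFunction {
    // builds a JSON string by hand; you could also use a JSON library here
    public String eval(String col1, Integer col2) {
        return "{\"field1\":\"" + col1 + "\",\"field2\":" + col2 + "}";
    }
}

// register the function and use it in a query
tEnv.registerFunction("toJson", new ToJson());
Table result = tEnv.sqlQuery(
    "SELECT toJson(col1, col2) AS json_column FROM MyOrcTable");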

I also found OrcTableSourceITCase[6], which I think may be helpful for you.

Best, Hequn

[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sourceSinks.html#orctablesource
[2]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sourceSinks.html#table-sources-sinks
[3]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql.html
[4]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/tableApi.html
[5]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/udfs.html
[6]
https://github.com/apache/flink/blob/master/flink-connectors/flink-orc/src/test/java/org/apache/flink/orc/OrcTableSourceITCase.java


On Fri, Jul 6, 2018 at 9:48 PM, françois lacombe <
francois.laco...@dcbrain.com> wrote:

> Hi all,
>
> I'm a new user in the Flink community. This tool sounds great for loading
> files of millions of rows into a pgsql db for a new project.
>
> As I read the docs and examples, I can't find a proper use case of CSV
> loading into pgsql.
> The file I want to load doesn't follow the same structure as the table;
> I have to delete some columns and also build a JSON string from several
> others prior to loading into pgsql.
>
> I plan to use Flink 1.5 Java API and a batch process.
> Is the DataSet class able to strip some columns out of the records I
> load, or should I iterate over each record to delete the columns?
>
> Same question for making a JSON string from several columns of the same
> record?
> E.g. json_column = {"field1":col1, "field2":col2...}
>
> I work with files of 20 million rows and it sounds pretty inefficient to
> iterate over each record.
> Can someone tell me if it's possible or if I have to change my mind about
> this?
>
>
> Thanks in advance, all the best
>
> François Lacombe
>
>
