Re: How to load multiple same-format files with single batch job?

2019-02-19 Thread françois lacombe
Hi Fabian,

After a bit more documentation reading, I have a better understanding of how
the InputFormat interface works.
Indeed, I'd better wrap a custom InputFormat implementation in my source.
This article helped a lot:
https://brewing.codes/2017/02/06/implementing-flink-batch-data-connector/

connect() will be for a future sprint

All the best

François
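For anyone following along, the InputFormat contract referred to above is a split-then-iterate lifecycle: create input splits, then open/reachedEnd/nextRecord/close per split. Below is a simplified plain-Java mimic of that lifecycle (the SimpleInputFormat interface is illustrative only, not Flink's actual org.apache.flink.api.common.io.InputFormat), showing how a line-based format would stream records from one split per file:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Simplified stand-in for the InputFormat lifecycle:
// createInputSplits() -> open(split) -> { reachedEnd() / nextRecord() }* -> close()
interface SimpleInputFormat<T, S> {
    List<S> createInputSplits(String inputPath) throws IOException;
    void open(S split) throws IOException;
    boolean reachedEnd() throws IOException;
    T nextRecord() throws IOException;
    void close() throws IOException;
}

// Line-oriented implementation: one split per file, records streamed line by
// line with one record of read-ahead, so a file is never loaded in full.
class LineInputFormat implements SimpleInputFormat<String, Path> {
    private BufferedReader reader;
    private String next;

    public List<Path> createInputSplits(String inputPath) throws IOException {
        try (Stream<Path> files = Files.list(Paths.get(inputPath))) {
            return files.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
    }

    public void open(Path split) throws IOException {
        reader = Files.newBufferedReader(split);
        next = reader.readLine(); // read ahead one record
    }

    public boolean reachedEnd() {
        return next == null;
    }

    public String nextRecord() throws IOException {
        String current = next;
        next = reader.readLine();
        return current;
    }

    public void close() throws IOException {
        reader.close();
    }
}
```

In real Flink the same shape appears with richer types (FileInputSplit, configuration, parallel assignment); the read-ahead in open() is just one way to implement reachedEnd() cheaply.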


Re: How to load multiple same-format files with single batch job?

2019-02-15 Thread Fabian Hueske
Hi François,

The TableEnvironment.connect() method can only be used if you provide
(quite a bit) more code.
It requires a TableSourceFactory and handling of all the properties that
are defined in the other builder methods. See [1].

I would recommend either registering the BatchTableSource directly
(tEnv.registerTableSource()) or getting a DataSet (via env.createInput())
and registering it as a Table (tEnv.registerDataSet()).

Best, Fabian

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/table/sourceSinks.html#define-a-tablefactory




Re: How to load multiple same-format files with single batch job?

2019-02-11 Thread françois lacombe
Hi Fabian,

I've run into issues implementing a custom InputFormat with my existing
code.

Can this be used in combination with a custom BatchTableSource?
As I understand your solution, I should move my source to an implementation
like:

tableEnvironment
  .connect(...)
  .withFormat(...)
  .withSchema(...)
  .inAppendMode()
  .registerTableSource("MyTable")

right?

I currently have a BatchTableSource class which produces a DataSet from
a single GeoJSON file.
That doesn't sound compatible with a custom InputFormat, does it?

Thanks in advance for any additional hint, all the best

François




Re: How to load multiple same-format files with single batch job?

2019-02-05 Thread françois lacombe
Thank you Fabian,

That's good; I'll go for a custom file InputFormat.

All the best

François




Re: How to load multiple same-format files with single batch job?

2019-02-04 Thread Fabian Hueske
Hi,

The files will be read in a streaming fashion.
Typically files are broken down into processing splits that are distributed
to tasks for reading.
How a task reads a file split depends on the implementation, but usually
the format reads the split as a stream and does not read the split as a
whole before emitting records.

Best,
Fabian
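To make the split mechanics above concrete, here is a plain-Java simulation (not Flink code; real Flink assigns splits lazily to parallel task slots at runtime, so the round-robin assignment here is a simplification): files become splits, splits are dealt out to tasks, and each task consumes its splits as lazy streams rather than materializing whole files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class SplitSimulation {
    // Deal splits (here: one split per file) out to `parallelism` tasks round-robin.
    static List<List<Path>> assignSplits(List<Path> splits, int parallelism) {
        List<List<Path>> perTask = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) perTask.add(new ArrayList<>());
        for (int i = 0; i < splits.size(); i++) {
            perTask.get(i % parallelism).add(splits.get(i));
        }
        return perTask;
    }

    // A task reads each of its splits as a stream: Files.lines is lazy, so
    // records are emitted while reading instead of loading the file whole.
    static long countRecords(List<Path> taskSplits) throws IOException {
        long n = 0;
        for (Path split : taskSplits) {
            try (Stream<String> lines = Files.lines(split)) {
                n += lines.count();
            }
        }
        return n;
    }
}
```

The point of the simulation is the memory behavior: each task holds at most one open reader per split, never a whole file, which matches the "read the split as a stream" description above.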



Re: How to load multiple same-format files with single batch job?

2019-02-04 Thread françois lacombe
Hi Fabian,

Thank you for this input.
This is interesting.

With such an input format, will the whole file be loaded into memory
before being processed, or will it be streamed?

All the best
François




Re: How to load multiple same-format files with single batch job?

2019-01-29 Thread Fabian Hueske
Hi,

You can point a file-based input format to a directory and the input format
should read all files in that directory.
That works as well for TableSources that internally use file-based
input formats.
Is that what you are looking for?

Best, Fabian
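A plain-Java sketch of that directory behavior (illustrative only, not Flink code; note that Flink's file formats enumerate nested directories only when configured to, so the recursive walk below is an assumption): point the source at a directory, enumerate every regular file beneath it, and emit their records as one sequential stream.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DirectorySource {
    // Pointed at a directory, enumerate every regular file beneath it and
    // read their records (plain lines, here) as one sequential stream.
    static List<String> readAll(String inputPath) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(Paths.get(inputPath))) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<String> records = new ArrayList<>();
        for (Path file : files) {
            try (Stream<String> lines = Files.lines(file)) {
                lines.forEach(records::add);
            }
        }
        return records;
    }
}
```

This is exactly the shape of the 1500-files use case from the question below: one source pointed at one directory URL, files consumed sequentially.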

On Mon, Jan 28, 2019 at 17:22, françois lacombe <
francois.laco...@dcbrain.com> wrote:

> Hi all,
>
> I'm wondering whether it's possible, and what the best way is, to load
> multiple files with a JSON source into a JDBC sink.
> I'm running Flink 1.7.0.
>
> Let's say I have about 1500 files with the same structure (same format,
> schema, everything) and I want to load them with a *batch* job.
> Can Flink handle loading each and every file in a single source and
> sending the data to my JDBC sink?
> I wish I could provide the URL of the directory containing my thousand files
> to the batch source to make it load all of them sequentially.
> My sources and sinks are currently available as BatchTableSource; I guess
> the cost of making them available for streaming would be quite expensive
> for me for the moment.
>
> Has anyone ever done this?
> Am I wrong to expect to do this with a batch job?
>
> All the best
>
> François Lacombe