Re: [PROPOSAL] Snowflake Java Connector for Apache Beam

2020-04-10 Thread Dariusz Aniszewski
Hello

It's been a while since my last activity on the Beam dev list ;) Happy to be
back!

A few days ago Kasia created a JIRA issue for adding SnowflakeIO:
https://issues.apache.org/jira/browse/BEAM-9722

Today, I'm happy to share the first PR with you, covering SnowflakeIO.Read:
https://github.com/apache/beam/pull/11360

Subsequent PRs (with Write and other parts) will come later, after this one
is approved and merged, as reviewing the whole thing at once would be very
hard.

We're looking forward to your reviews!

Cheers,
Dariusz





Re: [PROPOSAL] Snowflake Java Connector for Apache Beam

2020-03-26 Thread Katarzyna Kucharczyk
Hi,
Thank you for your enthusiasm and for so many questions/comments :) I hope
to address them all.

Alexey, as far as I know, copy methods have better performance than
inserts/selects. I believe loading and unloading in Beam's JdbcIO is currently
done with selects and inserts as well. But I saw a copy command in the Postgres
JDBC driver, so maybe it's something worth investigating in the future.
As for other cloud storages, we thought GCP would be a good starting point. It
also makes sense when Dataflow is used as the runner, since the user then only
pays one provider. But I think it would be great to add other storages in the
future; as Ismaël mentioned, it would be good to know whether S3 works fine
with FileIO as well.
We didn't think about using Beam Schema in the IO, but it might be worth
checking for the case of creating a table with a specified schema.

Cham, thanks for the advice about SDF. I wonder how it might influence the
whole IO. I guess it could help while staging files and splitting in the
pipeline. The COPY operation is called once for all staged files and should be
optimised on the Snowflake side. I have to research it and check how it's done
in other IOs.
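For illustration, the kind of statement that single bulk-load step could issue
is sketched below; this is not the connector's code, and the table name, stage
path, and file names are placeholders.

/** Illustrative only: one COPY statement covering a whole batch of staged files. */
public class CopyStatementExample {
  // Table, stage path, and file names below are placeholders.
  static final String COPY_ALL_STAGED_FILES =
      "COPY INTO my_table "
          + "FROM @my_gcs_stage/output/ "
          + "FILES = ('part-0001.csv', 'part-0002.csv', 'part-0003.csv') "
          + "FILE_FORMAT = (TYPE = CSV)";

  public static void main(String[] args) {
    // A real pipeline would execute this once, over JDBC, after all files are staged.
    System.out.println(COPY_ALL_STAGED_FILES);
  }
}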

Ismaël, unfortunately there is no such thing as an embedded Snowflake :( What
we currently plan is to create a fake Snowflake service for unit testing.
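One possible shape for such a fake is sketched below, assuming the IO talks to
Snowflake through a small service interface it can swap out in tests; every
name here is hypothetical and not taken from the fork.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Operations the connector needs from Snowflake (method names are illustrative). */
interface SnowflakeService {
  /** COPY the given staged files into a table; returns the number of files loaded. */
  long copyIntoTable(String table, List<String> stagedFiles);

  /** COPY a query result onto a stage; returns the staged file paths. */
  List<String> copyIntoStage(String query, String stagePath);
}

/** In-memory fake used by unit tests instead of a live Snowflake connection. */
class FakeSnowflakeService implements SnowflakeService {
  private final Map<String, List<String>> tables = new ConcurrentHashMap<>();

  @Override
  public long copyIntoTable(String table, List<String> stagedFiles) {
    tables.computeIfAbsent(table, t -> new ArrayList<>()).addAll(stagedFiles);
    return stagedFiles.size();
  }

  @Override
  public List<String> copyIntoStage(String query, String stagePath) {
    // Pretend the query result was unloaded into a single staged CSV file.
    return Arrays.asList(stagePath + "/result_0_0_0.csv");
  }

  /** Lets a test assert which files ended up "loaded" into a table. */
  List<String> filesLoadedInto(String table) {
    return tables.getOrDefault(table, Collections.emptyList());
  }
}

A test could then run the pipeline against FakeSnowflakeService and assert on
filesLoadedInto() instead of querying a real account.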

Indeed, it is interesting that so many tools share a similar copy pattern. I am
curious whether it could become shared functionality in Beam.

Thanks again for all comments and suggestions - those are extremely helpful,
Kasia

Re: [PROPOSAL] Snowflake Java Connector for Apache Beam

2020-03-24 Thread Ismaël Mejía
Forgot to mention one particularly pesky issue we found in the work on
Redshift: being able to write unit tests for this.

Is there an embedded version of Snowflake to run them against? If possible, I
would also like to get some ideas on how to test this use case.

Also, we should probably ensure that the FileIO part is generic enough that we
can use S3 too, because users may be running Snowflake on AWS as well.


Re: [PROPOSAL] Snowflake Java Connector for Apache Beam

2020-03-24 Thread Ismaël Mejía
Great!
It seems this pattern (COPY + parallel file read) is becoming a standard for
'data warehouses'. We are using something similar in the AWS Redshift PR (WIP);
for details: https://github.com/apache/beam/pull/10206

Maybe it is worth it for all of us to check and see if we can converge the
implementations as much as possible, to provide users a consistent experience.


Re: [PROPOSAL] Snowflake Java Connector for Apache Beam

2020-03-24 Thread Elias Djurfeldt
Awesome job! I'm very interested in the cross-language support.

Cheers,


-- 
Elias Djurfeldt
Mirado Consulting


Re: [PROPOSAL] Snowflake Java Connector for Apache Beam

2020-03-23 Thread Chamikara Jayalath
Sounds great. It looks like the Snowflake source will operate similarly to the
BigQuery source (export files to GCS and read the files). This will allow you
to better parallelize reading (the current JDBC source is limited to one worker
when reading).
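For illustration, the "export, then read the files" half of that pattern can be
expressed with Beam's existing file transforms; this is only a sketch, and the
bucket path below is a placeholder.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class StagedFilesReadExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Assume a COPY INTO <stage> step already unloaded the query result
    // to CSV files under this (placeholder) GCS prefix.
    PCollection<String> lines =
        p.apply(Create.of("gs://my-staging-bucket/snowflake-export/*.csv"))
            .apply(FileIO.matchAll())     // expand the glob into file metadata
            .apply(FileIO.readMatches())  // open each matched file
            .apply(TextIO.readFiles());   // read lines; files are split across workers

    // ... parse the CSV lines into domain objects here ...
    p.run().waitUntilFinish();
  }
}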

It seems you already support initial splitting using files:
https://github.com/PolideaInternal/beam/blob/snowflake-io/sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java#L374
Probably also consider supporting dynamic work rebalancing when runners
support this through SDF.

Thanks,
Cham




Re: [PROPOSAL] Snowflake Java Connector for Apache Beam

2020-03-23 Thread Alexey Romanenko
Great! It is always welcome to have more IOs in Beam. I'd be happy to take a
look at your PR once it is created.

Just a couple of questions for now.

1) Afaik, you can connect to Snowflake using the standard JDBC driver. Do you
plan to compare performance between this SnowflakeIO and Beam's JdbcIO? (A
rough JdbcIO sketch follows below for reference.)
2) Are you going to support staging in other locations, like S3 and Azure?
3) Does "withSchema()" allow inferring a Beam schema from the Snowflake schema?
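For reference on question 1, reading Snowflake through the existing JdbcIO
would look roughly like the sketch below; the driver class name and connection
details are assumptions, not tested values.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.PCollection;

public class SnowflakeViaJdbcIO {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    PCollection<String> names =
        p.apply(
            JdbcIO.<String>read()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                            "net.snowflake.client.jdbc.SnowflakeDriver",          // assumed driver class
                            "jdbc:snowflake://<account>.snowflakecomputing.com/") // placeholder URL
                        .withUsername("<user>")
                        .withPassword("<password>"))
                .withQuery("SELECT name FROM my_table")
                .withRowMapper(rs -> rs.getString(1))  // one result row -> one String
                .withCoder(StringUtf8Coder.of()));

    p.run().waitUntilFinish();
  }
}

Such a read runs the query on a single worker, which is exactly the limitation
the COPY-based approach is meant to avoid.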

Re: [PROPOSAL] Snowflake Java Connector for Apache Beam

2020-03-23 Thread Jean-Baptiste Onofre
Hi,

It’s very interesting. +1 to create a Jira and prepare a PR for review.

Thanks !
Regards
JB

[PROPOSAL] Snowflake Java Connector for Apache Beam

2020-03-23 Thread Katarzyna Kucharczyk
Hi all,

My colleagues and I have developed a new Java connector for Snowflake that we
would like to add to Beam.

Snowflake is an analytic data warehouse provided as Software-as-a-Service
(SaaS). It uses a new SQL database engine with a unique architecture designed
for the cloud. For more details, please check [1] and [2].

The proposed Snowflake IOs use the Snowflake JDBC library [3]. The IOs are
batch write and batch read, and both use the Snowflake COPY [4] operation
underneath. In both cases the IOs use ParDos to work with files on a stage, and
the data is moved between the staged files and the Snowflake table of choice
using the COPY API. The currently supported stage is Google Cloud Storage [5].
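To make the read path concrete, below is a rough sketch of that pattern written
with plain JDBC plus Beam's file transforms; it is not the code from the fork
[6], and the connection, stage, and bucket names are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class SnowflakeReadPatternSketch {

  /** Unloads a query result onto the GCS-backed stage and emits the resulting file pattern. */
  static class UnloadToStageFn extends DoFn<String, String> {
    @ProcessElement
    public void process(@Element String query, OutputReceiver<String> out) throws Exception {
      try (Connection conn =
              DriverManager.getConnection(
                  "jdbc:snowflake://<account>.snowflakecomputing.com/", "<user>", "<password>");
          Statement stmt = conn.createStatement()) {
        // Single COPY call: Snowflake writes the query result as CSV files onto the stage.
        stmt.execute(
            "COPY INTO @my_gcs_stage/export/ FROM (" + query + ") FILE_FORMAT = (TYPE = CSV)");
      }
      out.output("gs://my-staging-bucket/export/*.csv");
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    PCollection<String> csvLines =
        p.apply(Create.of("SELECT * FROM my_table"))
            .apply(ParDo.of(new UnloadToStageFn()))  // one COPY for the whole query
            .apply(FileIO.matchAll())                // expand the emitted file pattern
            .apply(FileIO.readMatches())
            .apply(TextIO.readFiles());              // staged files are read in parallel

    p.run().waitUntilFinish();
  }
}

The proposed SnowflakeIO wraps this staging-and-COPY logic behind a single
transform, so users do not have to issue COPY statements themselves.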

[Diagram: how the Snowflake Read IO works; the write operation works similarly,
in the opposite direction.]

Here is an Apache Beam fork [6] with the current work on the Snowflake IO.

In the near future we would also like to add an IO for writing streams, which
will use Snowpipe, Snowflake's mechanism for continuous loading [7]. We would
also like to use cross-language transforms to provide Python connectors as
well.

We are open to all opinions and suggestions. If you have any
questions/comments, please do not hesitate to post them.

If there are no objections, I will create Jira tickets and share them in this
thread.

Cheers,
Kasia

[1] https://www.snowflake.com
[2] https://docs.snowflake.net/manuals/user-guide/intro-key-concepts.html
[3] https://docs.snowflake.net/manuals/user-guide/jdbc.html
[4] https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
[5] https://cloud.google.com/storage
[6] https://github.com/PolideaInternal/beam/tree/snowflake-io/sdks/java/io/snowflake
[7] https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe.html