Hello

It's been a while since my last activity on the Beam dev list ;) Happy to be
back!

A few days ago Kasia created a JIRA issue for adding SnowflakeIO:
https://issues.apache.org/jira/browse/BEAM-9722

Today, I'm happy to share the first PR with you, covering SnowflakeIO.Read:
https://github.com/apache/beam/pull/11360

Subsequent PRs (with Write and other parts) will come later, after this one
is approved and merged, as reviewing the whole thing at once would be very
hard.

We're looking forward to your reviews!

Cheers,
Dariusz





On Thu, Mar 26, 2020 at 4:58 PM Katarzyna Kucharczyk <
ka.kucharc...@gmail.com> wrote:

> Hi,
> Thank you for your enthusiasm and for so many questions/comments :) I hope
> to address them all.
>
> Alexey, as far as I know, copy methods have better performance than
> inserts/selects. I think loading and unloading in Beam's JDBC IO is
> currently done with selects and inserts as well. But I saw a COPY command
> in the Postgres JDBC driver; maybe it's something worth investigating in
> the future?
> As for other cloud storages, we thought GCP was a good starting point. It
> also makes sense when using Dataflow as a runner, so the user would incur
> expenses with only one provider. But I think it would be great to add
> other storages in the future. As Ismaël mentioned, it would be good to
> know whether S3 works fine with FileIO as well.
> We didn't think about using Beam Schema in the IO, but it might be worth
> checking for the case of creating a table with a specified schema.
>
> Cham, thanks for the advice about SDF. I wonder how it might influence the
> whole IO. I guess it can be helpful for staging files and splitting in the
> pipeline. The COPY operation is called once for all staged files, and it
> should be optimised on the Snowflake side. I have to research it and check
> how it's done in other IOs.
>
> Ismaël, unfortunately there is no such thing as embedded Snowflake :( What
> we currently plan is to create a fake Snowflake service for unit testing.
>
> Indeed, it is interesting that there are many tools with a similar copy
> pattern. I am curious whether it could be shared functionality in Beam.
>
> Thanks again for all the comments and suggestions - they are extremely
> helpful,
> Kasia
>
> On Tue, Mar 24, 2020 at 10:28 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>
>> Forgot to mention one particularly pesky issue we found in the work on
>> Redshift: being able to write unit tests for this.
>>
>> Is there an embedded version of Snowflake to run those? I would also
>> like, if possible, to get some ideas on how to test this use case.
>>
>> Also, we should probably ensure that the FileIO part is generic enough so
>> we can use S3 too, because users may be using Snowflake on AWS as well.
>>
>>
>> On Tue, Mar 24, 2020 at 10:10 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>
>>> Great!
>>> It seems this pattern (COPY + parallel file read) is becoming a standard
>>> for 'data warehouses'. We are using something similar in the AWS
>>> Redshift PR (WIP); for details: https://github.com/apache/beam/pull/10206
>>>
>>> Maybe it is worth for all of us to check and see if we can converge the
>>> implementations as much as possible to provide users a consistent
>>> experience.
>>>
>>>
>>> On Tue, Mar 24, 2020 at 10:02 AM Elias Djurfeldt <
>>> elias.djurfe...@mirado.com> wrote:
>>>
>>>> Awesome job! I'm very interested in the cross-language support.
>>>>
>>>> Cheers,
>>>>
>>>> On Tue, 24 Mar 2020 at 01:20, Chamikara Jayalath <chamik...@google.com>
>>>> wrote:
>>>>
>>>>> Sounds great. Looks like the operation of the Snowflake source will be
>>>>> similar to the BigQuery source (export files to GCS and read the
>>>>> files). This will allow you to better parallelize reading (the current
>>>>> JDBC source is limited to one worker when reading).
>>>>>
>>>>> Seems like you already support initial splitting using files:
>>>>> https://github.com/PolideaInternal/beam/blob/snowflake-io/sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java#L374
>>>>> Probably also consider supporting dynamic work rebalancing when
>>>>> runners support this through SDF.
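As a rough illustration of the file-based initial splitting mentioned above (plain Python, not the actual Beam SDF API; the file names are made up), each staged file can be treated as an independent unit of work that a runner may hand to a different worker:

```python
# Illustrative sketch only: each staged file becomes its own split,
# so a runner can read different files on different workers in parallel.
# Dynamic work rebalancing via SDF would additionally allow splitting
# a file further while it is being read.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of files previously written to the stage.
staged_files = ["shard-0.csv", "shard-1.csv", "shard-2.csv"]

def read_split(path):
    # A real split would open the staged file on GCS and parse its records.
    return f"records from {path}"

# Each split is processed independently, here on a small thread pool.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(read_split, staged_files))

print(results)
```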
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 23, 2020 at 9:49 AM Alexey Romanenko <
>>>>> aromanenko....@gmail.com> wrote:
>>>>>
>>>>>> Great! It is always welcome to have more IOs in Beam. I’d be happy
>>>>>> to take a look at your PR once it is created.
>>>>>>
>>>>>> Just a couple of questions for now.
>>>>>>
>>>>>> 1) Afaik, you can connect to Snowflake using the standard JDBC driver.
>>>>>> Do you plan to compare the performance of this SnowflakeIO and Beam
>>>>>> JdbcIO?
>>>>>> 2) Are you going to support staging in other locations, like S3 and
>>>>>> Azure?
>>>>>> 3) Does “withSchema()” allow mapping the Snowflake schema to a Beam
>>>>>> schema?
>>>>>>
>>>>>> On 23 Mar 2020, at 15:23, Katarzyna Kucharczyk <
>>>>>> ka.kucharc...@gmail.com> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> My colleagues and I have developed a new Java connector for Snowflake
>>>>>> that we would like to add to Beam.
>>>>>>
>>>>>> Snowflake is an analytic data warehouse provided as
>>>>>> Software-as-a-Service (SaaS). It uses a new SQL database engine with a
>>>>>> unique architecture designed for the cloud. For more details, please
>>>>>> check [1] and [2].
>>>>>>
>>>>>> The proposed Snowflake IOs use the Snowflake JDBC library [3]. The IOs
>>>>>> are batch write and batch read, and both use the Snowflake COPY [4]
>>>>>> operation underneath. In both cases, ParDos load files onto a stage,
>>>>>> and the files are then inserted into the Snowflake table of choice
>>>>>> using the COPY API. The currently supported stage is Google Cloud
>>>>>> Storage [5].
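The stage-then-COPY flow described above can be sketched roughly like this (plain Python with local files standing in for a GCS stage and an in-memory list standing in for the Snowflake table; all names are illustrative, not the connector's API):

```python
# Hedged sketch of the "stage then COPY" pattern: workers write their
# bundles of rows to separate staged files, then a single COPY-like step
# bulk-loads every staged file into the target table in one operation.
import csv
import tempfile
from pathlib import Path

def stage_shards(rows, stage_dir, num_shards=3):
    """Each 'worker' writes its bundle of rows to its own staged file."""
    shards = [rows[i::num_shards] for i in range(num_shards)]
    for i, shard in enumerate(shards):
        with open(Path(stage_dir) / f"shard-{i}.csv", "w", newline="") as f:
            csv.writer(f).writerows(shard)

def copy_into_table(stage_dir):
    """One COPY-like step ingests every staged file into the table."""
    table = []
    for path in sorted(Path(stage_dir).glob("shard-*.csv")):
        with open(path, newline="") as f:
            table.extend(tuple(row) for row in csv.reader(f))
    return table

rows = [("1", "alice"), ("2", "bob"), ("3", "carol"), ("4", "dave")]
with tempfile.TemporaryDirectory() as stage:
    stage_shards(rows, stage)
    table = copy_into_table(stage)

# Every row lands in the table regardless of which shard staged it.
print(sorted(table))
```

In the real connector the heavy lifting of the final step is done server-side by Snowflake's COPY command, which is why it is called once for all staged files.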
>>>>>>
>>>>>> A schema of how the Snowflake Read IO works (the write operation works
>>>>>> similarly, but in the opposite direction):
>>>>>> Here is an Apache Beam fork [6] with the current work on the Snowflake
>>>>>> IO.
>>>>>>
>>>>>> In the near future we would also like to add an IO for writing
>>>>>> streams, which will use SnowPipe, Snowflake's mechanism for continuous
>>>>>> loading [7]. Also, we would like to use cross-language transforms to
>>>>>> provide Python connectors as well.
>>>>>>
>>>>>> We are open to all opinions and suggestions. In case of any
>>>>>> questions/comments, please do not hesitate to post them.
>>>>>>
>>>>>> If there are no objections, I will create JIRA tickets and share them
>>>>>> in this thread.
>>>>>>
>>>>>> Cheers,
>>>>>> Kasia
>>>>>>
>>>>>> [1] https://www.snowflake.com
>>>>>> [2]
>>>>>> https://docs.snowflake.net/manuals/user-guide/intro-key-concepts.html
>>>>>>
>>>>>> [3] https://docs.snowflake.net/manuals/user-guide/jdbc.html
>>>>>> [4]
>>>>>> https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
>>>>>> [5] https://cloud.google.com/storage
>>>>>>
>>>>>> [6]
>>>>>> https://github.com/PolideaInternal/beam/tree/snowflake-io/sdks/java/io/snowflake
>>>>>> [7]
>>>>>> https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe.html
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Elias Djurfeldt
>>>> Mirado Consulting
>>>>
>>>

-- 

Dariusz Aniszewski
Polidea <https://www.polidea.com/> | Lead Software Engineer

M: +48 535 432 708 <+48535432708>
E: dariusz.aniszew...@polidea.com