Great! It seems this pattern (COPY + parallel file read) is becoming a standard for data warehouses; we are using something similar in the AWS Redshift PR (WIP). For details: https://github.com/apache/beam/pull/10206
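To make the comparison concrete, here is a minimal sketch of the pattern in Beam Java. All names below (the stage, bucket path, table, JDBC URL) and the exact COPY syntax are illustrative assumptions, not code from either PR:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class CopyThenParallelRead {
  // Illustrative placeholder; real code would take this from pipeline options.
  private static final String JDBC_URL =
      "jdbc:snowflake://myaccount.snowflakecomputing.com/";

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Step 1: a single worker issues one COPY statement that asks the
    // warehouse to export the table into files on the staging bucket.
    PCollection<String> fileGlobs =
        p.apply("Seed", Create.of("single-copy-trigger"))
         .apply("ExportViaCopy", ParDo.of(new DoFn<String, String>() {
           @ProcessElement
           public void processElement(ProcessContext c) throws Exception {
             try (Connection conn = DriverManager.getConnection(JDBC_URL);
                  Statement stmt = conn.createStatement()) {
               // Snowflake-flavoured export; stage and table names are made up.
               stmt.execute(
                   "COPY INTO @my_gcs_stage/export/ FROM my_table"
                       + " FILE_FORMAT = (TYPE = CSV)");
             }
             c.output("gs://my-staging-bucket/export/*.csv");
           }
         }));

    // Step 2: the staged files are matched and read in parallel across all
    // workers, instead of funnelling every row through one JDBC connection.
    PCollection<String> rows =
        fileGlobs.apply("ReadStagedFiles", TextIO.readAll());

    p.run().waitUntilFinish();
  }
}

The property both IOs seem to share is that only step 1 touches JDBC; the heavy lifting happens in step 2, which the runner can split like any file-based read.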
Maybe it is worth it for all of us to check and see if we can converge the implementations as much as possible, to give users a consistent experience.

On Tue, Mar 24, 2020 at 10:02 AM Elias Djurfeldt <elias.djurfe...@mirado.com> wrote:

> Awesome job! I'm very interested in the cross-language support.
>
> Cheers,
>
> On Tue, 24 Mar 2020 at 01:20, Chamikara Jayalath <chamik...@google.com> wrote:
>
>> Sounds great. It looks like the Snowflake source will operate similarly to the BigQuery source (export files to GCS and read the files). This will let you parallelize reading better (the current JDBC source is limited to one worker when reading).
>>
>> It seems you already support initial splitting using files:
>> https://github.com/PolideaInternal/beam/blob/snowflake-io/sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java#L374
>> Please also consider supporting dynamic work rebalancing when runners support this through SDF.
>>
>> Thanks,
>> Cham
>>
>> On Mon, Mar 23, 2020 at 9:49 AM Alexey Romanenko <aromanenko....@gmail.com> wrote:
>>
>>> Great! It is always welcome to have more IOs in Beam. I'd be happy to take a look at your PR once it is created.
>>>
>>> Just a couple of questions for now:
>>>
>>> 1) AFAIK, you can connect to Snowflake using the standard JDBC driver. Do you plan to compare performance between this SnowflakeIO and Beam's JdbcIO?
>>> 2) Are you going to support staging in other locations, like S3 and Azure?
>>> 3) Does withSchema() allow inferring the Snowflake schema as a Beam schema?
>>>
>>> On 23 Mar 2020, at 15:23, Katarzyna Kucharczyk <ka.kucharc...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> My colleagues and I have developed a new Java connector for Snowflake that we would like to add to Beam.
>>>
>>> Snowflake is an analytic data warehouse provided as Software-as-a-Service (SaaS). It uses a new SQL database engine with a unique architecture designed for the cloud. For more details, please check [1] and [2].
>>>
>>> The proposed Snowflake IOs use the Snowflake JDBC library [3]. The IOs are batch write and batch read, both using the Snowflake COPY [4] operation underneath. In both cases, ParDos load files onto a stage, and the files are then inserted into the Snowflake table of choice using the COPY API. The currently supported stage is Google Cloud Storage [5].
>>>
>>> [Diagram: how the Snowflake read IO works; the write operation works similarly, but in the opposite direction.]
>>>
>>> Here is an Apache Beam fork [6] with the current work on the Snowflake IO.
>>>
>>> In the near future we would also like to add an IO for writing streams, which will use SnowPipe, Snowflake's mechanism for continuous loading [7]. We would also like to use cross-language transforms to provide Python connectors as well.
>>>
>>> We are open to all opinions and suggestions. In case of any questions or comments, please do not hesitate to post them.
>>>
>>> If there are no objections, I will create Jira tickets and share them in this thread.
>>> Cheers,
>>> Kasia
>>>
>>> [1] https://www.snowflake.com
>>> [2] https://docs.snowflake.net/manuals/user-guide/intro-key-concepts.html
>>> [3] https://docs.snowflake.net/manuals/user-guide/jdbc.html
>>> [4] https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
>>> [5] https://cloud.google.com/storage
>>> [6] https://github.com/PolideaInternal/beam/tree/snowflake-io/sdks/java/io/snowflake
>>> [7] https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe.html
>
> --
> Elias Djurfeldt
> Mirado Consulting