Sounds great. Looks like the operation of the Snowflake source will be similar to the BigQuery source (export files to GCS and read the files). This will allow you to better parallelize reading (the current JDBC source is limited to one worker when reading).
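For reference, a file-based read could look roughly like the sketch below from the user's side. This is only an illustration - DataSourceConfiguration, fromTable, withStagingBucketName, and withCsvMapper are names I'm guessing from the linked branch and may not match the actual API:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.snowflake.SnowflakeIO;
    import org.apache.beam.sdk.values.PCollection;

    Pipeline pipeline = Pipeline.create();

    // Hypothetical connection configuration; method names are assumptions.
    SnowflakeIO.DataSourceConfiguration config =
        SnowflakeIO.DataSourceConfiguration.create()
            .withUsernamePasswordAuth("USERNAME", "PASSWORD")
            .withServerName("account.snowflakecomputing.com")
            .withDatabase("DATABASE")
            .withSchema("PUBLIC");

    PCollection<String> rows =
        pipeline.apply(
            SnowflakeIO.<String>read()
                .withDataSourceConfiguration(config)
                .fromTable("MY_TABLE")
                // The table is exported via COPY to files in this bucket; the
                // files are then read in parallel instead of pulling rows over
                // a single JDBC connection.
                .withStagingBucketName("gs://my-staging-bucket")
                // Maps each CSV line (already split into fields) to an element.
                .withCsvMapper((String[] parts) -> String.join(",", parts))
                .withCoder(StringUtf8Coder.of()));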
Seems like you already support initial splitting using files -
https://github.com/PolideaInternal/beam/blob/snowflake-io/sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java#L374
Probably also consider supporting dynamic work rebalancing when runners
support this through SDF.

Thanks,
Cham

On Mon, Mar 23, 2020 at 9:49 AM Alexey Romanenko <aromanenko....@gmail.com> wrote:

> Great! It is always welcome to have more IOs in Beam. I'd be happy to
> take a look at your PR once it is created.
>
> Just a couple of questions for now.
>
> 1) Afaik, you can connect to Snowflake using the standard JDBC driver. Do
> you plan to compare performance between this SnowflakeIO and Beam's JdbcIO?
> 2) Are you going to support staging in other locations, like S3 and Azure?
> 3) Does withSchema() allow inferring a Beam schema from the Snowflake
> schema?
>
> On 23 Mar 2020, at 15:23, Katarzyna Kucharczyk <ka.kucharc...@gmail.com>
> wrote:
>
> Hi all,
>
> My colleagues and I have developed a new Java connector for Snowflake
> that we would like to add to Beam.
>
> Snowflake is an analytic data warehouse provided as Software-as-a-Service
> (SaaS). It uses a new SQL database engine with a unique architecture
> designed for the cloud. For more details, please see [1] and [2].
>
> The proposed Snowflake IOs use the Snowflake JDBC library [3]. The IOs
> are batch write and batch read, and both use the Snowflake COPY [4]
> operation underneath. In both cases, ParDos load files to a stage, from
> which they are inserted into the Snowflake table of choice using the COPY
> API. The currently supported stage is Google Cloud Storage [5].
>
> [Diagram of how the Snowflake Read IO works; the write operation works
> similarly, but in the opposite direction.]
>
> Here is an Apache Beam fork [6] with the current work on the Snowflake IO.
>
> In the near future we would also like to add an IO for writing streams,
> which will use SnowPipe - the Snowflake mechanism for continuous loading
> [7]. Also, we would like to use cross-language transforms to provide
> Python connectors as well.
>
> We are open to all opinions and suggestions. In case of any
> questions/comments, please do not hesitate to post them.
>
> If there are no objections, I will create Jira tickets and share them in
> this thread.
>
> Cheers,
> Kasia
>
> [1] https://www.snowflake.com
> [2] https://docs.snowflake.net/manuals/user-guide/intro-key-concepts.html
> [3] https://docs.snowflake.net/manuals/user-guide/jdbc.html
> [4] https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
> [5] https://cloud.google.com/storage
> [6] https://github.com/PolideaInternal/beam/tree/snowflake-io/sdks/java/io/snowflake
> [7] https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe.html
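For completeness, the COPY-based batch write Kasia describes might look something like this from the user's side. Again, just a sketch under the same assumptions as the read sketch above; toTable and withUserDataMapper are hypothetical names:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.snowflake.SnowflakeIO;
    import org.apache.beam.sdk.transforms.Create;

    Pipeline pipeline = Pipeline.create();

    pipeline
        .apply(Create.of("alice,1", "bob,2"))
        .apply(
            SnowflakeIO.<String>write()
                .withDataSourceConfiguration(config) // as in the read sketch
                .toTable("MY_TABLE")
                // Records are first written as files to the GCS stage...
                .withStagingBucketName("gs://my-staging-bucket")
                // ...and each record is mapped to an array of column values
                // that COPY then loads into the table.
                .withUserDataMapper((String row) -> row.split(",")));

    pipeline.run().waitUntilFinish();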