TheNeuralBit commented on a change in pull request #12823:
URL: https://github.com/apache/beam/pull/12823#discussion_r493921269



##########
File path: website/www/site/content/en/documentation/io/built-in/snowflake.md
##########
@@ -362,3 +635,208 @@ static SnowflakeIO.CsvMapper<GenericRecord> getCsvMapper() {
            };
 }
 {{< /highlight >}}
+## Using SnowflakeIO in Python SDK
+### Intro
+The Snowflake cross-language implementation supports both reading and writing operations for the Python programming language, thanks to cross-language transforms, which are part of the [Portability Framework Roadmap](https://beam.apache.org/roadmap/portability/) that aims to provide full interoperability across the Beam ecosystem. From a developer perspective it means the possibility of combining transforms written in different languages (Java/Python/Go).
+
+Currently, cross-language supports only [Apache Flink](https://flink.apache.org/) as a runner in a stable manner, but the plan is to support all runners.
+For more information about cross-language, please see the [multi SDK efforts](https://beam.apache.org/roadmap/connectors-multi-sdk/)
+and [Beam on top of Flink](https://flink.apache.org/ecosystem/2020/02/22/apache-beam-how-beam-runs-on-top-of-flink.html) articles.
+
+### Set up
+Please see [Apache Beam with Flink runner](https://beam.apache.org/documentation/runners/flink/) for setup instructions.
+
+### Reading from Snowflake
+One of the functions of SnowflakeIO is reading Snowflake tables, either full tables via table name or custom data via query. The output of the read transform is a [PCollection](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.pvalue.html#apache_beam.pvalue.PCollection) of a user-defined data type.
+#### General usage
+{{< highlight >}}
+OPTIONS = [
+   "--runner=FlinkRunner",
+   "--flink_version=1.10",
+   "--flink_master=localhost:8081",
+   "--environment_type=LOOPBACK"
+]
+
+with TestPipeline(options=PipelineOptions(OPTIONS)) as p:
+   (p
+       | ReadFromSnowflake(
+           server_name=<SNOWFLAKE SERVER NAME>,
+           username=<SNOWFLAKE USERNAME>,
+           password=<SNOWFLAKE PASSWORD>,
+           o_auth_token=<OAUTH TOKEN>,
+           private_key_path=<PATH TO P8 FILE>,
+           raw_private_key=<PRIVATE_KEY>,
+           private_key_passphrase=<PASSWORD FOR KEY>,
+           schema=<SNOWFLAKE SCHEMA>,
+           database=<SNOWFLAKE DATABASE>,
+           staging_bucket_name=<GCS BUCKET NAME>,
+           storage_integration_name=<SNOWFLAKE STORAGE INTEGRATION NAME>,
+           csv_mapper=<CSV MAPPER FUNCTION>,
+           table=<SNOWFLAKE TABLE>,
+           query=<IF NOT TABLE THEN QUERY>,
+           role=<SNOWFLAKE ROLE>,
+           warehouse=<SNOWFLAKE WAREHOUSE>,
+           expansion_service=<EXPANSION SERVICE ADDRESS>))
+{{< /highlight >}}
+
+#### Required parameters
+- `server_name` Full Snowflake server name with an account, zone, and domain.
+
+- `schema` Name of the Snowflake schema in the database to use.
+
+- `database` Name of the Snowflake database to use.
+
+- `staging_bucket_name` Name of the Google Cloud Storage bucket. The bucket will be used as a temporary location for storing CSV files. Those temporary directories will be named `sf_copy_csv_DATE_TIME_RANDOMSUFFIX` and they will be removed automatically once the Read operation finishes.
+
+- `storage_integration_name` The name of a Snowflake storage integration object created according to the [Snowflake documentation](https://docs.snowflake.net/manuals/sql-reference/sql/create-storage-integration.html).
+
+- `csv_mapper` Specifies a function which must translate a user-defined object to an array of strings. SnowflakeIO uses a [COPY INTO <location>](https://docs.snowflake.net/manuals/sql-reference/sql/copy-into-location.html) statement to move data from a Snowflake table to Google Cloud Storage as CSV files. These files are then downloaded via [FileIO](https://beam.apache.org/releases/javadoc/2.3.0/index.html?org/apache/beam/sdk/io/FileIO.html) and processed line by line. Each line is split into an array of strings using the [OpenCSV](http://opencsv.sourceforge.net/) library. The job of the csv_mapper function is to give the user the possibility to convert the array of strings to a user-defined type, e.g. GenericRecord for Avro or Parquet files, or custom objects, as sketched below.
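+
+For example, a minimal sketch of such a mapper (assuming a hypothetical table with `NAME` and `AGE` columns) could convert each row into a dictionary; any user-defined type can be returned instead:
+{{< highlight >}}
+def csv_mapper(strings_array):
+    # strings_array is one CSV line already split into strings by OpenCSV,
+    # e.g. ['Alice', '42']; convert it into the type the PCollection should hold.
+    return {
+        'name': strings_array[0],
+        'age': int(strings_array[1]),
+    }
+{{< /highlight >}}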

Review comment:
       Similarly, on the `WriteToSnowflake` side we can inspect user types and infer a schema from them.



