Github user vesense commented on a diff in the pull request: https://github.com/apache/storm/pull/1777#discussion_r87967670

--- Diff: docs/storm-sql-reference.md ---
@@ -1203,4 +1203,103 @@

and class for aggregate function is here:

For now users can skip implementing the `result` method if it doesn't need to transform the accumulated value, but this behavior is subject to change, so providing `result` is recommended.

Please note that users should use `--jars` or `--artifacts` while running the Storm SQL runner to make sure UDFs and/or UDAFs are available in the classpath.

## External Data Sources

### Specifying External Data Sources

In StormSQL, data is represented by external tables. Users can specify data sources using the `CREATE EXTERNAL TABLE` statement. The syntax of `CREATE EXTERNAL TABLE` closely follows the one defined in [Hive Data Definition Language](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL):

```
CREATE EXTERNAL TABLE table_name field_list
  [ STORED AS
    INPUTFORMAT input_format_classname
    OUTPUTFORMAT output_format_classname
  ]
  LOCATION location
  [ TBLPROPERTIES tbl_properties ]
  [ AS select_stmt ]
```

The default input and output formats are JSON. Supported formats are covered in the `Supported Formats` section below.

For example, the following statement specifies a Kafka spout and sink:

```
CREATE EXTERNAL TABLE FOO (ID INT PRIMARY KEY) LOCATION 'kafka://localhost:2181/brokers?topic=test' TBLPROPERTIES '{"producer":{"bootstrap.servers":"localhost:9092","acks":"1","key.serializer":"org.apache.storm.kafka.IntSerializer","value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"}}'
```

### Plugging in External Data Sources

Users plug in external data sources by implementing the `ISqlTridentDataSource` interface and registering them via the mechanisms of Java's service loader. The external data source is chosen based on the scheme of the table's URI. Please refer to the implementation of `storm-sql-kafka` for more details.

### Supported Formats

| Format | Input format class | Output format class | Requires properties
|:--------------- |:------------------ |:------------------- |:--------------------
| JSON | org.apache.storm.sql.runtime.serde.json.JsonScheme | org.apache.storm.sql.runtime.serde.json.JsonSerializer | No
| Avro | org.apache.storm.sql.runtime.serde.avro.AvroScheme | org.apache.storm.sql.runtime.serde.avro.AvroSerializer | Yes
| CSV | org.apache.storm.sql.runtime.serde.csv.CsvScheme | org.apache.storm.sql.runtime.serde.csv.CsvSerializer | No
| TSV | org.apache.storm.sql.runtime.serde.tsv.TsvScheme | org.apache.storm.sql.runtime.serde.tsv.TsvSerializer | No

#### Avro

Avro requires users to describe the schema of the record for both input and output. The schemas are specified in `TBLPROPERTIES`: the input schema under the `input.avro.schema` key and the output schema under the `output.avro.schema` key. Each schema string should be escaped JSON so that `TBLPROPERTIES` remains valid JSON.
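Writing the escaped schema string by hand is error-prone. One way to generate it is to serialize the schema twice, so the inner quotes are escaped automatically; the following is a small Python sketch using only the standard library (the schema itself is just the `large_orders` example, not anything Storm-specific):

```python
import json

# Avro record schema as a plain Python dict (illustrative
# schema matching the large_orders example).
schema = {
    "type": "record",
    "name": "large_orders",
    "fields": [
        {"name": "ID", "type": "int"},
        {"name": "TOTAL", "type": "int"},
    ],
}

# The inner json.dumps turns the schema into a JSON string; the outer
# json.dumps embeds that string as a value, escaping its quotes so the
# result is a valid TBLPROPERTIES JSON document.
tblproperties = json.dumps({"input.avro.schema": json.dumps(schema)})
print(tblproperties)
```

Parsing the printed value back with a JSON parser recovers the original schema, which is a quick way to verify the escaping is correct.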

Example schema descriptions:

`"input.avro.schema": "{\"type\": \"record\", \"name\": \"large_orders\", \"fields\" : [ {\"name\": \"ID\", \"type\": \"int\"}, {\"name\": \"TOTAL\", \"type\": \"int\"} ]}"`

`"output.avro.schema": "{\"type\": \"record\", \"name\": \"large_orders\", \"fields\" : [ {\"name\": \"ID\", \"type\": \"int\"}, {\"name\": \"TOTAL\", \"type\": \"int\"} ]}"`

#### CSV

It uses the standard RFC 4180 CSV parser and doesn't need any other properties.

--- End diff --

Minor. How about adding a link to RFC 4180? It would be convenient for users who want to look it up.