This sounds similar to the Confluent Schema Registry and Kafka Connect. The Schema Registry and Kafka Connect themselves are open source, but some of the datasource-specific connectors, and the GUIs to manage it all, are not (see the Confluent Enterprise Edition).
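To make the comparison concrete, a schema registry essentially stores versioned schemas per subject and rejects incompatible changes before they reach consumers. A minimal sketch in plain Python (no Confluent client; the class, subject names, and the simplified compatibility rule here are all hypothetical stand-ins — real registries expose this over a REST API with Avro schemas and richer compatibility modes):

```python
# Hypothetical in-memory stand-in for a schema registry. The compatibility
# check is deliberately simplified: existing field types must not change,
# while new fields may be added.

class MiniSchemaRegistry:
    def __init__(self):
        self._subjects = {}  # subject name -> list of schema versions

    def register(self, subject, schema):
        """Register a schema if it is compatible with the latest version;
        returns the 1-based version number, raises ValueError otherwise."""
        versions = self._subjects.setdefault(subject, [])
        if versions:
            latest = versions[-1]
            for field, ftype in latest.items():
                if schema.get(field, ftype) != ftype:
                    raise ValueError(f"incompatible change to field {field!r}")
        versions.append(schema)
        return len(versions)

registry = MiniSchemaRegistry()
v1 = registry.register("clicks-value", {"user_id": "string", "ts": "long"})
# Adding a field is allowed; changing "ts" to "string" would raise.
v2 = registry.register("clicks-value",
                       {"user_id": "string", "ts": "long", "url": "string"})
print(v1, v2)  # 1 2
```

A catalog in the Glue sense would add connection details (paths, formats, credentials) alongside the schemas, but the versioning-plus-compatibility core is the same.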
Note that the Schema Registry and Kafka Connect are generic tools, not Spark-specific.

Regards,
Simon

> On 08.07.2017 at 19:49, Benjamin Kim <bbuil...@gmail.com> wrote:
>
> Has anyone seen AWS Glue? I was wondering if something similar is going
> to be built into Spark Structured Streaming. I like the Data Catalog idea:
> it stores and tracks any data source/destination, profiles the data to
> derive the schema and data types, and performs a degree of automated
> schema evolution when or if the schema changes, leaving only the
> transformation logic to the ETL developer. I think some of this could
> enhance or simplify Structured Streaming. For example, AWS S3 could be
> catalogued as a Data Source; in Structured Streaming, the input DataFrame
> would be created like a SQL view based on the S3 Data Source; the
> transform logic, if any, would just manipulate the data going from the
> input DataFrame to the result DataFrame, which is another view based on a
> catalogued Data Destination. This would relieve the ETL developer from
> caring about any Data Source or Destination: all server information,
> access credentials, data schemas, directory structures, file formats, and
> any other properties could be securely stored away, accessible to only a
> select few.
>
> I'm just curious whether anyone has thought the same thing.
>
> Cheers,
> Ben
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org