Sounds similar to Confluent's Schema Registry and Kafka Connect.

The Schema Registry and Kafka Connect themselves are open source, but some of 
the data-source-specific connectors, and the GUIs to manage it all, are not 
(see the Confluent Enterprise Edition).

Note that the Schema Registry and Kafka Connect are generic tools, not 
Spark-specific.
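
To make the analogy concrete, here is a minimal sketch of a Structured 
Streaming job consuming Avro records whose schema lives in the Schema 
Registry. This assumes Spark 3.x with the spark-avro, spark-sql-kafka-0-10, 
and Confluent schema-registry-client packages on the classpath; the broker 
address, registry URL, topic, and subject names are made up:

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.expr

object SchemaRegistrySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("registry-sketch").getOrCreate()

    // Fetch the latest value schema for the topic from the registry.
    val registry = new CachedSchemaRegistryClient("http://registry:8081", 100)
    val avroSchema = registry.getLatestSchemaMetadata("events-value").getSchema

    val records = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      // Confluent's wire format prepends 5 bytes (a magic byte plus the
      // schema id) to each message; skip them before decoding the Avro.
      .select(from_avro(expr("substring(value, 6, length(value) - 5)"),
                        avroSchema).as("record"))

    records.writeStream.format("console").start().awaitTermination()
  }
}

Note the schema is fetched once on the driver, so this sketch does not handle 
schemas evolving mid-stream, which is where Glue-style automation would help.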

Regards, Simon

> On 08.07.2017 at 19:49, Benjamin Kim <bbuil...@gmail.com> wrote:
> 
> Has anyone seen AWS Glue? I was wondering whether something similar is going 
> to be built into Spark Structured Streaming. I like the Data Catalog idea for 
> storing and tracking any data source/destination. It profiles the data to 
> derive the schema and data types, and it even performs a sort of automated 
> schema evolution when the schema changes. That leaves only the transformation 
> logic to the ETL developer. I think some of this could enhance or simplify 
> Structured Streaming. For example, AWS S3 could be catalogued as a Data 
> Source; in Structured Streaming, the Input DataFrame would be created like a 
> SQL view based on the S3 Data Source; lastly, the Transform logic, if any, 
> would just manipulate the data going from the Input DataFrame to the Result 
> DataFrame, which is another view based on a catalogued Data Destination. This 
> would relieve the ETL developer from having to care about any Data Source or 
> Destination. All server information, access credentials, data schemas, 
> directory structures, file formats, and any other properties could be stored 
> away securely, accessible to only a select few.
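
For what it's worth, that flow maps fairly directly onto today's Structured 
Streaming APIs. Here is a minimal sketch; the lookup() function is a 
hypothetical stand-in for a Glue-style catalog, and the bucket paths, formats, 
and schema are all invented:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

object CatalogDrivenEtl {

  // Hypothetical catalog record; a real system would query a catalog
  // service here instead of returning hard-coded entries.
  case class CatalogEntry(path: String, format: String,
                          schema: Option[StructType] = None)

  def lookup(name: String): CatalogEntry = name match {
    case "raw_events" =>
      CatalogEntry("s3a://bucket/raw/", "json",
        Some(StructType(Seq(
          StructField("user", StringType),
          StructField("ts", TimestampType)))))
    case "clean_events" =>
      CatalogEntry("s3a://bucket/clean/", "parquet")
    case other =>
      sys.error(s"unknown catalog entry: $other")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("catalog-etl").getOrCreate()

    val source = lookup("raw_events")
    val sink = lookup("clean_events")

    // Input DataFrame: a streaming "view" over the catalogued source.
    // Streaming file sources require an explicit schema, which here
    // comes from the catalog instead of the developer.
    val input = spark.readStream
      .schema(source.schema.get)
      .format(source.format)
      .load(source.path)

    // The only part the ETL developer writes: the transform.
    val result = input.filter("user IS NOT NULL")

    // Result DataFrame flows to the catalogued destination.
    result.writeStream
      .format(sink.format)
      .option("checkpointLocation", "s3a://bucket/checkpoints/clean/")
      .start(sink.path)
      .awaitTermination()
  }
}

The point being: once the source and sink definitions live in a catalog, the 
transform really is the only piece left to write.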
> 
> I'm just curious to know if anyone has thought the same thing.
> 
> Cheers,
> Ben


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
