Has anyone seen AWS Glue? I was wondering if something similar is going
to be built into Spark Structured Streaming. I like the Data Catalog
idea of storing and tracking every data source/destination: Glue
profiles the data to derive the schema and data types, and it also does
a sort of automated schema evolution if and when the schema changes,
leaving only the transformation logic to the ETL developer. I think
some of this could enhance or simplify Structured Streaming.
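
For comparison, here is a minimal sketch of what Structured Streaming
asks of the developer today: file sources need a hand-written schema up
front unless spark.sql.streaming.schemaInference is enabled. The bucket
path and field names below are made up.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("explicit-schema").getOrCreate()

    // Today the developer hand-writes this; Glue would profile the
    // data and derive it automatically.
    val eventSchema = StructType(Seq(
      StructField("id", LongType),
      StructField("event_time", TimestampType),
      StructField("payload", StringType)))

    val input = spark.readStream
      .schema(eventSchema)              // required for file sources...
      .json("s3a://my-bucket/events/")  // made-up bucket/path

    // ...unless you opt into inference, which samples existing files:
    // spark.conf.set("spark.sql.streaming.schemaInference", "true")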
For example, AWS S3 could be catalogued as a Data Source; in Structured
Streaming, the Input DataFrame would be created like a SQL view based
on that S3 Data Source; and the Transform logic, if any, would just
manipulate the data going from the Input DataFrame to the Result
DataFrame, which is another view, this one based on a catalogued Data
Destination. This would relieve the ETL developer from having to care
about any Data Source or Destination: all server information, access
credentials, data schemas, directory structures, file formats, and any
other properties could be securely stored away, visible to only a
select few.
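
To make the idea concrete, here is a hypothetical sketch of how that
flow might look. DataCatalog, the entry names, and the property keys
are all invented for illustration; nothing like this exists in Spark
today.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Imagined registry: an admin stores format, path, and credentials
    // out of band; only a select few can see or change them.
    object DataCatalog {
      def entry(name: String): Map[String, String] =
        // Imagined lookup; hardcoded here so the sketch compiles.
        Map("format" -> "json",
            "path" -> s"s3a://bucket/$name/",
            "checkpoint" -> s"s3a://bucket/_checkpoints/$name")

      def source(spark: SparkSession, name: String): DataFrame = {
        val props = entry(name)
        spark.readStream
          .format(props("format"))
          .options(props)          // credentials, region, etc.
          .load(props("path"))
      }
    }

    val spark = SparkSession.builder().appName("catalog-driven").getOrCreate()
    import spark.implicits._

    // Input DataFrame from the catalogued S3 source; the developer
    // never sees bucket names or keys.
    val input = DataCatalog.source(spark, "s3_events")

    // Only the transform logic lives in the job.
    val result = input.filter($"payload".isNotNull)

    // The Result DataFrame goes to a catalogued destination the same way.
    val sink = DataCatalog.entry("warehouse_events")
    result.writeStream
      .format(sink("format"))
      .options(sink)
      .option("checkpointLocation", sink("checkpoint"))
      .start(sink("path"))
      .awaitTermination()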

I'm just curious to know if anyone has thought the same thing.

Cheers,
Ben