Sebastian Herold created SPARK-21706: ----------------------------------------
Summary: Support Custom PartitionSpec Provider for Kinesis Firehose or similar
Key: SPARK-21706
URL: https://issues.apache.org/jira/browse/SPARK-21706
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.2.0, 2.1.1, 1.6.3
Reporter: Sebastian Herold

Many people use Kinesis Firehose to ingest data into an S3-based data lake. Kinesis Firehose produces a directory layout like this:

{code}
s3://data-lake-bucket/my-prefix/2017/08/11/10/my-stream-2017-08-11-11-10-10
s3://data-lake-bucket/my-prefix/2017/08/11/11/my-stream-2017-08-11-11-11-10
. . .
s3://data-lake-bucket/my-prefix/2017/08/12/00/my-stream-2017-08-12-00-01-01
{code}

Spark, like Hive, does not support this kind of partitioning: partition discovery only recognises directories named {{key=value}}. It would therefore be great to be able to configure a {{CustomPartitionDiscoverer}} or {{PartitionSpecProvider}} that supplies a custom partition mapping, so that a date range of files can easily be selected afterwards. Sadly, partition discovery is deeply integrated into {{DataSource}}. *Could it be encapsulated more cleanly, so that the default behaviour can be intercepted?*

Another partitioning scheme that I've seen a lot in this context is:

{code}
s3://data-lake-bucket/prefix/2017-08-11/file.1.json
s3://data-lake-bucket/prefix/2017-08-11/file.2.json
. . .
s3://data-lake-bucket/prefix/2017-08-12/file.1.json
{code}
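
To make the requested mapping concrete, here is a minimal sketch in plain Python of what a custom partition mapping for the two layouts above might compute. It is not Spark API: the names {{partition_time}} and {{select_range}} and both regular expressions are hypothetical and assume the layouts look exactly like the examples in this issue.

```python
import re
from datetime import datetime
from typing import Iterable, List, Optional

# Assumed Firehose-style layout from the issue: <prefix>/YYYY/MM/DD/HH/<object>
FIREHOSE_LAYOUT = re.compile(
    r"/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/(?P<hour>\d{2})/[^/]+$"
)
# Assumed flat date layout from the issue: <prefix>/YYYY-MM-DD/<object>
FLAT_DATE_LAYOUT = re.compile(
    r"/(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})/[^/]+$"
)


def partition_time(path: str) -> Optional[datetime]:
    """Map an object path to its partition timestamp (hypothetical helper).

    Returns None when the path matches neither layout or encodes an
    invalid date such as month 13.
    """
    for layout in (FIREHOSE_LAYOUT, FLAT_DATE_LAYOUT):
        match = layout.search(path)
        if match is None:
            continue
        parts = {name: int(value) for name, value in match.groupdict().items()}
        try:
            # The flat layout has no hour directory, so default to hour 0.
            return datetime(
                parts["year"], parts["month"], parts["day"], parts.get("hour", 0)
            )
        except ValueError:
            return None
    return None


def select_range(paths: Iterable[str], start: datetime, end: datetime) -> List[str]:
    """Keep the paths whose partition timestamp lies in [start, end)."""
    selected = []
    for path in paths:
        ts = partition_time(path)
        if ts is not None and start <= ts < end:
            selected.append(path)
    return selected
```

As a workaround, the list returned by {{select_range}} could be passed to a reader that accepts multiple paths, for example {{spark.read.json(selected)}} in PySpark. That only prunes the input files; it does not expose year, month, day or hour as partition columns, which is what a pluggable provider inside {{DataSource}} would add.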