Hi, I looked at the Stackoverflow reference.
The first question that comes to my mind is how you are populating these GCS buckets. Are you shifting data from on-prem and landing it in the buckets, creating a new folder at the given interval? Where will you be running your Spark Structured Streaming - on Dataproc?

HTH

On Mon, 15 Mar 2021 at 19:00, Gowrishankar Sunder <shankar556...@gmail.com> wrote:

> Hi,
>     We have a use case to stream files from GCS time-partitioned folders
> and perform structured streaming queries on top of them. I have detailed
> the use case and requirements in this Stackoverflow question
> <https://stackoverflow.com/questions/66590057/spark-structured-streaming-source-watching-time-partitioned-gcs-partitions>,
> but at a high level the problems I am facing are listed below, and I would
> like guidance on the best approach to use.
>
>    - Custom source APIs for Structured Streaming are undergoing major
>    changes (including the new Table API support), and the documentation
>    does not capture much detail when it comes to building custom sources.
>    I was wondering if the current APIs are expected to remain stable
>    through the targeted 3.2 release, and whether there are examples of how
>    to use them for my use case.
>    - The default FileStream source
>    <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#creating-streaming-dataframes-and-streaming-datasets>
>    looks up a static glob path, which might not scale when the job runs
>    for days across many time partitions. But it has some really useful
>    file-handling features - it supports all the major source formats
>    (Avro, Parquet, JSON, etc.), and it handles compression and splits
>    large files into sub-tasks - all of which I would need to implement
>    again with the custom source APIs as they stand. I was wondering if I
>    can still somehow make use of them while solving the time-partitioned
>    file globbing scaling issue.
>
> Thanks
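
For reference, below is a minimal sketch (Scala) of the default file source running against a static GCS glob, i.e. the approach the second bullet above describes as potentially not scaling across many time partitions. The bucket name, directory layout, schema and options are illustrative assumptions, not details from this thread.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder()
  .appName("gcs-time-partitioned-stream")
  .getOrCreate()

// File sources require an explicit schema when used for streaming.
val schema = new StructType()
  .add("event_id", StringType)
  .add("payload", StringType)

val events = spark.readStream
  .format("parquet")                       // json, csv, orc, text, or avro (with spark-avro) also work
  .schema(schema)
  .option("maxFilesPerTrigger", 1000)      // cap how many new files each micro-batch picks up
  .load("gs://example-bucket/events/dt=*/hh=*/")  // static glob across the time partitions

val query = events.writeStream
  .format("console")
  .option("checkpointLocation", "gs://example-bucket/checkpoints/events")
  .start()

query.awaitTermination()

The cost the original question points at is that this single glob is re-listed on every micro-batch, and that listing is the part that grows as time partitions accumulate over days of running.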