Our online services running in GCP collect data from our clients and write it to GCS under time-partitioned folders such as yyyy/mm/dd/hh/mm (based on the current time) or similar layouts. We need these files to be processed in near real time by Spark. As for the runtime, we plan to run on either Dataproc or K8s.
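For concreteness, here is roughly what consuming those folders with the stock file source looks like - a minimal sketch, where the bucket name, schema and JSON format are placeholders for our actual setup:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

val spark = SparkSession.builder()
  .appName("gcs-time-partitioned-stream")
  .getOrCreate()

// Placeholder schema; the real client payload is richer.
val schema = new StructType()
  .add("clientId", StringType)
  .add("payload", StringType)
  .add("eventTime", TimestampType)

// One glob level per path component (yyyy/mm/dd/hh/mm). The file source
// re-lists everything this glob matches on every micro-batch, which is
// where the scaling concern comes from as partitions accumulate.
val events = spark.readStream
  .format("json")
  .schema(schema)
  .load("gs://our-bucket/events/*/*/*/*/*")

val query = events.writeStream
  .format("console")
  .option("checkpointLocation", "gs://our-bucket/checkpoints/events")
  .start()

query.awaitTermination()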
- Gowrishankar Sunder

On Mon, Mar 15, 2021 at 12:13 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> I looked at the Stack Overflow reference.
>
> The first question that comes to my mind is how you are populating these
> GCS buckets. Are you shifting data from on-prem and landing it in the
> buckets, creating a new folder at each interval?
>
> Where will you be running your Spark Structured Streaming? On Dataproc?
>
> HTH
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Mon, 15 Mar 2021 at 19:00, Gowrishankar Sunder <shankar556...@gmail.com>
> wrote:
>
>> Hi,
>> We have a use case to stream files from GCS time-partitioned folders
>> and perform structured streaming queries on top of them. I have detailed
>> the use case and requirements in this Stack Overflow question
>> <https://stackoverflow.com/questions/66590057/spark-structured-streaming-source-watching-time-partitioned-gcs-partitions>,
>> but at a high level the problems I am facing are listed below, and I
>> would like guidance on the best approach to take.
>>
>> - Custom source APIs for Structured Streaming are undergoing major
>> changes (including the new Table API support), and the documentation
>> does not capture much detail on building custom sources. I am wondering
>> whether the current APIs are expected to remain stable through the
>> targeted 3.2 release, and whether there are examples of how to use them
>> for my use case.
>> - The default FileStream source
>> <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#creating-streaming-dataframes-and-streaming-datasets>
>> watches a static glob path, which might not scale once the job has been
>> running for days across many time partitions. But it has some really
>> useful file-handling features - it supports all major source formats
>> (Avro, Parquet, JSON etc.), and it takes care of decompression and of
>> splitting large files into sub-tasks - all of which I would need to
>> implement again under the custom source APIs as they stand. I am
>> wondering whether I can still make use of it while solving the
>> time-partitioned file-globbing scaling issue.
>>
>> Thanks
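P.S. Regarding the second bullet in my original message quoted above: these are the stock file-source options I have been looking at as a partial mitigation before going down the custom-source route (values are illustrative, and my reading of the docs may be off):

spark.readStream
  .format("json")
  .schema(schema)
  // Cap how many newly discovered files each micro-batch picks up.
  .option("maxFilesPerTrigger", "1000")
  // Process the newest files first when a backlog builds up.
  .option("latestFirst", "true")
  // Ignore files older than this; per the docs this option is ignored
  // when latestFirst is true and maxFilesPerTrigger is set.
  .option("maxFileAge", "2d")
  .load("gs://our-bucket/events/*/*/*/*/*")

These bound the work per micro-batch, but as far as I can tell they do not reduce the cost of listing the glob itself, which is the part that grows with the number of partitions.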