Our online services running in GCP collect data from our clients and write
it to GCS under time-partitioned folders such as yyyy/mm/dd/hh/mm (based on
the current time). We need these files to be processed in real time by
Spark. As for the runtime, we plan to run it on either Dataproc or K8s.
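
Roughly, the kind of streaming read we have in mind - only a sketch, with a
placeholder bucket name, schema and format rather than our actual setup:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("gcs-stream-sketch").getOrCreate()

// Placeholder schema - streaming file sources need an explicit schema.
val eventSchema = new StructType()
  .add("client_id", StringType)
  .add("payload", StringType)
  .add("event_ts", TimestampType)

// Glob over the yyyy/mm/dd/hh/mm layout; "my-bucket" is a placeholder.
val events = spark.readStream
  .schema(eventSchema)
  .option("maxFilesPerTrigger", "1000")   // bound the size of each micro-batch
  .json("gs://my-bucket/events/*/*/*/*/*")

val query = events.writeStream
  .format("console")                      // stand-in sink, just to show the shape
  .option("checkpointLocation", "gs://my-bucket/checkpoints/sketch")
  .start()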

- Gowrishankar Sunder


On Mon, Mar 15, 2021 at 12:13 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

>
> Hi,
>
> I looked at the stackoverflow reference.
>
> The first question that comes to my mind is how you are populating these
> GCS buckets. Are you shifting data from on-prem, landing it in the
> buckets, and creating a new folder at the given interval?
>
> Where will you be running your Spark Structured Streaming job? On Dataproc?
>
> HTH
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 15 Mar 2021 at 19:00, Gowrishankar Sunder <shankar556...@gmail.com>
> wrote:
>
>> Hi,
>>    We have a use case to stream files from GCS time-partitioned folders
>> and run structured streaming queries on top of them. I have detailed the
>> use case and requirements in this Stack Overflow question
>> <https://stackoverflow.com/questions/66590057/spark-structured-streaming-source-watching-time-partitioned-gcs-partitions>,
>> but at a high level the problems I am facing are listed below, and I would
>> like guidance on the best approach to take.
>>
>>    - Custom source APIs for Structured Streaming are undergoing major
>>    changes (including the new Table API support), and the documentation
>>    does not go into much detail on building custom sources. I was
>>    wondering whether the current APIs are expected to remain stable
>>    through the targeted 3.2 release, and whether there are examples of
>>    how to use them for my use case.
>>    - The default FileStream
>>    <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#creating-streaming-dataframes-and-streaming-datasets>
>>    source watches a static glob path, which might not scale when the job
>>    runs for days across many time partitions. But it has some really
>>    useful file-handling features - it supports all the major source
>>    formats (Avro, Parquet, JSON, etc.), and takes care of compression and
>>    splitting large files into sub-tasks - all of which I would need to
>>    implement again with the custom source APIs as they currently stand. I
>>    was wondering whether I can still make use of it while solving the
>>    scaling issue with globbing time-partitioned files.
>>
>> Thanks
>>
>>
>>
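
P.S. On the second bullet in my quoted message above, the built-in file
source options I have been looking at as a partial workaround - again just a
sketch with placeholder paths, and I am not sure they fully remove the
listing cost of an ever-growing glob:

// Reuses the placeholder spark session and eventSchema from the sketch at
// the top of this mail; bucket and paths are placeholders.
val recentEvents = spark.readStream
  .schema(eventSchema)
  .option("latestFirst", "true")       // pick up the newest files first
  .option("maxFilesPerTrigger", "500")
  .option("maxFileAge", "2d")          // ignored when latestFirst and maxFilesPerTrigger are both set
  .option("cleanSource", "archive")    // Spark 3.0+: move fully processed files out of the listing
  .option("sourceArchiveDir", "gs://my-bucket/processed/")
  .json("gs://my-bucket/events/*/*/*/*/*")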
