Hi everyone,

*<first time writing to this mailing list>*

Context: I have events coming into Databricks from an Azure Event Hub in
gzip-compressed form. Currently, I extract the files with a UDF and write
the unzipped data into the silver layer of my Delta Lake with a plain batch
.write. Note that even though the data arrives continuously, I do not use
.writeStream at the moment.

I have a few design-related questions that I hope someone with experience
could help me with!

   1. Is there a better way to extract gzip files than a UDF?
   2. Is Spark Structured Streaming or batch processing with Databricks Jobs
   the better fit? (The pipeline runs once every 3 hours, but the data arrives
   continuously from Event Hub.)
   3. Should I use Auto Loader, or simply stream the data into Databricks
   directly from Event Hubs? (A rough sketch of the Auto Loader variant I have
   in mind is below.)
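
For question 3, this is roughly the Auto Loader variant I have in mind
(untested sketch; the paths, schema/checkpoint locations and table name are
placeholders, and I am assuming the landed files are gzip-compressed JSON
that the reader decompresses based on the .gz extension):

# Incrementally pick up new gzip-compressed JSON files from the landing
# path and append them to the silver Delta table (all names are placeholders).
stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation",
                  "abfss://meta@<storage-account>.dfs.core.windows.net/schemas/events")
          .load("abfss://raw@<storage-account>.dfs.core.windows.net/events/"))

(stream.writeStream
    .format("delta")
    .option("checkpointLocation",
            "abfss://meta@<storage-account>.dfs.core.windows.net/checkpoints/events")
    .trigger(availableNow=True)  # process whatever is new, then stop; scheduled every 3 hours as a Job
    .toTable("silver.events"))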

I am especially curious about the trade-offs and the best way forward. I
don't have massive amounts of data.

Thank you very much in advance!

Best wishes,
Maurizio Vancho Argall
