Thanks for the tips!

I think I figured out what might be causing it. It's the checkpointing to
Microsoft Azure Data Lake Storage (ADLS).

When I use "local checkpointing" it works, but when i use fails when there's
a groupBy in the stream. Weirdly it works when there is no groupBy clause in
the stream.

It's able to create the checkpoint location and the base files on ADLS, but
it can't write any commits. That's why I see the first batch of records and
then it crashes on writing the first commit.
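
For context, a Structured Streaming checkpoint directory normally ends up
looking something like this (the state/ directory only appears for stateful
queries such as a groupBy):

<checkpoint>/metadata
<checkpoint>/offsets/0
<checkpoint>/commits/0
<checkpoint>/state/...

In my case the base files show up, but the commits never do.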

Here's how I build my checkpoint location:

checkpoint_location = "abfss://" + container_name + "@" + storage_account_name \
    + ".dfs.core.windows.net/" + file_name

and my SparkSession:

spark = pyspark.sql.SparkSession.builder \
    .appName("pyspark-stream-read") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.memory", "512m") \
    .config("spark.jars.packages",
            "io.delta:delta-core_2.12:0.7.0,"
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,"
            "org.apache.hadoop:hadoop-azure:3.2.1") \
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.delta.logStore.class",
            "org.apache.spark.sql.delta.storage.AzureLogStore") \
    .getOrCreate()
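
For completeness, auth to the storage account is via the account key set on
the Spark conf (a sketch; storage_account_key is just a placeholder for
however the key gets loaded):

# fs.azure.account.key.<account>.dfs.core.windows.net is the standard
# hadoop-azure property for ABFS account-key auth.
spark.conf.set(
    "fs.azure.account.key." + storage_account_name + ".dfs.core.windows.net",
    storage_account_key)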





