Structured Streaming on GCP Dataproc - java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer

2022-02-01 Thread karan alang
Hello All, I'm running a simple Structured Streaming job on GCP, which reads data from Kafka and prints it to the console. Command: gcloud dataproc jobs submit pyspark /Users/karanalang/Documents/Technology/gcp/DataProc/StructuredStreaming_Kafka_GCP-Batch-feb1.py --cluster dataproc-ss-poc

Re: Structured Streaming - not showing records on console

2022-02-01 Thread karan alang
Hi Mich, thanks. It seems 'complete' mode is supported only if there are streaming aggregations. I get this error on changing the output mode: pyspark.sql.utils.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets; Project
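The rule behind that AnalysisException can be sketched as follows. This is a minimal illustration, not code from the thread: the helper names choose_output_mode and start_console_query are hypothetical, and the streaming DataFrame is assumed to be created elsewhere (for example from the Kafka source). Only standard DataStreamWriter calls (outputMode, format, option, start) are used.

```python
def choose_output_mode(has_streaming_aggregation: bool) -> str:
    """Pick an output mode that Structured Streaming accepts.

    'complete' is rejected when the query has no streaming aggregation,
    so fall back to 'append' (the default mode) in that case.
    Hypothetical helper, for illustration only.
    """
    return "complete" if has_streaming_aggregation else "append"


def start_console_query(streaming_df, has_streaming_aggregation: bool = False):
    """Start a console sink on an existing streaming DataFrame.

    streaming_df is assumed to come from spark.readStream; it is passed in,
    so this sketch does not create a SparkSession itself.
    """
    mode = choose_output_mode(has_streaming_aggregation)
    return (
        streaming_df.writeStream
        .outputMode(mode)             # 'append' for a plain projection
        .format("console")
        .option("truncate", "false")  # print full column values
        .start()
    )
```

For a plain read-and-print query like the one described, start_console_query(df) would use 'append'; pass has_streaming_aggregation=True only when the query contains something like a groupBy().count().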

Re: Structured Streaming - not showing records on console

2022-02-01 Thread Mich Talebzadeh
Hm, I am trying to recall whether I am correct, but you should try outputMode('complete') with format('console'): result = resultMF. \ writeStream. \ outputMode('complete'). \ option("numRows", 1000). \

Structured Streaming - not showing records on console

2022-02-01 Thread karan alang
Hello Spark Experts, I have a simple Structured Streaming program that reads data from Kafka and writes to the console. It works in batch mode (i.e., spark.read or df.write), but not in streaming mode. Details are in the stackoverflow

Re: Code fails when AQE enabled in Spark 3.1

2022-02-01 Thread Sean Owen
At a glance, it doesn't seem so. That is a corner case in two ways - very old dates and using RDDs, at least it seems. I also suspect that individual change is tied to a lot of other date-related changes in 3.2, so it may not be very back-portable. You should pursue updating to 3.2 for many reasons,

Re: A Persisted Spark DataFrame is computed twice

2022-02-01 Thread Gourav Sengupta
Hi, can you please try using Spark SQL instead of DataFrames and see the difference? You will get a lot of theoretical arguments, and that is fine, but they are largely just theories. Also try to apply the function to the result of the filters as a sub-query by caching in the