[pyspark 2.4] broadcasting DataFrame throws error

2020-09-16 Thread Rishi Shah
Hello all, hope this email finds you well. I have an 8 TB DataFrame (Parquet, Snappy-compressed); however, I group it by a column and get a much smaller aggregated DataFrame of 700 rows (just two columns, key and count). When I use it like below to broadcast this aggregated result, it

Re: Is there any good Docker container / compose with spark 2.4+ and YARN 2.8.2+

2020-09-16 Thread Ricardo Martinelli de Oliveira
Ivan, although these docs are Kubernetes-related, they might apply to your use case: https://spark.apache.org/docs/latest/running-on-kubernetes.html#docker-images There is a script in the Spark distribution that can create the image for you; it was added in 2.3. So if you downloaded a spark 2.3+

Re: Spark Kafka Streaming With Transactional Messages

2020-09-16 Thread jianyangusa
I have the same issue. Do you have a solution? Maybe Spark streaming does not support transactional messages. I use a Kafka stream to retrieve the transactional messages. Maybe we can ask for Spark to support this feature.

Structured Streaming Checkpoint Error

2020-09-16 Thread German Schiavon
Hi! I have a Structured Streaming application that reads from Kafka, performs some aggregations, and writes to S3 in Parquet format. Everything seems to work great, except that from time to time I get a checkpoint error. At the beginning I thought it was a random error, but it happened more than 3

Is there any good Docker container / compose with spark 2.4+ and YARN 2.8.2+

2020-09-16 Thread Ivan Petrov
Hi, I'm looking for a ready-to-use Docker container that includes Spark 2.4 or higher and YARN 2.8.2 or higher. I'm looking for a way to submit Spark jobs on YARN.