Is there any good Docker container / compose with Spark 2.4+ and YARN 2.8.2+

2020-09-16 Thread Ivan Petrov
Hi,
looking for a ready-to-use Docker container that includes:
- Spark 2.4 or higher
- YARN 2.8.2 or higher
I'm looking for a way to submit Spark jobs on YARN.
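
For reference, the goal here is the standard spark-submit path against YARN;
a typical invocation from inside such a container (the class and jar names
are placeholders) would be:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  /path/to/my-app.jar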


Structured Streaming Checkpoint Error

2020-09-16 Thread German Schiavon
Hi!

I have a Structured Streaming application that reads from Kafka, performs
some aggregations, and writes to S3 in Parquet format.

Everything seems to work great, except that from time to time I get a
checkpoint error. At first I thought it was a random error, but it has
happened more than three times already in a few days:

Caused by: java.io.FileNotFoundException: No such file or directory:
s3a://xxx/xxx/validation-checkpoint/offsets/.140.11adef9a-7636-4752-9e6c-48d605a9cca5.tmp


Does this happen to anyone else?

Thanks in advance.

*This is the full error:*

ERROR streaming.MicroBatchExecution: Query segmentValidation [id = 14edaddf-25bb-4259-b7a2-6107907f962f, runId = 0a757476-94ec-4a53-960a-91f54ce47110] terminated with error
java.io.FileNotFoundException: No such file or directory: s3a://xxx/xxx//validation-checkpoint/offsets/.140.11adef9a-7636-4752-9e6c-48d605a9cca5.tmp
  at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2310)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2204)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2143)
  at org.apache.hadoop.fs.FileSystem.getFileLinkStatus(FileSystem.java:2664)
  at org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:131)
  at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:726)
  at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:699)
  at org.apache.hadoop.fs.FileContext.rename(FileContext.java:1032)
  at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.renameTempFile(CheckpointFileManager.scala:329)
  at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.close(CheckpointFileManager.scala:147)
  at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.writeBatchToFile(HDFSMetadataLog.scala:134)
  at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.$anonfun$add$3(HDFSMetadataLog.scala:120)
  at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
  at scala.Option.getOrElse(Option.scala:189)
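
For reference, a minimal sketch of the pipeline shape described above (the
broker, topic, bucket, and column names are placeholders, not taken from
the original report):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("segment-validation").getOrCreate()

# Read from Kafka (placeholder broker and topic)
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Some event-time aggregation; the real query is not shown in the report
counts = (events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(window(col("timestamp"), "5 minutes"))
          .count())

# Write Parquet to S3; note the checkpoint also lives on s3a://, which is
# where the rename of the temporary offsets file fails in the trace above
query = (counts.writeStream
         .format("parquet")
         .option("path", "s3a://bucket/output/")
         .option("checkpointLocation", "s3a://bucket/validation-checkpoint/")
         .outputMode("append")
         .start())

The FileContextBasedCheckpointFileManager in the trace commits each batch by
renaming a temporary file, and S3 does not provide atomic rename, so
checkpoints kept directly on s3a:// are known to be fragile; a common
workaround is to keep the checkpoint on HDFS or another real file system.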


Re: Spark Kafka Streaming With Transactional Messages

2020-09-16 Thread jianyangusa
I have the same issue. Do you have a solution? Maybe Spark Streaming does not
support transactional messages; I use Kafka Streams to retrieve them instead.
Maybe we can ask the Spark project to support this feature.
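
One hedged note for anyone hitting this (an assumption, not a confirmed
fix): the Spark 2.4+ Structured Streaming Kafka source passes consumer
properties through with a "kafka." prefix, so transactional topics can be
read with read_committed isolation, along these lines:

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
      .option("subscribe", "tx-topic")                   # placeholder
      # skip records from aborted transactions
      .option("kafka.isolation.level", "read_committed")
      .load())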






Re: Is there any good Docker container / compose with Spark 2.4+ and YARN 2.8.2+

2020-09-16 Thread Ricardo Martinelli de Oliveira
Ivan,

Although these are Kubernetes-related docs, they might apply to your use case:

https://spark.apache.org/docs/latest/running-on-kubernetes.html#docker-images

There is a script in the Spark distribution that can create the image for
you; it was added in 2.3, so if you downloaded a Spark 2.3+ distribution you
will find it.
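
For reference, the script in question should be bin/docker-image-tool.sh at
the root of the distribution; a typical invocation (the repository name and
tag here are placeholders) looks like:

./bin/docker-image-tool.sh -r <repo> -t my-tag build
./bin/docker-image-tool.sh -r <repo> -t my-tag push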

On Wed, Sep 16, 2020 at 7:55 AM Ivan Petrov  wrote:

> Hi,
> looking for a ready-to-use Docker container that includes:
> - Spark 2.4 or higher
> - YARN 2.8.2 or higher
> I'm looking for a way to submit Spark jobs on YARN.
>


-- 
Ricardo Martinelli De Oliveira
Data Engineer, AI CoE
Red Hat Brazil
Av. Brigadeiro Faria Lima, 3900, 8th floor
rmart...@redhat.com
T: +551135426125
M: +5511970696531




[pyspark 2.4] broadcasting DataFrame throws error

2020-09-16 Thread Rishi Shah
Hello All,

Hope this email finds you well. I have a DataFrame of size 8 TB (Parquet,
Snappy compressed). When I group it by a column, I get a much smaller
aggregated DataFrame of only 700 rows (just two columns, key and count).
When I use it as below to broadcast this aggregated result, it throws an
error saying the DataFrame cannot be broadcast.

from pyspark.sql.functions import broadcast  # needed for broadcast()

# aggregate down to ~700 rows (key, count) and cache the result
df_agg = df.groupBy('column1').count().cache()
# df_agg.count()  # optional: materialize the cache before the join
df_join = df.join(broadcast(df_agg), 'column1', 'left_outer')
df_join.write.parquet('PATH')

The same code works without any modifications when the input df is 3 TB.

Any suggestions?

-- 
Regards,

Rishi Shah