Re: Better way to debug serializable issues

2020-02-18 Thread Maxim Gekk
Hi Ruijing, Spark uses SerializationDebugger (https://spark.apache.org/docs/latest/api/java/org/apache/spark/serializer/SerializationDebugger.html) as the default debugger for detecting serialization issues. You can get more detailed serialization exception information by setting the following
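The archive truncates the message before the actual setting, so as a hedged sketch: one commonly used option is the JVM flag `sun.io.serialization.extendedDebugInfo`, passed through Spark's `extraJavaOptions` configs (Spark also has an internal `spark.serializer.extraDebugInfo` flag, enabled by default, that controls SerializationDebugger). This is an assumption about which setting the reply refers to, not a quote from it:

```scala
import org.apache.spark.SparkConf

// Sketch: enable extended JVM serialization debug info on both driver and
// executors. The flag makes NotSerializableException stack traces include
// the object graph path that led to the non-serializable field.
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions",
       "-Dsun.io.serialization.extendedDebugInfo=true")
  .set("spark.executor.extraJavaOptions",
       "-Dsun.io.serialization.extendedDebugInfo=true")
```

The same flags can be passed via `--conf` on `spark-submit` instead of in code.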

Better way to debug serializable issues

2020-02-18 Thread Ruijing Li
Hi all, When working with Spark jobs, I sometimes have to tackle serialization issues, and I have a difficult time fixing them. A lot of the time, the serialization issues happen only in cluster mode, across the network in a Mesos container, so I can't debug locally. And the exception
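One way to surface these failures locally, before a job ever reaches the cluster, is to run the same Java serialization that Spark's closure shipping uses on the suspect object. This helper is hypothetical (not from the thread), but it is plain JDK serialization and runs without a cluster:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical local check: attempt Java serialization of an object the way
// Spark would when shipping a closure, so NotSerializableException surfaces
// on the developer's machine instead of inside a remote container.
object SerializationCheck {
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val oos = new ObjectOutputStream(new ByteArrayOutputStream())
      try oos.writeObject(obj) finally oos.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

For example, `SerializationCheck.isSerializable("abc")` is `true` (String implements `Serializable`), while a plain `new Object` fails the check.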

Re: Questions about count() performance with dataframes and parquet files

2020-02-18 Thread Nicolas PARIS
> either materialize the Dataframe on HDFS (e.g. parquet or checkpoint) I wonder if avro is a better candidate for this; because it is row-oriented, it should be faster to write/read for such a task. I had never heard about checkpoint. Enrico Minack writes: > It is not about very large or small, it is
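The two materialization options discussed above (writing to a file format versus checkpointing) can be sketched as follows. This assumes a running SparkSession, a writable path (the paths here are placeholders), and, for Avro, the external spark-avro package on the classpath:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: compares the materialization strategies from the thread.
object MaterializeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("materialize").getOrCreate()
    // checkpoint() requires a checkpoint directory on a reliable filesystem.
    spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

    val df = spark.range(1000000L).toDF("id")

    // Option 1: materialize as Parquet (columnar).
    df.write.mode("overwrite").parquet("/tmp/df.parquet")

    // Option 2: materialize as Avro (row-oriented), which may be cheaper to
    // write when the data is re-read once in full rather than column-pruned.
    df.write.mode("overwrite").format("avro").save("/tmp/df.avro")

    // Option 3: checkpoint, which truncates the lineage without the caller
    // managing paths or formats (eager by default).
    val materialized = df.checkpoint()

    println(materialized.count())
    spark.stop()
  }
}
```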