PySpark Streaming “PicklingError: Could not serialize object” when using the transform operator with checkpointing enabled

2019-05-23 Thread Xilang Yan
In PySpark Streaming, if checkpointing is enabled and a stream.transform operator is used to join with another RDD, “PicklingError: Could not serialize object” is thrown. I have asked the same question on Stack Overflow:

Re: Spark SQL met "Block broadcast_xxx not found"

2019-05-07 Thread Xilang Yan
OK, I am sure this is a bug in Spark. I found the offending code, and that code was removed in 2.2.3, so I just upgraded Spark to fix the problem.

Spark SQL met "Block broadcast_xxx not found"

2019-04-27 Thread Xilang Yan
We hit a broadcast issue in some of our applications, but not on every run; it usually goes away when we rerun the application. In the exception log, I see the following two types of exception: Exception 1: 10:09:20.295 [shuffle-server-6-2] ERROR org.apache.spark.network.server.TransportRequestHandler -

Limit the block size of data received by Spark Streaming receiver

2018-01-07 Thread Xilang Yan
Hey, we use a custom receiver to receive data from our MQ. We used to call def store(dataItem: T) to store data, but I found that the block size varies widely, from 0.5K to 5M, so the processing time per data partition varies a lot as well. Shuffle is an option, but I want to avoid it. I
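
One way to think about bounding block size without a shuffle is to group incoming messages into size-capped batches before handing them to the receiver. The following is only a minimal sketch in plain Python, not the poster's code: `bounded_batches` and `max_bytes` are hypothetical names, and messages are assumed to arrive as `bytes`. It helps only if each batch is then stored as a single block, for example through the `store(dataBuffer: ArrayBuffer[T])` overload of the Scala Receiver API rather than the per-item `store(dataItem: T)`.

```python
from typing import Iterable, Iterator, List


def bounded_batches(items: Iterable[bytes], max_bytes: int) -> Iterator[List[bytes]]:
    """Group items into batches whose total payload stays at or under max_bytes.

    Hypothetical helper for illustration. An item that is itself larger
    than max_bytes is emitted alone in its own batch.
    """
    batch: List[bytes] = []
    size = 0
    for item in items:
        # Flush first if adding this item would push the batch over the cap.
        if batch and size + len(item) > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(item)
        size += len(item)
    # Emit whatever is left once the input is exhausted.
    if batch:
        yield batch


if __name__ == "__main__":
    # Made-up message sizes, used only to exercise the helper.
    messages = [b"a" * 400, b"b" * 400, b"c" * 400, b"d" * 2000, b"e" * 100]
    for block in bounded_batches(messages, max_bytes=1000):
        print(len(block), "items,", sum(len(m) for m in block), "bytes")
```

The cap is on payload bytes only; serialized block size in Spark may differ, so the threshold would need tuning against real data.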