Re: [DISCUSS] Spark Columnar Processing

2019-04-02 Thread Renjie Liu
data in each batch as >>> InternalRows. >>> >>> >>> Instead, we propose a new set of APIs to work on an >>> RDD[InternalColumnarBatch] instead of abusing type erasure. With this we >>> propose adding in a Rule similar to how WholeStageCodeGen currently works. >>> Each part of the physical SparkPlan would expose columnar support through a >>> combination of traits and method calls. The rule would then decide when >>> columnar processing would start and when it would end. Switching between >>> columnar and row based processing is not free, so the rule would make a >>> decision based off of an estimate of the cost to do the transformation and >>> the estimated speedup in processing time. >>> >>> >>> This should allow us to disable columnar support by simply disabling the >>> rule that modifies the physical SparkPlan. It should be minimal risk to >>> the existing row-based code path, as that code should not be touched, and >>> in many cases could be reused to implement the columnar version. This also >>> allows for small easily manageable patches. No huge patches that no one >>> wants to review. >>> >>> >>> As far as the memory layout is concerned OnHeapColumnVector and >>> OffHeapColumnVector are already really close to being Apache Arrow >>> compatible so shifting them over would be a relatively simple change. >>> Alternatively we could add in a new implementation that is Arrow compatible >>> if there are reasons to keep the old ones. >>> >>> >>> Again this is just to get the discussion started, any feedback is >>> welcome, and we will file a SPIP on it once we feel like the major changes >>> we are proposing are acceptable. >>> >>> Thanks, >>> >>> Bobby Evans >>> >>> >> -- Renjie Liu Software Engineer, MVAD

Guaranteed processing orders of each batch in Spark Streaming

2015-10-19 Thread Renjie Liu
Hi, all: I've read source code and it seems that there is no guarantee that the order of processing of each RDD is guaranteed since jobs are just submitted to a thread pool. I believe that this is quite important in streaming since updates should be ordered.