>>> data in each batch as
>>> InternalRows.
>>>
>>>
>>> Instead, we propose a new set of APIs to work on an
>>> RDD[InternalColumnarBatch] instead of abusing type erasure. With this, we
>>> propose adding a Rule similar to how WholeStageCodegen currently works.
>>> Each part of the physical SparkPlan would expose columnar support through
>>> a combination of traits and method calls. The rule would then decide
>>> where columnar processing starts and where it ends. Switching between
>>> columnar and row-based processing is not free, so the rule would decide
>>> based on an estimate of the cost of the transformation and the estimated
>>> speedup in processing time.
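>>>
>>> To make that concrete, here is a minimal sketch of the shape we have in
>>> mind. Every name in it (ColumnarSupport, InsertColumnarTransitions, the
>>> cost helpers) is a placeholder for discussion, not a final API:
>>>
>>> import org.apache.spark.rdd.RDD
>>> import org.apache.spark.sql.catalyst.rules.Rule
>>> import org.apache.spark.sql.execution.SparkPlan
>>> import org.apache.spark.sql.vectorized.ColumnarBatch
>>>
>>> // Hypothetical mix-in that an operator uses to advertise columnar
>>> // support.
>>> trait ColumnarSupport { self: SparkPlan =>
>>>   // True when this operator can produce ColumnarBatches directly.
>>>   def supportsColumnar: Boolean = false
>>>   // Columnar counterpart of doExecute(), called only when
>>>   // supportsColumnar is true.
>>>   def executeColumnar(): RDD[ColumnarBatch]
>>> }
>>>
>>> // Hypothetical physical-plan rule, inserted the same way the
>>> // WholeStageCodegen rule is today. The cost model and the actual
>>> // row/column transition operators are left as stubs.
>>> case class InsertColumnarTransitions() extends Rule[SparkPlan] {
>>>   private def transitionCost(p: SparkPlan): Double = ???   // stub
>>>   private def columnarSpeedup(p: SparkPlan): Double = ???  // stub
>>>
>>>   override def apply(plan: SparkPlan): SparkPlan = plan.transformUp {
>>>     case p: SparkPlan with ColumnarSupport
>>>         if p.supportsColumnar && columnarSpeedup(p) > transitionCost(p) =>
>>>       // A real rule would wrap p's children in a row-to-columnar
>>>       // transition and p itself in a columnar-to-row transition.
>>>       p
>>>   }
>>> }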
>>>
>>>
>>> This should allow us to disable columnar support by simply disabling the
>>> rule that modifies the physical SparkPlan, as sketched below. It should
>>> pose minimal risk to the existing row-based code path: that code would
>>> not be touched, and in many cases could be reused to implement the
>>> columnar version. This also keeps the patches small and easily
>>> manageable, rather than huge patches that no one wants to review.
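>>>
>>> For example, the rule could be gated on a single config flag (the key
>>> below is a placeholder, not an existing setting); when the flag is off,
>>> the plan is returned untouched and execution follows exactly the
>>> row-based path we have today:
>>>
>>> import org.apache.spark.sql.catalyst.rules.Rule
>>> import org.apache.spark.sql.execution.SparkPlan
>>> import org.apache.spark.sql.internal.SQLConf
>>>
>>> case class InsertColumnarTransitions() extends Rule[SparkPlan] {
>>>   override def apply(plan: SparkPlan): SparkPlan =
>>>     if (SQLConf.get
>>>         .getConfString("spark.sql.columnar.enabled", "true").toBoolean) {
>>>       insertTransitions(plan)  // the rewrite sketched earlier
>>>     } else {
>>>       plan  // rule disabled: the existing row-based plan is untouched
>>>     }
>>>
>>>   // Stub standing in for the transition-insertion logic sketched above.
>>>   private def insertTransitions(plan: SparkPlan): SparkPlan = ???
>>> }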
>>>
>>>
>>> As far as the memory layout is concerned, OnHeapColumnVector and
>>> OffHeapColumnVector are already very close to being Apache Arrow
>>> compatible, so shifting them over would be a relatively simple change.
>>> Alternatively, we could add a new Arrow-compatible implementation if
>>> there are reasons to keep the old ones.
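>>>
>>> In fact, Spark already ships an ArrowColumnVector that exposes Arrow
>>> memory behind the same ColumnVector interface that OnHeapColumnVector
>>> and OffHeapColumnVector implement, which shows how close the layouts
>>> already are:
>>>
>>> import org.apache.arrow.memory.RootAllocator
>>> import org.apache.arrow.vector.IntVector
>>> import org.apache.spark.sql.vectorized.{ArrowColumnVector,
>>>   ColumnVector, ColumnarBatch}
>>>
>>> // Build a small Arrow vector and hand it to Spark as a column of a
>>> // ColumnarBatch; the underlying Arrow buffers are wrapped, not copied.
>>> val allocator = new RootAllocator(Long.MaxValue)
>>> val arrowInts = new IntVector("i", allocator)
>>> arrowInts.allocateNew(3)
>>> (0 until 3).foreach(i => arrowInts.setSafe(i, i * 10))
>>> arrowInts.setValueCount(3)
>>>
>>> val cols: Array[ColumnVector] = Array(new ArrowColumnVector(arrowInts))
>>> val batch = new ColumnarBatch(cols)
>>> batch.setNumRows(3)
>>> assert(batch.column(0).getInt(1) == 10)
>>> batch.close()  // releases the wrapped Arrow buffers
>>>
>>> ArrowColumnVector is read-only, though, so a writable Arrow-backed
>>> vector would still be new work.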
>>>
>>>
>>> Again, this is just to get the discussion started; any feedback is
>>> welcome, and we will file a SPIP once we feel the major changes we are
>>> proposing are acceptable.
>>>
>>> Thanks,
>>>
>>> Bobby Evans
>>>
>>>
--
Renjie Liu
Software Engineer, MVAD

Hi all,
I've read the source code, and it seems there is no guarantee of the order
in which each RDD is processed, since jobs are just submitted to a thread
pool. I believe this is quite important in streaming, since updates should
be ordered.