>>>>>>> We split it this way because we thought it would be simplest to
>>>>>>> implement, and because it would provide a benefit to more than just
>>>>>>> GPU accelerated queries.
>> Let me describe the current structure and remaining issues. This is
>> orthogonal to the cost-benefit trade-off discussion.
>>
>> The code generation basically consists of three parts.
>> 1. Loading
>> 2. Selection
>> 3. Projection
>> 1. Loading uses the ColumnVector (
>> https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java
>> ) class. By combining it with ColumnarBatchScan, the whole-stage code
>> generation generates code that reads data directly from the columnar
>> storage if there is no row-based operation.
>> Note: the current master does not support Arrow as a data source.
>> However, I think it is not technically hard to support Arrow.
>>
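To make item 1 concrete, here is a minimal sketch (not from the thread;
it assumes the Spark 2.4-era API of the classes named above, and the column
layout and values are invented) that builds a ColumnarBatch from a
ColumnVector and reads it back through the row iterator that row-based
operators use:

  import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
  import org.apache.spark.sql.types.IntegerType
  import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

  object ColumnVectorDemo {
    def main(args: Array[String]): Unit = {
      val capacity = 4
      // A single int column backed by on-heap arrays.
      val col = new OnHeapColumnVector(capacity, IntegerType)
      (0 until capacity).foreach(i => col.putInt(i, i * 10))

      // A ColumnarBatch is the unit that code generated via
      // ColumnarBatchScan iterates over.
      val batch = new ColumnarBatch(Array[ColumnVector](col))
      batch.setNumRows(capacity)

      // Row-based operators still see the batch row by row.
      val it = batch.rowIterator()
      while (it.hasNext) println(it.next().getInt(0))
      batch.close()
    }
  }

(For Arrow, org.apache.spark.sql.vectorized.ArrowColumnVector wraps an Arrow
ValueVector behind the same ColumnVector interface.)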
>> 2. The current whole-stage codegen generates code for element-wise
>> selection (excluding sort and join). The SIMDization or GPUization
>> capability depends on a compiler that translates the code generated by
>> the whole-stage codegen into native code.
>>
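As an illustration of the loop shape involved (hand-written here, not actual
generated code), an element-wise filter over a batch boils down to something
like the following, which a sufficiently smart JIT or native compiler could
SIMDize:

  import org.apache.spark.sql.vectorized.ColumnarBatch

  // Hand-written analogue of a whole-stage-codegen loop for `WHERE x > 5`:
  // no per-row virtual dispatch, just a tight pass over the column.
  def countMatching(batch: ColumnarBatch): Int = {
    val col = batch.column(0)   // assumes column 0 is an int column
    val n = batch.numRows()
    var i = 0
    var matches = 0
    while (i < n) {
      if (col.getInt(i) > 5) matches += 1
      i += 1
    }
    matches
  }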
>> 3. The current Projection assumes that data is stored row-oriented; I
>> think that is the part that Wenchen pointed out.
>>
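A small sketch of what "row-oriented" means here, using Spark's internal
catalyst classes (the schema is invented for illustration): an
UnsafeProjection consumes one InternalRow at a time and emits one UnsafeRow
at a time, so columnar input has to be rebuilt into rows before this step:

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
  import org.apache.spark.sql.types._

  val schema = StructType(Seq(
    StructField("a", IntegerType),
    StructField("b", LongType)))
  // Projection works strictly row by row.
  val proj = UnsafeProjection.create(schema)
  val row = proj(InternalRow(1, 2L))
  println(s"${row.getInt(0)}, ${row.getLong(1)}")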
>> My slides
>> https://www.slideshare.net/ishizaki/making-hardware-accelerator-easier-to-use/41
>> may help. I will give a presentation about in-memory data storage for
>> Spark at SAIS 2019 (
>> https://databricks.com/sparkaisummit/north-america/sessions-single-2019?id=40
>> ) :)
>>
>> Kazuaki Ishizaki
>>
> From: Wenchen Fan
> To: Bobby Evans
> Cc: Spark dev list
> Date: 2019/03/26 13:53
> Subject: Re: [DISCUSS] Spark Columnar Processing
>
> Do you have some initial perf numbers? It seems fine to me to remain
> row-based inside Spark with whole-stage-codegen, and convert rows to
> columnar batches when communicating with external systems.
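A sketch of the conversion boundary Wenchen describes (not from the thread;
written for a single int column, whereas a real version would be
schema-driven): rows are buffered into writable column vectors and handed
off as a ColumnarBatch:

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
  import org.apache.spark.sql.types.IntegerType
  import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

  def rowsToBatch(rows: Iterator[InternalRow], capacity: Int): ColumnarBatch = {
    val col = new OnHeapColumnVector(capacity, IntegerType)
    var n = 0
    // Copy each row's field into the column vector until the batch is full.
    while (rows.hasNext && n < capacity) {
      col.putInt(n, rows.next().getInt(0))
      n += 1
    }
    val batch = new ColumnarBatch(Array[ColumnVector](col))
    batch.setNumRows(n)
    batch
  }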
Reynold,

From our experiments, it is not a massive refactoring of the code. Most
expressions can be supported by a relatively small change while leaving the
existing code path untouched. We didn't try to do columnar with code
generation, but I suspect it would be similar, although the code generation ...
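The "relatively small change" could look roughly like this (the names are
illustrative, not Spark's actual API): each supported expression gains a
columnar eval that consumes and produces whole column vectors, while the
existing row-based eval stays as it is:

  import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
  import org.apache.spark.sql.types.IntegerType
  import org.apache.spark.sql.vectorized.ColumnVector

  // Hypothetical columnar counterpart of Add.eval(row) for int inputs.
  def columnarAdd(left: ColumnVector, right: ColumnVector,
                  numRows: Int): ColumnVector = {
    val out = new OnHeapColumnVector(numRows, IntegerType)
    var i = 0
    while (i < numRows) {
      out.putInt(i, left.getInt(i) + right.getInt(i))
      i += 1
    }
    out
  }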
26% improvement is underwhelming if it requires massive refactoring of the
codebase. Also, you can't just add the benefits up this way, because:
- Both vectorization and codegen reduce the overhead of virtual function calls
- Vectorization code is more friendly to compilers / CPUs, but requires ...
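To spell out the first point (plain Scala for illustration, not Spark code):
an interpreted, row-at-a-time evaluator pays a virtual call per row, while
both codegen and vectorization amortize or eliminate that dispatch:

  trait RowExpr   { def eval(row: Array[Int]): Int }        // virtual call per row
  trait BatchExpr { def eval(col: Array[Int]): Array[Int] } // one call per batch

  object AddOneBatched extends BatchExpr {
    def eval(col: Array[Int]): Array[Int] = {
      val out = new Array[Int](col.length)
      var i = 0
      // Tight, branch-free loop that compilers can unroll or SIMDize.
      while (i < col.length) { out(i) = col(i) + 1; i += 1 }
      out
    }
  }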
Cloudera reports a 26% improvement in Hive query runtimes from enabling
vectorization. I would expect to see similar improvements, but at the cost
of keeping more data in memory. Remember that this also enables a number of
different hardware acceleration techniques. If the data format is Arrow
compatible ...
On Mon, Mar 25, 2019 at 1:05 PM Bobby Evans wrote:
> This thread is to discuss adding in support for data frame processing using
> an in-memory columnar format compatible with Apache Arrow. My main goal in
> this is to lay the groundwork so we can add in support for GPU accelerated
> processing of data frames, but this feature has a number of other
> benefits ...