Re: [DISCUSS] Spark Columnar Processing

2019-04-13 Thread Bobby Evans
…generates code for element-wise selection (excluding sort and join). The SIMDization or GPUization capability depends on a compiler that translates the code generated by whole-stage codegen into native code. 3. The current Projection assumes row-oriented data storage, …

Re: [DISCUSS] Spark Columnar Processing

2019-04-11 Thread Reynold Xin
t;>>>> >>>>>>> We split it this way because we thought it would be simplest to >>>>>>> implement, >>>>>>> and because it would provide a benefit to more than just GPU accelerated >>>>>>> queries. >>&

Re: [DISCUSS] Spark Columnar Processing

2019-04-11 Thread Bobby Evans
…the current structure and remaining issues. This is orthogonal to the cost-benefit trade-off discussion. The code generation basically consists of three parts: 1. Loading 2. Selection …

Re: [DISCUSS] Spark Columnar Processing

2019-04-05 Thread Bobby Evans
…ColumnVector (https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java) class. By combining it with ColumnarBatchScan, whole-stage code generation generates code…
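The accessor pattern that the ColumnVector class exposes can be sketched in plain Java. This is a simplified, hypothetical mirror (our own `IntColumn` type, not Spark's actual API): dense columnar storage plus a validity vector, read through per-row typed getters.

```java
// Hypothetical, simplified mirror of the getter pattern that
// Spark's ColumnVector exposes: per-type accessors addressed by
// row ordinal over dense columnar storage plus a validity vector.
final class IntColumn {
    private final int[] values;
    private final boolean[] isNull;

    IntColumn(int[] values, boolean[] isNull) {
        this.values = values;
        this.isNull = isNull;
    }

    boolean isNullAt(int rowId) { return isNull[rowId]; }
    int getInt(int rowId)       { return values[rowId]; }
}

public class ColumnVectorSketch {
    public static void main(String[] args) {
        IntColumn col = new IntColumn(new int[] {10, 0, 30},
                                      new boolean[] {false, true, false});
        long sum = 0;
        for (int row = 0; row < 3; row++) {
            if (!col.isNullAt(row)) sum += col.getInt(row); // skip nulls
        }
        System.out.println(sum); // prints 40
    }
}
```

Because consumers go through `isNullAt`/`getInt` rather than touching the storage directly, the same interface can back on-heap arrays, off-heap buffers, or Arrow memory.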

Re: [DISCUSS] Spark Columnar Processing

2019-04-03 Thread Bobby Evans
…storage if there is no row-based operation. Note: the current master does not support Arrow as a data source. However, I think it is not technically hard to support Arrow. 2. The current whole-stage codegen generates…

Re: [DISCUSS] Spark Columnar Processing

2019-04-02 Thread Renjie Liu
…2. The current whole-stage codegen generates code for element-wise selection (excluding sort and join). The SIMDization or GPUization capability depends on a compiler that translates the code generated by whole-stage codegen into native code. …
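To illustrate what "element-wise selection" means for SIMD, here is a hedged sketch of the kind of tight loop whole-stage codegen aims to produce for a filter, written branch-free so a compiler's auto-vectorizer has a chance to apply SIMD. All names here are illustrative, not Spark's.

```java
// Illustrative only: a selection-vector loop of the shape that
// element-wise filter codegen produces. The branch-free append
// keeps the loop body SIMD-friendly for an auto-vectorizer.
public class SelectionSketch {
    // Writes the row ids whose value exceeds `threshold` into `sel`
    // and returns the number of selected rows.
    static int select(int[] values, int threshold, int[] sel) {
        int n = 0;
        for (int i = 0; i < values.length; i++) {
            sel[n] = i;                           // speculative write
            n += (values[i] > threshold) ? 1 : 0; // branch-free advance
        }
        return n;
    }

    public static void main(String[] args) {
        int[] v = {5, 12, 7, 20};
        int[] sel = new int[v.length];
        int n = select(v, 10, sel);
        System.out.println(n);                     // prints 2
        System.out.println(sel[0] + "," + sel[1]); // prints 1,3
    }
}
```

Sorts and joins are excluded from this pattern precisely because they are not element-wise: their output position depends on other rows, so they cannot be expressed as one pass of independent per-element operations.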

Re: [DISCUSS] Spark Columnar Processing

2019-04-02 Thread Bobby Evans
…stores row-oriented data, which I think is the part that Wenchen pointed out. My slides: https://www.slideshare.net/ishizaki/making-hardware-accelerator-easier-to-use/41 …

Re: [DISCUSS] Spark Columnar Processing

2019-04-01 Thread Reynold Xin
…give a presentation about in-memory data storage for Spark at SAIS 2019: https://databricks.com/sparkaisummit/north-america/sessions-single-2019?id=40 :) …

Re: [DISCUSS] Spark Columnar Processing

2019-03-27 Thread Bobby Evans
…:) Kazuaki Ishizaki. (From: Wenchen Fan, To: Bobby Evans, Cc: Spark dev list, Date: 2019/03/26 13:53, Subject: Re: [DISCUSS] Spark Columnar Processing) Do you…

Re: [DISCUSS] Spark Columnar Processing

2019-03-26 Thread Kazuaki Ishizaki
…list, Date: 2019/03/26 13:53, Subject: Re: [DISCUSS] Spark Columnar Processing. Do you have some initial perf numbers? It seems fine to me to remain row-based inside Spark with whole-stage codegen, and convert rows to columnar batches when communicating with external systems. On Mon, Mar…

Re: [DISCUSS] Spark Columnar Processing

2019-03-26 Thread Bobby Evans
Reynold, From our experiments, it is not a massive refactoring of the code. Most expressions can be supported by a relatively small change while leaving the existing code path untouched. We didn't try to do columnar with code generation, but I suspect it would be similar, although the code gene…

Re: [DISCUSS] Spark Columnar Processing

2019-03-26 Thread Reynold Xin
A 26% improvement is underwhelming if it requires massive refactoring of the codebase. Also, you can't just add the benefits up this way, because: - Both vectorization and codegen reduce the overhead of virtual function calls - Vectorized code is more friendly to compilers / CPUs, but requires…

Re: [DISCUSS] Spark Columnar Processing

2019-03-26 Thread Bobby Evans
Cloudera reports a 26% improvement in Hive query runtimes from enabling vectorization. I would expect to see similar improvements, but at the cost of keeping more data in memory. Remember, though, that this also enables a number of different hardware acceleration techniques. If the data format is Arrow-compat…

Re: [DISCUSS] Spark Columnar Processing

2019-03-25 Thread Wenchen Fan
Do you have some initial perf numbers? It seems fine to me to remain row-based inside Spark with whole-stage-codegen, and convert rows to columnar batches when communicating with external systems. On Mon, Mar 25, 2019 at 1:05 PM Bobby Evans wrote: > This thread is to discuss adding in support fo
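Wenchen's suggestion above, rows internally and columnar only at the boundary, can be sketched minimally: transpose a list of rows into per-column arrays just before handing a batch to an external consumer. Types and names here are made up for illustration, not Spark APIs.

```java
import java.util.List;

// Minimal sketch, with made-up types: keep rows as the internal
// format and transpose to per-column arrays only at the external
// boundary (e.g. before handing an Arrow-style batch to another
// system).
public class RowToColumnarSketch {
    static final class Row {
        final int id;
        final double price;
        Row(int id, double price) { this.id = id; this.price = price; }
    }

    // One array per column: the "columnar batch" for this sketch.
    static int[] idCol;
    static double[] priceCol;

    static void toBatch(List<Row> rows) {
        idCol = new int[rows.size()];
        priceCol = new double[rows.size()];
        for (int i = 0; i < rows.size(); i++) {
            idCol[i] = rows.get(i).id;       // column 0
            priceCol[i] = rows.get(i).price; // column 1
        }
    }

    public static void main(String[] args) {
        toBatch(List.of(new Row(1, 9.5), new Row(2, 3.0)));
        System.out.println(idCol[1] + " " + priceCol[0]); // prints "2 9.5"
    }
}
```

The transpose cost is paid once per batch at the boundary, which is the trade-off this approach makes against keeping columnar formats throughout the engine.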

[DISCUSS] Spark Columnar Processing

2019-03-25 Thread Bobby Evans
This thread is to discuss adding in support for data frame processing using an in-memory columnar format compatible with Apache Arrow. My main goal in this is to lay the groundwork so we can add in support for GPU-accelerated processing of data frames, but this feature has a number of other benefi…