The SPIP has been up for almost 6 days now with no real discussion on it. I am hopeful that means it's okay and we are good to call a vote on it, but I want to give everyone one last chance to take a look and comment. If there are no comments by tomorrow I think we will start a vote for this.
Thanks,

Bobby

On Fri, Apr 5, 2019 at 2:24 PM Bobby Evans <bo...@apache.org> wrote:

> I just filed SPARK-27396 as the SPIP for this proposal. Please use that JIRA for further discussions.
>
> Thanks for all of the feedback,
>
> Bobby
>
> On Wed, Apr 3, 2019 at 7:15 PM Bobby Evans <bo...@apache.org> wrote:
>
>> I am still working on the SPIP and should get it up in the next few days. I have the basic text more or less ready, but I want to get a high-level API concept ready too, just to have something more concrete. I have not done much contributing of new features to Spark, so I am not sure where a design document fits in here; neither http://spark.apache.org/improvement-proposals.html nor http://spark.apache.org/contributing.html mentions a design document anywhere. I am happy to put one up, but I was hoping the API concept would cover most of that.
>>
>> Thanks,
>>
>> Bobby
>>
>> On Tue, Apr 2, 2019 at 9:16 PM Renjie Liu <liurenjie2...@gmail.com> wrote:
>>
>>> Hi, Bobby:
>>> Do you have a design doc? I'm also interested in this topic and want to help contribute.
>>>
>>> On Tue, Apr 2, 2019 at 10:00 PM Bobby Evans <bo...@apache.org> wrote:
>>>
>>>> Thanks to everyone for the feedback.
>>>>
>>>> Overall the feedback has been really positive about exposing columnar data as a processing option to users. I'll write up a SPIP on the proposed changes to support columnar processing (not necessarily implement it) and then ping the list again for more feedback and discussion.
>>>>
>>>> Thanks again,
>>>>
>>>> Bobby
>>>>
>>>> On Mon, Apr 1, 2019 at 5:09 PM Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> I just realized I didn't make my stance here very clear ... here's another try:
>>>>>
>>>>> I think it's a no-brainer to have a good columnar UDF interface. This would facilitate a lot of high-performance applications, e.g. GPU-based acceleration for machine learning algorithms.
>>>>>
>>>>> On rewriting the entire internals of Spark SQL to leverage columnar processing, I don't see enough evidence to suggest that's a good idea yet.
>>>>>
>>>>> On Wed, Mar 27, 2019 at 8:10 AM, Bobby Evans <bo...@apache.org> wrote:
>>>>>
>>>>>> Kazuaki Ishizaki,
>>>>>>
>>>>>> Yes, ColumnarBatchScan does provide a framework for doing code generation for the processing of columnar data. I have to admit that I don't have a deep understanding of the code generation piece, so if I get something wrong please correct me. From what I have seen, only input formats currently inherit from ColumnarBatchScan, and from the comments in the trait
>>>>>>
>>>>>> /**
>>>>>>  * Generate [[ColumnVector]] expressions for our parent to consume as rows.
>>>>>>  * This is called once per [[ColumnarBatch]].
>>>>>>  */
>>>>>>
>>>>>> https://github.com/apache/spark/blob/956b52b1670985a67e49b938ac1499ae65c79f6e/sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala#L42-L43
>>>>>>
>>>>>> it appears that ColumnarBatchScan is really only intended to pull the data out of the batch, not to process that data in a columnar fashion. That is the Loading stage you mentioned.
>>>>>>
>>>>>> > The SIMDization or GPUization capability depends on a compiler that translates the code generated by the whole-stage codegen into native code.
>>>>>>
>>>>>> To be able to support vectorized processing, Hive stayed with pure Java and let the JVM detect and do the SIMDization of the code. To make that happen, they created loops that go through each element in a column and removed all conditionals from the bodies of the loops. To the best of my knowledge, that would still require a separate code path like the one I am proposing, so that the different processing phases generate code the JVM can compile down to SIMD instructions. The generated code today is full of null checks for each element, which would prevent the optimizations we want. Also, the intermediate results are often stored in UnsafeRow instances. That is really fast for row-based processing, but I believe the complexity of how they work would prevent the JVM from being able to vectorize the processing. If you have a better way to take Java code and vectorize it, we should put it into OpenJDK instead of Spark so everyone can benefit from it.
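>>>>>>
>>>>>> To make the loop shapes concrete, the difference is roughly the following (a simplified Scala sketch with made-up names, not actual generated code):
>>>>>>
>>>>>> // Row-at-a-time shape: the per-element null check inside the hot
>>>>>> // loop tends to block JVM auto-vectorization.
>>>>>> def addRows(a: Array[Int], aNull: Array[Boolean],
>>>>>>             b: Array[Int], bNull: Array[Boolean],
>>>>>>             out: Array[Int], outNull: Array[Boolean]): Unit = {
>>>>>>   var i = 0
>>>>>>   while (i < a.length) {
>>>>>>     if (aNull(i) || bNull(i)) outNull(i) = true
>>>>>>     else out(i) = a(i) + b(i)
>>>>>>     i += 1
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> // Hive-style columnar shape: no conditionals in either loop body, so
>>>>>> // the JVM can compile the arithmetic loop down to SIMD instructions;
>>>>>> // validity is combined in a separate branch-free pass.
>>>>>> def addColumns(a: Array[Int], aNull: Array[Boolean],
>>>>>>                b: Array[Int], bNull: Array[Boolean],
>>>>>>                out: Array[Int], outNull: Array[Boolean]): Unit = {
>>>>>>   var i = 0
>>>>>>   while (i < a.length) { out(i) = a(i) + b(i); i += 1 }
>>>>>>   i = 0
>>>>>>   while (i < a.length) { outNull(i) = aNull(i) || bNull(i); i += 1 }
>>>>>> }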
>>>>>>
>>>>>> Trying to compile directly from the generated Java code to something a GPU can process is something we are tackling, but we decided to go a different route from what you proposed. From talking with several compiler experts here at NVIDIA, my understanding is that IBM, in partnership with NVIDIA, attempted in the past to extend the JVM to run at least partially on GPUs, but it was really difficult to get right, especially with how Java does memory management and memory layout.
>>>>>>
>>>>>> To avoid that complexity we decided to split the JITing up into two separate pieces. I didn't mention any of this before because this discussion was intended to be just about the memory layout support, not GPU processing. The first piece would take the Catalyst AST and produce CUDA code directly from it. If done properly, we should be able to do the selection and projection phases within a single kernel. The biggest issue comes with UDFs, as they cannot easily be vectorized for the CPU or GPU. To deal with that, we have a prototype written by the compiler team, tackling SPARK-14083, which can translate basic UDFs into Catalyst expressions. If the UDF is too complicated or covers operations not yet supported, it falls back to the original UDF processing. I don't know how close the team is to submitting a SPIP or a patch for it, but I do know that they have some very basic operations working. The big issue is that it requires Java 11+ so it can use standard APIs to get the byte code of Scala UDFs.
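>>>>>>
>>>>>> As a rough illustration of the idea (this is not the prototype's actual code; the real translation works on byte code, and the example is deliberately trivial):
>>>>>>
>>>>>> import org.apache.spark.sql.catalyst.expressions.{Add, BoundReference, Expression, Literal}
>>>>>> import org.apache.spark.sql.types.IntegerType
>>>>>>
>>>>>> // A trivial UDF body: x => x + 1
>>>>>> val plusOne: Int => Int = x => x + 1
>>>>>>
>>>>>> // What the translation would produce for it: the same arithmetic as a
>>>>>> // Catalyst expression over the UDF's input column, which the normal
>>>>>> // codegen (or a GPU backend) can then handle like any other expression.
>>>>>> val input: Expression = BoundReference(0, IntegerType, nullable = true)
>>>>>> val translated: Expression = Add(input, Literal(1))
>>>>>>
>>>>>> // A UDF that is too complicated to translate simply keeps running
>>>>>> // through the existing UDF code path.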
>>>>>>
>>>>>> We split the JITing up this way because we thought it would be simplest to implement, and because it would provide a benefit to more than just GPU-accelerated queries.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Bobby
>>>>>>
>>>>>> On Tue, Mar 26, 2019 at 11:59 PM Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:
>>>>>>
>>>>>> Looks like an interesting discussion. Let me describe the current structure and the remaining issues. This is orthogonal to the cost-benefit trade-off discussion.
>>>>>>
>>>>>> The code generation basically consists of three parts:
>>>>>> 1. Loading
>>>>>> 2. Selection (map, filter, ...)
>>>>>> 3. Projection
>>>>>>
>>>>>> 1. Columnar storage (e.g. Parquet, Orc, Arrow, and the table cache) is well abstracted by the ColumnVector class (https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java). Combined with ColumnarBatchScan, the whole-stage code generation generates code that gets values directly from the columnar storage if there is no row-based operation.
>>>>>> Note: the current master does not support Arrow as a data source. However, I think it is not technically hard to support Arrow.
>>>>>>
>>>>>> 2. The current whole-stage codegen generates code for element-wise selection (excluding sort and join). The SIMDization or GPUization capability depends on a compiler that translates the code generated by the whole-stage codegen into native code.
>>>>>>
>>>>>> 3. The current Projection assumes row-oriented storage. I think that is the part that Wenchen pointed out.
>>>>>>
>>>>>> My slides https://www.slideshare.net/ishizaki/making-hardware-accelerator-easier-to-use/41 summarize the above issues and a possible implementation.
>>>>>>
>>>>>> FYI: NVIDIA will present an approach to exploiting GPUs with Arrow through Python at SAIS 2019: https://databricks.com/sparkaisummit/north-america/sessions-single-2019?id=110. I think it uses the Python UDF support with Arrow in Spark.
>>>>>>
>>>>>> P.S. I will give a presentation about in-memory data storage for Spark at SAIS 2019: https://databricks.com/sparkaisummit/north-america/sessions-single-2019?id=40 :)
>>>>>>
>>>>>> Kazuaki Ishizaki
>>>>>>
>>>>>> From: Wenchen Fan <cloud0...@gmail.com>
>>>>>> To: Bobby Evans <bo...@apache.org>
>>>>>> Cc: Spark dev list <dev@spark.apache.org>
>>>>>> Date: 2019/03/26 13:53
>>>>>> Subject: Re: [DISCUSS] Spark Columnar Processing
>>>>>> ------------------------------
>>>>>>
>>>>>> Do you have some initial perf numbers? It seems fine to me to remain row-based inside Spark with whole-stage codegen, and convert rows to columnar batches when communicating with external systems.
>>>>>>
>>>>>> On Mon, Mar 25, 2019 at 1:05 PM Bobby Evans <bo...@apache.org> wrote:
>>>>>>
>>>>>> This thread is to discuss adding support for data frame processing using an in-memory columnar format compatible with Apache Arrow. My main goal in this is to lay the groundwork so we can add support for GPU-accelerated processing of data frames, but this feature has a number of other benefits. Spark currently supports Apache Arrow formatted data as an option to exchange data with Python for pandas UDF processing. There has also been discussion around extending this to allow for exchanging data with other tools like pytorch, tensorflow, xgboost, etc. If Spark supported processing on Arrow-compatible data, it could eliminate the serialization/deserialization overhead when going between these systems. It would also allow for optimizations on a CPU with SIMD instructions, similar to what Hive currently supports. Accelerated processing using a GPU is something that we will start a separate discussion thread on, but I wanted to set the context a bit.
>>>>>>
>>>>>> Jason Lowe, Tom Graves, and I created a prototype over the past few months to try to understand how to make this work. What we are proposing is based on lessons learned while building that prototype, but we really wanted to get feedback early on from the community. We will file a SPIP once we can get agreement that this is a good direction to go in.
>>>>>>
>>>>>> The current support for columnar processing lets a Parquet or Orc file format return a ColumnarBatch inside an RDD[InternalRow] using Scala's type erasure. The code generation is aware that the RDD actually holds ColumnarBatches and generates code to loop through the data in each batch as InternalRows.
>>>>>>
>>>>>> Instead, we propose a new set of APIs to work on an RDD[InternalColumnarBatch] instead of abusing type erasure. With this, we propose adding a Rule similar to how WholeStageCodeGen currently works. Each part of the physical SparkPlan would expose columnar support through a combination of traits and method calls. The rule would then decide when columnar processing would start and when it would end. Switching between columnar and row-based processing is not free, so the rule would make a decision based on an estimate of the cost to do the transformation and the estimated speedup in processing time.
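>>>>>>
>>>>>> To give a feel for the shape of the API we have in mind, here is a minimal sketch (all of the names here are placeholders, not a final design):
>>>>>>
>>>>>> import org.apache.spark.rdd.RDD
>>>>>> import org.apache.spark.sql.vectorized.ColumnarBatch
>>>>>>
>>>>>> // Trait a physical operator could mix in to advertise columnar
>>>>>> // support alongside its existing row-based execution.
>>>>>> trait ColumnarSupport {
>>>>>>   // True if this operator can produce columnar output.
>>>>>>   def supportsColumnar: Boolean
>>>>>>
>>>>>>   // Columnar counterpart of doExecute; only called when the rule has
>>>>>>   // decided this section of the plan should run columnar.
>>>>>>   def doExecuteColumnar(): RDD[ColumnarBatch]
>>>>>> }
>>>>>>
>>>>>> The rule would walk the physical plan, find runs of operators that report columnar support, and insert explicit row-to-columnar and columnar-to-row transitions at the boundaries, using the cost estimates to decide whether each run is worth converting.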
>>>>>>
>>>>>> This should allow us to disable columnar support by simply disabling the rule that modifies the physical SparkPlan. It should pose minimal risk to the existing row-based code path, as that code should not be touched, and in many cases it could be reused to implement the columnar version. This also allows for small, easily manageable patches, not huge patches that no one wants to review.
>>>>>>
>>>>>> As far as the memory layout is concerned, OnHeapColumnVector and OffHeapColumnVector are already really close to being Apache Arrow compatible, so shifting them over would be a relatively simple change. Alternatively, we could add a new implementation that is Arrow compatible if there are reasons to keep the old ones.
>>>>>>
>>>>>> Again, this is just to get the discussion started. Any feedback is welcome, and we will file a SPIP on it once we feel that the major changes we are proposing are acceptable.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Bobby Evans
>>>>>>
>>>
>>> --
>>> Renjie Liu
>>> Software Engineer, MVAD