Hi,

I am speaking at Spark Summit Europe on exploiting GPUs for columnar
DataFrame operations
<https://spark-summit.org/eu-2015/events/exploiting-gpus-for-columnar-dataframe-operations/>.
I have been going through the various blogs, talks, and JIRAs from the key
Spark folks, trying to figure out where to make the changes for this
proposal.

First of all, I must thank everyone involved in the recent progress on
Project Tungsten, which has made my job easier. The changes for code
generation
<https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala>
make it possible for me to generate OpenCL code for expressions instead of
the existing Java/Scala code, and to run that OpenCL code on GPUs through
the Java library JavaCL.
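
To make this concrete, below is a minimal, self-contained sketch of the kind
of code I intend to generate and run, following the JavaCL getting-started
examples; the kernel string stands in for what a modified CodeGenerator
would emit for a simple expression like a + b:

    import com.nativelibs4java.opencl._
    import com.nativelibs4java.opencl.CLMem.Usage
    import org.bridj.Pointer.allocateFloats

    object GpuExprSketch {
      def main(args: Array[String]): Unit = {
        // Stand-in for generated code: an OpenCL kernel for (a + b).
        val kernelSrc =
          """__kernel void add_cols(__global const float* a,
            |                       __global const float* b,
            |                       __global float* out, int n) {
            |  int i = get_global_id(0);
            |  if (i < n) out[i] = a[i] + b[i];
            |}""".stripMargin

        val context = JavaCL.createBestContext() // prefers a GPU device
        val queue   = context.createDefaultQueue()
        val n       = 1024

        // Fill host-side column vectors.
        val aPtr = allocateFloats(n).order(context.getByteOrder)
        val bPtr = allocateFloats(n).order(context.getByteOrder)
        for (i <- 0 until n) { aPtr.set(i, i.toFloat); bPtr.set(i, 2f * i) }

        // Send the columns to GPU RAM, invoke the kernel, read results back.
        val a   = context.createFloatBuffer(Usage.Input, aPtr)
        val b   = context.createFloatBuffer(Usage.Input, bPtr)
        val out = context.createFloatBuffer(Usage.Output, n.toLong)

        val kernel = context.createProgram(kernelSrc).createKernel("add_cols")
        kernel.setArgs(a, b, out, Integer.valueOf(n))
        val evt    = kernel.enqueueNDRange(queue, Array(n))
        val outPtr = out.read(queue, evt) // blocks until the kernel finishes

        println(s"out(0) = ${outPtr.get(0)}, out(n-1) = ${outPtr.get(n - 1)}")
        context.release()
      }
    }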

However, before starting the work, I have a few questions:


   1. From the blogs
   https://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html,
   https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
   and
   https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html,
   I found where code generation happens in the Spark code
   <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala>.
   However, I could not find where the generated code is executed. A major
   part of my changes will be there, since this executor will now have to
   send vectors of columns to GPU RAM, invoke execution, and get the results
   back to CPU RAM. Thus, the existing executor will change significantly.
   2. On the Project Tungsten blog
   <https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html>,
   in the third section, Code Generation, it is mentioned that you plan to
   increase the level of code generation from record-at-a-time expression
   evaluation to vectorized expression evaluation. Has this been implemented?
   If not, how do I implement it? I will need access to the columnar
   ByteBuffer objects in the DataFrame to do this; row-by-row access to the
   data would defeat the purpose of this exercise (see the first sketch
   after this list). In particular, I need access to
   https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala
   in the executor of the generated code.
   3. One thing that confuses me is the change from 1.4 to 1.5, possibly due
   to the JIRA https://issues.apache.org/jira/browse/SPARK-7956 and pull
   request https://github.com/apache/spark/pull/6479/files. This changed the
   code generation from quasiquotes (q) to the string interpolation (s)
   operator, which makes it simpler for me to generate OpenCL code, since
   that code is string based (see the second sketch after this list). The
   question is: is this branch stable now? Should I make my changes on the
   Spark 1.4 branch, the 1.5 branch, or master?
   4. How do I tune the batch size (the number of rows in the ByteBuffer)?
   Is it through the property spark.sql.inMemoryColumnarStorage.batchSize
   (see the third sketch after this list)?
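
Regarding question 2, here is a toy illustration (the function names are
mine, not Spark's) of the two evaluation styles for an expression like
a + b; only the vectorized form hands me whole column vectors of the shape
a GPU kernel consumes, one work item per element:

    // Record-at-a-time: the generated code is invoked once per row.
    def evalRow(a: Float, b: Float): Float = a + b

    // Vectorized: the generated code receives whole column vectors.
    def evalBatch(a: Array[Float], b: Array[Float], out: Array[Float]): Unit = {
      var i = 0
      while (i < a.length) { out(i) = a(i) + b(i); i += 1 }
    }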

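Regarding question 3, the reason this change helps me: with the s
interpolator the generated code is just a plain string, so the same
machinery can emit OpenCL as easily as Java. A rough sketch (the function
names and the a + b expression are illustrative only, not Spark's API):

    // A Java-style fragment, the kind of string the 1.5 code generator builds:
    def genJava(lhs: String, rhs: String): String =
      s"final float result = $lhs + $rhs;"

    // The same string-based approach emitting an OpenCL kernel instead:
    def genOpenCL(lhs: String, rhs: String): String =
      s"""__kernel void eval(__global const float* $lhs,
         |                   __global const float* $rhs,
         |                   __global float* out, int n) {
         |  int i = get_global_id(0);
         |  if (i < n) out[i] = $lhs[i] + $rhs[i];
         |}""".stripMargin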

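Regarding question 4, I am assuming this is the same property that controls
the in-memory columnar batch size (default 10000) and that it can simply be
set on the SQLContext before caching:

    // Rows per columnar batch; larger batches would mean larger vectors
    // per GPU transfer (my assumption, to be validated by profiling).
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "100000")
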
Thanks in anticipation,

Kiran
PS:

Other things I found useful were:

*Spark DataFrames*: https://www.brighttalk.com/webcast/12891/166495
*Apache Spark 1.5*: https://www.brighttalk.com/webcast/12891/168177

The links to JavaCL/ScalaCL:

*Library to execute OpenCL code through Java*:
https://github.com/nativelibs4java/JavaCL
*Library to convert Scala code to OpenCL and execute on GPUs*:
https://github.com/nativelibs4java/ScalaCL
