Re: Will Spark-SQL support vectorized query engine someday?

2015-01-20 Thread Reynold Xin
I don't know if there is a list, but in general, running a performance
profiler can identify a lot of things...

On Tue, Jan 20, 2015 at 12:30 AM, Xuelin Cao xuelincao2...@gmail.com
wrote:


 Thanks, Reynold

   Regarding the lower-hanging fruit, can you give me some examples?
 Where can I find them in JIRA?


 On Tue, Jan 20, 2015 at 3:55 PM, Reynold Xin r...@databricks.com wrote:

 It will probably eventually make its way into part of the query engine,
 one way or another. Note that there is, in general, a lot of other
 lower-hanging fruit to pick before you have to do vectorization.

 As far as I know, Hive doesn't really have vectorization in the SIMD
 sense: the vectorization in Hive simply processes everything in small
 batches, in order to avoid the virtual function call overhead, and hopes
 the JVM can unroll some of the loops. There is no SIMD involved.

 Something that is pretty useful, which isn't exactly from vectorization
 but comes from similar lines of research, is being able to push predicates
 down into the columnar compression encoding. For example, one can turn
 string comparisons into integer comparisons. These will probably give much
 larger performance improvements in common queries.


 On Mon, Jan 19, 2015 at 6:27 PM, Xuelin Cao xuelincao2...@gmail.com
 wrote:

 Hi,

  Correct me if I am wrong. It looks like the current version of
 Spark-SQL uses a *tuple-at-a-time* model. Basically, each physical
 operator produces one tuple at a time by recursively calling child-execute.

  There are papers that illustrate the benefits of a vectorized query
 engine, and Hive-Stinger also embraces this style.

  So, the question is: will Spark-SQL support vectorized query
 execution someday?

  Thanks






Will Spark-SQL support vectorized query engine someday?

2015-01-19 Thread Xuelin Cao
Hi,

 Correct me if I am wrong. It looks like the current version of
Spark-SQL uses a *tuple-at-a-time* model. Basically, each physical
operator produces one tuple at a time by recursively calling child-execute.
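
To make the tuple-at-a-time model concrete, here is a minimal sketch in
Python. It is not Spark-SQL's actual code: the Scan, Filter, and Project
classes and their execute method are hypothetical stand-ins, and the sample
rows are made up. Each operator pulls one row at a time from its child, so
every row pays for one call per operator in the plan.

```python
from typing import Callable, Iterator, List, Tuple

Row = Tuple  # a row is just a tuple of column values in this sketch


class Scan:
    """Leaf operator (hypothetical): emits stored rows one at a time."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows

    def execute(self) -> Iterator[Row]:
        for row in self.rows:
            yield row


class Filter:
    """Pulls rows from its child and forwards those matching the predicate."""

    def __init__(self, child, predicate: Callable[[Row], bool]) -> None:
        self.child = child
        self.predicate = predicate

    def execute(self) -> Iterator[Row]:
        for row in self.child.execute():  # recursive call into the child
            if self.predicate(row):
                yield row


class Project:
    """Pulls rows from its child and maps each one to a new row."""

    def __init__(self, child, mapper: Callable[[Row], Row]) -> None:
        self.child = child
        self.mapper = mapper

    def execute(self) -> Iterator[Row]:
        for row in self.child.execute():
            yield self.mapper(row)


rows = [(1, "a"), (2, "b"), (3, "c")]
plan = Project(Filter(Scan(rows), lambda r: r[0] > 1), lambda r: (r[1],))
result = list(plan.execute())
```

Here result holds the projected rows that passed the filter, each of which
travelled through the whole operator chain individually.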

 There are papers that illustrate the benefits of a vectorized query
engine, and Hive-Stinger also embraces this style.

 So, the question is: will Spark-SQL support vectorized query
execution someday?

 Thanks


Re: Will Spark-SQL support vectorized query engine someday?

2015-01-19 Thread Reynold Xin
It will probably eventually make its way into part of the query engine, one
way or another. Note that there is, in general, a lot of other lower-hanging
fruit to pick before you have to do vectorization.

As far as I know, Hive doesn't really have vectorization in the SIMD sense:
the vectorization in Hive simply processes everything in small batches, in
order to avoid the virtual function call overhead, and hopes the JVM can
unroll some of the loops. There is no SIMD involved.
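
A rough sketch of the batch-at-a-time idea, in Python for brevity. This is
not Hive's implementation: the function names and the tiny BATCH_SIZE are
assumptions chosen for illustration. The point is only that each operator
is invoked once per batch and then runs a tight loop over that batch,
instead of being invoked once per row.

```python
from typing import Iterator, List

BATCH_SIZE = 4  # deliberately tiny; an assumption for illustration only


def scan_batches(values: List[int], batch_size: int = BATCH_SIZE) -> Iterator[List[int]]:
    """Yield the column in fixed-size batches instead of one value per call."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


def filter_greater_than(batches: Iterator[List[int]], threshold: int) -> Iterator[List[int]]:
    """Apply the predicate in a tight inner loop, one operator call per batch."""
    for batch in batches:
        yield [v for v in batch if v > threshold]


values = list(range(10))
batches = list(scan_batches(values))
filtered = [v for batch in filter_greater_than(iter(batches), 5) for v in batch]
```

With ten values and a batch size of four, the filter handles three batches
rather than ten individual rows. On the JVM the equivalent inner loop over
a primitive array is what gives the JIT a chance to unroll it.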

Something that is pretty useful, which isn't exactly from vectorization but
comes from similar lines of research, is being able to push predicates down
into the columnar compression encoding. For example, one can turn string
comparisons into integer comparisons. These will probably give much larger
performance improvements in common queries.
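
One way to picture turning string comparisons into integer comparisons is
dictionary encoding, sketched below in Python. This is an assumed
illustration, not Spark-SQL's columnar code: dictionary_encode and
filter_equals are hypothetical helpers and the sample column is made up.
The string literal is looked up in the dictionary once, and the scan then
compares only integer codes.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[Dict[str, int], List[int]]:
    """Replace each distinct string with a small integer code."""
    dictionary: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        # Assign the next unused code the first time a string is seen.
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    return dictionary, codes


def filter_equals(dictionary: Dict[str, int], codes: List[int], literal: str) -> List[int]:
    """Return row positions where the column equals `literal`."""
    target = dictionary.get(literal)  # single string lookup for the whole scan
    if target is None:
        return []  # literal never occurs, so no row can match
    return [i for i, code in enumerate(codes) if code == target]  # integer compares


column = ["US", "CN", "US", "DE", "CN"]
dictionary, codes = dictionary_encode(column)
matches = filter_equals(dictionary, codes, "CN")
```

A literal that is absent from the dictionary short-circuits the scan
entirely, which is a further saving that comes for free with this encoding.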


On Mon, Jan 19, 2015 at 6:27 PM, Xuelin Cao xuelincao2...@gmail.com wrote:

 Hi,

  Correct me if I am wrong. It looks like the current version of
 Spark-SQL uses a *tuple-at-a-time* model. Basically, each physical
 operator produces one tuple at a time by recursively calling child-execute.

  There are papers that illustrate the benefits of a vectorized query
 engine, and Hive-Stinger also embraces this style.

  So, the question is: will Spark-SQL support vectorized query
 execution someday?

  Thanks