Re: Will Spark-SQL support vectorized query engine someday?
I don't know if there is a list, but in general, running a performance profiler can identify a lot of things...

On Tue, Jan 20, 2015 at 12:30 AM, Xuelin Cao xuelincao2...@gmail.com wrote:

Thanks, Reynold.

Regarding the lower-hanging fruit, can you give me some examples? Where can I find them in JIRA?

On Tue, Jan 20, 2015 at 3:55 PM, Reynold Xin r...@databricks.com wrote:

It will probably eventually make its way into part of the query engine, one way or another. Note that there is generally a lot of other lower-hanging fruit to pick before you have to do vectorization.

As far as I know, Hive doesn't really have vectorization, because the vectorization in Hive simply means writing everything to operate on small batches, in order to avoid the virtual function call overhead, and hoping the JVM can unroll some of the loops. There is no SIMD involved.

Something that is pretty useful, which isn't exactly vectorization but comes from similar lines of research, is being able to push predicates down into the columnar compression encoding. For example, one can turn string comparisons into integer comparisons. These will probably give much larger performance improvements in common queries.

On Mon, Jan 19, 2015 at 6:27 PM, Xuelin Cao xuelincao2...@gmail.com wrote:

Hi,

Correct me if I'm wrong. It looks like the current version of Spark-SQL uses a *tuple-at-a-time* model: each time, the physical operator produces one tuple by recursively calling child-execute.

There are papers that illustrate the benefits of a vectorized query engine, and Hive-Stinger has also embraced this style.

So, the question is: will Spark-SQL support vectorized query execution someday?

Thanks
Will Spark-SQL support vectorized query engine someday?
Hi,

Correct me if I'm wrong. It looks like the current version of Spark-SQL uses a *tuple-at-a-time* model: each time, the physical operator produces one tuple by recursively calling child-execute.

There are papers that illustrate the benefits of a vectorized query engine, and Hive-Stinger has also embraced this style.

So, the question is: will Spark-SQL support vectorized query execution someday?

Thanks
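For readers unfamiliar with the term, the tuple-at-a-time model described in the question above can be sketched roughly as follows. This is a generic pull-based iterator sketch in Python, not Spark SQL's actual operator code; the class names (Operator, Scan, Filter, Project) and the next() method are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Optional, Tuple

Row = Tuple[int, str]


class Operator:
    """Hypothetical base class for a pull-based physical operator."""

    def next(self):
        raise NotImplementedError


class Scan(Operator):
    """Leaf operator: hands out the stored rows one at a time."""

    def __init__(self, rows: List[Row]) -> None:
        self._it = iter(rows)

    def next(self) -> Optional[Row]:
        return next(self._it, None)


class Filter(Operator):
    """Pulls rows from its child until one satisfies the predicate."""

    def __init__(self, child: Operator, predicate: Callable[[Row], bool]) -> None:
        self.child = child
        self.predicate = predicate

    def next(self) -> Optional[Row]:
        while True:
            row = self.child.next()  # one call into the child per tuple
            if row is None or self.predicate(row):
                return row


class Project(Operator):
    """Applies a function to each row pulled from its child."""

    def __init__(self, child: Operator, fn: Callable[[Row], str]) -> None:
        self.child = child
        self.fn = fn

    def next(self) -> Optional[str]:
        row = self.child.next()
        return None if row is None else self.fn(row)


def collect(root: Operator) -> list:
    """Drain the plan by repeatedly asking the root for the next tuple."""
    out = []
    while True:
        row = root.next()
        if row is None:
            return out
        out.append(row)


rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
plan = Project(Filter(Scan(rows), lambda r: r[0] % 2 == 0), lambda r: r[1])
result = collect(plan)
```

Every output tuple costs one dynamically dispatched call per operator in the plan, which is the per-tuple overhead that batch-oriented (vectorized) engines try to amortize.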
Re: Will Spark-SQL support vectorized query engine someday?
It will probably eventually make its way into part of the query engine, one way or another. Note that there is generally a lot of other lower-hanging fruit to pick before you have to do vectorization.

As far as I know, Hive doesn't really have vectorization, because the vectorization in Hive simply means writing everything to operate on small batches, in order to avoid the virtual function call overhead, and hoping the JVM can unroll some of the loops. There is no SIMD involved.

Something that is pretty useful, which isn't exactly vectorization but comes from similar lines of research, is being able to push predicates down into the columnar compression encoding. For example, one can turn string comparisons into integer comparisons. These will probably give much larger performance improvements in common queries.

On Mon, Jan 19, 2015 at 6:27 PM, Xuelin Cao xuelincao2...@gmail.com wrote:

Hi,

Correct me if I'm wrong. It looks like the current version of Spark-SQL uses a *tuple-at-a-time* model: each time, the physical operator produces one tuple by recursively calling child-execute.

There are papers that illustrate the benefits of a vectorized query engine, and Hive-Stinger has also embraced this style.

So, the question is: will Spark-SQL support vectorized query execution someday?

Thanks
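The point in the reply above about pushing predicates into the columnar compression encoding can be illustrated with dictionary encoding. The sketch below is a minimal Python example under stated assumptions: the helper names dictionary_encode and filter_equals are hypothetical, and it does not reflect how Spark SQL or any particular columnar format implements this.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string with a small integer code (hypothetical helper)."""
    dictionary: List[str] = []   # code -> string
    index: Dict[str, int] = {}   # string -> code
    codes: List[int] = []        # the encoded column
    for v in values:
        code = index.get(v)
        if code is None:
            code = len(dictionary)
            index[v] = code
            dictionary.append(v)
        codes.append(code)
    return dictionary, codes


def filter_equals(dictionary: List[str], codes: List[int], literal: str) -> List[int]:
    """Evaluate `column = literal` on the encoded column.

    The literal is looked up in the dictionary once; the scan over the
    column then compares integers only and never touches a string.
    """
    try:
        target = dictionary.index(literal)
    except ValueError:
        return []  # literal is absent from the dictionary: no row can match
    return [row for row, code in enumerate(codes) if code == target]


column = ["US", "CN", "US", "DE", "CN", "US"]
dictionary, codes = dictionary_encode(column)
matches = filter_equals(dictionary, codes, "US")
```

Besides making each comparison cheaper, the dictionary lets the whole scan be skipped when the literal is not present at all, as the early return shows.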