I don't know if there is a list, but in general, running a performance profiler can identify a lot of things...

On Tue, Jan 20, 2015 at 12:30 AM, Xuelin Cao <xuelincao2...@gmail.com> wrote:

> Thanks, Reynold
>
> Regarding the "lower hanging fruits", can you give me some examples?
> Where can I find them in JIRA?
>
>
> On Tue, Jan 20, 2015 at 3:55 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> It will probably eventually make its way into part of the query engine,
>> one way or another. Note that there are in general a lot of other lower
>> hanging fruits before you have to do vectorization.
>>
>> As far as I know, Hive doesn't really have vectorization because the
>> vectorization in Hive is simply writing everything in small batches, in
>> order to avoid the virtual function call overhead, and hoping the JVM can
>> unroll some of the loops. There is no SIMD involved.
>>
>> Something that is pretty useful, which isn't exactly from vectorization
>> but comes from similar lines of research, is being able to push predicates
>> down into the columnar compression encoding. For example, one can turn
>> string comparisons into integer comparisons. These will probably give much
>> larger performance improvements in common queries.
>>
>>
>> On Mon, Jan 19, 2015 at 6:27 PM, Xuelin Cao <xuelincao2...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Correct me if I am wrong. It looks like the current version of
>>> Spark-SQL is a *tuple-at-a-time* model. Basically, each time, the physical
>>> operator produces a tuple by recursively calling child->execute.
>>>
>>> There are papers that illustrate the benefits of a vectorized query
>>> engine, and Hive-Stinger also embraces this style.
>>>
>>> So, the question is, will Spark-SQL support vectorized query
>>> execution someday?
>>>
>>> Thanks
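
For anyone following the thread, here is a rough sketch of the point quoted above about pushing a predicate down into the columnar compression encoding, so that string comparisons become integer comparisons. It assumes a plain dictionary encoding of a string column. The function names (`dictionary_encode`, `filter_equals`) and the sample data are hypothetical, made up for illustration, and are not Spark or Hive APIs.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Encode a string column as (dictionary, integer codes).

    Each distinct string gets a small integer code in order of first
    appearance; the column itself is stored as the list of codes.
    """
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        code = index.get(value)
        if code is None:
            code = len(dictionary)
            index[value] = code
            dictionary.append(value)
        codes.append(code)
    return dictionary, codes


def filter_equals(dictionary: List[str], codes: List[int], literal: str) -> List[int]:
    """Return the row positions where the column equals `literal`.

    The string literal is looked up in the dictionary once per query;
    the per-row work is then an integer comparison on the codes.
    """
    try:
        target = dictionary.index(literal)
    except ValueError:
        # The literal never occurs in the column, so no row can match.
        return []
    return [row for row, code in enumerate(codes) if code == target]


# Hypothetical sample column.
column = ["US", "CN", "US", "DE", "CN", "US"]
dictionary, codes = dictionary_encode(column)
matches = filter_equals(dictionary, codes, "US")
print(dictionary, codes, matches)
```

The design point is that the predicate is evaluated against the encoded data directly, without decoding each row back into a string first. A real engine would do this inside the columnar storage layer rather than over Python lists.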