Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Liang-Chi Hsieh
Hi Maciej, FYI, the fix has been submitted at https://github.com/apache/spark/pull/16785. Liang-Chi Hsieh wrote > Hi Maciej, > > After looking into the details of the time spent on preparing the executed > plan, the cause of the significant difference between 1.6 and current > codebase when

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Liang-Chi Hsieh
Hi Maciej, After looking into the details of the time spent preparing the executed plan, the cause of the significant difference between 1.6 and the current codebase when running the example is the optimization process that generates constraints. There seem to be a few operations in generating

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-02-02 Thread StanZhai
CentOS 7.1, Linux version 3.10.0-229.el7.x86_64 (buil...@kbuilder.dev.centos.org) (gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC) ) #1 SMP Fri Mar 6 11:36:42 UTC 2015 Michael Allman-2 wrote > Hi Stan, > > What OS/version are you using? > > Michael > >> On Jan 22, 2017, at 11:36 PM,

4 days left to submit your abstract to Spark Summit SF

2017-02-02 Thread Scott walent
We are just 4 days away from closing the CFP for Spark Summit 2017. We have expanded the tracks in SF to include sessions that focus on AI, Machine Learning and a 60 min deep dive track with technical demos. Submit your presentation today and join us for the 10th Spark Summit! Hurry, the CFP

Apache Spark Contribution

2017-02-02 Thread Gabi Cristache
Hello, My name is Gabriel Cristache and I am a student in my final year of a Computer Engineering/Science University. For my Bachelor's thesis, I want to add support for dynamic scaling to a Spark Streaming application. The goal of the project is to develop an algorithm that automatically scales

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Liang-Chi Hsieh
Hi Maciej, Thanks for the info you provided. I ran the same example with 1.6 and the current branch and recorded the difference in the time spent preparing the executed plan. Current branch: 292 ms 95 ms

Re: Structured Streaming Schema Issue

2017-02-02 Thread Sam Elamin
Hi all, I've done a bit more digging into where exactly this happens. It seems the schema is inferred again after the data leaves the source and comes into the sink. Below is a stack trace: the schema at the BigQuerySource has a LongType for customer id, but then at the sink, the data

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Maciej Szymkiewicz
Hi Liang-Chi, Thank you for your answer and the PR, but I think I wasn't specific enough. In hindsight I should have illustrated this better. What really troubles me here is a pattern of growing delays. Difference between 1.6.3 (roughly 20s runtime since the first job): 1.6 timeline vs 2.1.0

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Liang-Chi Hsieh
Thanks, Nick, for pointing it out. I totally agree. In the 1.6 codebase, Pipeline actually uses DataFrame instead of Dataset, because the two were not yet merged in 1.6. StringIndexer and OneHotEncoder call ".rdd" on the Dataset, which deserializes the rows. In 1.6, as they use