Hi Maciej,
FYI, this fix has been submitted at https://github.com/apache/spark/pull/16785.
Liang-Chi Hsieh wrote
> Hi Maciej,
>
> After looking into the details of the time spent on preparing the executed
> plan, the cause of the significant difference between 1.6 and current
> codebase when
Hi Maciej,
After looking into the details of the time spent on preparing the executed
plan, the cause of the significant difference between 1.6 and the current
codebase when running the example is the optimization process that generates
constraints.
There seem to be a few operations in generating
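To make that concrete, here is a rough spark-shell style sketch (the table size, alias count and filter are made up, not the original example) of the kind of query shape whose constraints the optimizer has to propagate: a filtered relation whose column is then aliased many times.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("constraints-shape").getOrCreate()

// A filtered relation whose single column is aliased many times.
// Each alias lets the existing constraints (id > 10, isnotnull(id)) be
// restated in terms of the new column, so the constraint set keeps growing.
val filtered = spark.range(1000L).toDF("id").filter(col("id") > 10)
val manyAliases = (1 to 100).foldLeft(filtered) { (df, i) =>
  df.withColumn(s"id_$i", col("id"))
}

// Forcing the optimized plan runs the optimizer, constraint generation included.
manyAliases.queryExecution.optimizedPlan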
CentOS 7.1,
Linux version 3.10.0-229.el7.x86_64 (buil...@kbuilder.dev.centos.org) (gcc
version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC) ) #1 SMP Fri Mar 6 11:36:42
UTC 2015
Michael Allman-2 wrote
> Hi Stan,
>
> What OS/version are you using?
>
> Michael
>
>> On Jan 22, 2017, at 11:36 PM,
We are just 4 days away from closing the CFP for Spark Summit 2017.
We have expanded the tracks in SF to include sessions that focus on AI and
Machine Learning, as well as a 60-minute deep-dive track with technical demos.
Submit your presentation today and join us for the 10th Spark Summit!
Hurry, the CFP
Hello,
My name is Gabriel Cristache and I am in my final year of a Computer
Engineering/Science degree. For my Bachelor's thesis I want to add support
for dynamic scaling to a Spark Streaming application.
*The goal of the project is to develop an algorithm that automatically
scales
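As a possible starting point, here is only a sketch of the idea, with an arbitrary threshold and step size; it assumes a cluster manager that honors executor requests (e.g. YARN): a StreamingListener that watches the scheduling delay of each batch and asks for more executors when it grows too large.

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Naive scaling hook: the threshold and step size are placeholders.
class NaiveScalingListener(ssc: StreamingContext) extends StreamingListener {
  private val delayThresholdMs = 5000L  // arbitrary; would need tuning per workload

  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val schedulingDelay = batch.batchInfo.schedulingDelay.getOrElse(0L)
    if (schedulingDelay > delayThresholdMs) {
      // Developer API; only honored by cluster managers that support it (e.g. YARN).
      ssc.sparkContext.requestExecutors(1)
    }
  }
}

// Registered before starting the context:
//   ssc.addStreamingListener(new NaiveScalingListener(ssc))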
Hi Maciej,
Thanks for the info you provided.
I tried to run the same example on 1.6 and on the current branch and
recorded the difference in the time spent preparing the executed plan.
Current branch:
292 ms
95 ms
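In case you want to reproduce the measurement, something along these lines should give comparable numbers (a sketch only, not necessarily the exact way the timings above were taken):

import org.apache.spark.sql.DataFrame

// Times the plan-preparation step for a DataFrame. Forcing
// queryExecution.executedPlan runs optimization and physical planning
// (analysis already ran when the DataFrame was created). Only the first
// call is meaningful, since the prepared plan is cached afterwards.
def timePlanPreparation(df: DataFrame): Long = {
  val start = System.nanoTime()
  df.queryExecution.executedPlan
  (System.nanoTime() - start) / 1000000  // milliseconds
}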
Hi All,
I've done a bit more digging into where exactly this happens. It seems like
the schema is inferred again after the data leaves the source and comes
into the sink.
Below is a stack trace; the schema at the BigQuerySource has a LongType for
customer id, but then at the sink, the data
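One workaround that might help here (a sketch only; "customer_id" stands in for the actual field name and streamingDf for the DataFrame coming from the BigQuery source) is to pin the column type explicitly before the data reaches the sink, so a re-inferred schema cannot silently change it:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// Force the column to the intended type before writing to the sink.
val pinned = streamingDf.withColumn("customer_id", col("customer_id").cast(LongType))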
Hi Liang-Chi,
Thank you for your answer and PR, but I think I wasn't specific
enough. In hindsight I should have illustrated this better. What really
troubles me here is a pattern of growing delays. Difference between
1.6.3 (roughly 20s runtime since the first job):
1.6 timeline
vs 2.1.0
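The pattern is easy to see with a loop of this shape (a spark-shell style sketch with a made-up transformation and iteration count, not my actual job): each round extends the logical plan a bit and prints how long the round took, which is the kind of per-iteration growth I mean.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("growing-delays").getOrCreate()

// Each round adds a column and a filter to the plan, triggers a job,
// and prints the wall-clock time of the round.
var df = spark.range(1000L).toDF("id")
for (i <- 1 to 30) {
  val start = System.nanoTime()
  df = df.withColumn(s"c$i", col("id") * i).filter(col(s"c$i") >= 0)
  df.count()
  println(f"round $i%2d: ${(System.nanoTime() - start) / 1e6}%.0f ms")
}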
Thanks Nick for pointing it out. I totally agree.
In the 1.6 codebase, Pipeline actually uses DataFrame instead of Dataset,
because the two were not yet merged in 1.6.
StringIndexer and OneHotEncoder call ".rdd" on the Dataset, and this
deserializes the rows.
In 1.6, as they use
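To illustrate the ".rdd" point, this is roughly the shape of what a StringIndexer-style fit does (a paraphrase, not the exact Spark source): once the code drops from the Dataset to the RDD API, every internal binary row has to be deserialized into a Row before the per-record function can run.

import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

// Paraphrased StringIndexer-style label counting. The ".rdd" call is the
// point where rows are deserialized out of the internal binary format.
def countLabels(dataset: Dataset[_], inputCol: String): Map[String, Long] = {
  dataset
    .select(col(inputCol).cast(StringType))
    .rdd
    .map { case Row(label: String) => label }
    .countByValue()
    .toMap
}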