Re: Analyzing and reusing cached Datasets

2016-11-20 Thread Jacek Laskowski
Hi Michael, Thanks a lot for your prompt answer. I greatly appreciate it. Having said that, I think we might be...cough...cough...wrong :) I think the "issue" is in QueryPlan.sameResult [1] as its scaladoc says: * Since its likely undecidable to generally determine if two given plans will

Re: Develop custom Estimator / Transformer for pipeline

2016-11-20 Thread Georg Heiler
The estimator should perform data cleaning tasks. This means some rows will be dropped, some columns dropped, some columns added, some values replaced in existing columns. It should also store the mean or min for some numeric columns as a NaN replacement. However, override def
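
The fit/transform split described in this message can be sketched without Spark. This is a minimal, library-free illustration, not Spark's Estimator API: `CleaningEstimator`, `CleaningModel`, and the column names are hypothetical, and rows are assumed to be plain dicts. `fit` learns per-column means and returns a model; `transform` drops the unwanted columns and replaces NaN with the stored mean.

```python
import math


def _is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


class CleaningEstimator:
    """Hypothetical estimator: learns per-column means used as NaN replacements."""

    def __init__(self, numeric_cols, drop_cols=()):
        self.numeric_cols = list(numeric_cols)
        self.drop_cols = set(drop_cols)

    def fit(self, rows):
        means = {}
        for col in self.numeric_cols:
            values = [r[col] for r in rows if not _is_missing(r.get(col))]
            # Assumption: fall back to 0.0 when a column has no usable values.
            means[col] = sum(values) / len(values) if values else 0.0
        return CleaningModel(means, self.drop_cols)


class CleaningModel:
    """Hypothetical fitted model: applies the statistics learned in fit()."""

    def __init__(self, means, drop_cols):
        self.means = means
        self.drop_cols = drop_cols

    def transform(self, rows):
        cleaned_rows = []
        for r in rows:
            cleaned = {k: v for k, v in r.items() if k not in self.drop_cols}
            for col, mean in self.means.items():
                if col in cleaned and _is_missing(cleaned[col]):
                    cleaned[col] = mean
            cleaned_rows.append(cleaned)
        return cleaned_rows
```

The point of the split is that the statistics are computed once on the training data and then reused unchanged on any later data, which is what storing the mean or min "as a NaN replacement" requires.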

github mirroring is broken

2016-11-20 Thread Reynold Xin
FYI, GitHub mirroring from Apache's official git repo to GitHub has been broken since Sat Nov 19, and as a result GitHub is now stale. Merged pull requests won't show up in GitHub until ASF infra fixes the issue.

Re: OutOfMemoryError on parquet SnappyDecompressor

2016-11-20 Thread Aniket
Was anyone able to find a solution or recommended conf for this? I am running into the same "java.lang.OutOfMemoryError: Direct buffer memory" but during snappy compression. Thanks, Aniket On Tue, Sep 23, 2014 at 7:04 PM Aaron Davidson [via Apache Spark Developers List]

Re:Re: Multiple streaming aggregations in structured streaming

2016-11-20 Thread Xinyu Zhang
MapWithState is also very useful. I want to calculate UV in real time, but "distinct count" and "multiple streaming aggregations" are not supported. Is there any method to calculate real-time UV in the current version? At 2016-11-19 06:01:45, "Michael Armbrust"
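
The idea of a running UV (unique visitor) count built on keyed state can be sketched outside Spark. This is a minimal, library-free illustration under stated assumptions, not a Spark API and not an answer from the thread: `UniqueVisitorCounter` is a hypothetical name, and events are assumed to arrive as batches of `(page, visitor_id)` pairs. The state kept per key is the set of visitor ids seen so far, and each batch updates it.

```python
from collections import defaultdict


class UniqueVisitorCounter:
    """Hypothetical running distinct count, one set of visitor ids per page."""

    def __init__(self):
        # State carried across batches: page -> set of visitor ids seen so far.
        self._seen = defaultdict(set)

    def update(self, batch):
        """Fold one batch of (page, visitor_id) pairs into the state.

        Returns the current unique-visitor count for every page seen so far.
        """
        for page, visitor_id in batch:
            self._seen[page].add(visitor_id)
        return {page: len(ids) for page, ids in self._seen.items()}
```

Exact sets grow with the number of distinct visitors, so for large streams an approximate structure such as HyperLogLog is commonly substituted for the set.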