Re: Design document - MLlib's statistical package for DataFrames

2017-02-16 Thread bradc
Hi, While it is also missing in spark.mllib, I'd suggest adding cardinality as part of the simple descriptive statistics for both spark.ml and spark.mllib. This is useful even for data in double-precision FP to understand the "uniqueness" of the feature data. Cheers, Brad

Re: Spark Improvement Proposals

2017-02-16 Thread Ryan Blue
> [The shepherd] can advise on technical and procedural considerations for people outside the community The sentiment is good, but this doesn't justify requiring a shepherd for a proposal. There are plenty of people that wouldn't need this, would get feedback during discussion, or would ask a

Design document - MLlib's statistical package for DataFrames

2017-02-16 Thread Tim Hunter
Hello all, I have been looking at some of the missing items for complete feature parity between spark.ml and spark.mllib. Here is a proposal for porting mllib.stats, the descriptive statistics package:

Re: Structured Streaming Spark Summit Demo - Databricks people

2017-02-16 Thread Sam Elamin
Thanks Michael, it really was a great demo. I figured I needed to add a trigger to display the results. But Buraz from Databricks mentioned here that the display on this functionality won't be

Re: Spark Improvement Proposals

2017-02-16 Thread Sam Elamin
Hi folks, I thought I'd chime in as someone new to the process, so feel free to disregard it if it doesn't make sense. I definitely agree that we need a new forum to identify or discuss changes, as JIRA isn't exactly the best place to do that; it's a bug tracker first and foremost. For example I was

Re: Structured Streaming Spark Summit Demo - Databricks people

2017-02-16 Thread Michael Armbrust
Thanks for your interest in Apache Spark Structured Streaming! There is nothing secret in that demo, though I did make some configuration changes in order to get the timing right (gotta have some dramatic effect :) ). Also I think the visualizations based on metrics output by the

Re: [build system] jenkins restart in ~1 hour

2017-02-16 Thread shane knapp
and we're back! :) On Thu, Feb 16, 2017 at 10:22 AM, shane knapp wrote: > we don't have many builds running right now, and i need to restart the > daemon quickly to enable a new plugin. > > i'll wait until the pull request builder jobs are finished and then > (gently) kick

[build system] jenkins restart in ~1 hour

2017-02-16 Thread shane knapp
we don't have many builds running right now, and i need to restart the daemon quickly to enable a new plugin. i'll wait until the pull request builder jobs are finished and then (gently) kick jenkins. updates as they come, shane (who's always nervous about touching this house of cards)

Re: File JIRAs for all flaky test failures

2017-02-16 Thread Reynold Xin
Josh's tool should give enough signal there already. I don't think we need some manual process to document them. If you want to work on those that'd be great. I bet you will get a lot of love because all developers hate flaky tests. On Thu, Feb 16, 2017 at 6:19 PM, Saikat Kanjilal

Re: File JIRAs for all flaky test failures

2017-02-16 Thread Saikat Kanjilal
I am specifically suggesting documenting a list of the flaky tests and fixing them, that's all. To organize the effort I suggested tackling this by module. Your second sentence is what I was trying to gauge from the community before putting any more effort into this.

Re: Spark Improvement Proposals

2017-02-16 Thread Cody Koeninger
Reynold, thanks, LGTM. Sean, great concerns. I agree that behavior is largely cultural and writing down a process won't necessarily solve any problems one way or the other. But one outwardly visible change I'm hoping for out of this is a way for people who have a stake in Spark, but can't follow

Re: File JIRAs for all flaky test failures

2017-02-16 Thread Sean Owen
I'm not sure what you're specifically suggesting. Of course flaky tests are bad and they should be fixed, and people do. Yes, some are pretty hard to fix because they are rarely reproducible if at all. If you want to fix, fix; there's nothing more to it. I don't perceive flaky tests to be a

Re: Spark Improvement Proposals

2017-02-16 Thread Sean Owen
The text seems fine to me. Really, this is not describing a fundamentally new process, which is good. We've always had JIRAs, we've always been able to call a VOTE for a big question. This just writes down a sensible set of guidelines for putting those two together when a major change is proposed.

Re: Spark Improvement Proposals

2017-02-16 Thread Ryan Blue
The current proposal seems process-heavy to me. That's not necessarily bad, but there are a couple areas I haven't seen discussed. Why is there a shepherd? If the person proposing a change has a good idea, I don't see why one is either a good idea or necessary. The result of this requirement is

Re: File JIRAs for all flaky test failures

2017-02-16 Thread Saikat Kanjilal
I'd just like to follow up again on this thread: should we devote some energy to fixing unit tests based on module? There wasn't much interest in this last time, but given the nature of this thread I'd be willing to deep dive into this again with some help.

Re: Spark Job Performance monitoring approaches

2017-02-16 Thread Saikat Kanjilal
There's also this: https://github.com/databricks/spark-perf (GitHub - databricks/spark-perf: Performance tests for Spark) Sweeps sets of

Re: File JIRAs for all flaky test failures

2017-02-16 Thread Reynold Xin
What exactly is the issue? I've been working on Spark dev for a long time and very rarely do I actually run into an issue that only manifests on Jenkins but not locally. I don't have some magic local setup either. We should definitely cut down test flakiness. On Thu, Feb 16, 2017 at 5:26 PM,

Re: Spark Improvement Proposals

2017-02-16 Thread Reynold Xin
Updated. Any feedback from other community members? On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger wrote: > Thanks for doing that. > > Given that there are at least 4 different Apache voting processes, > "typical Apache vote process" isn't meaningful to me. > > I think the