Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-19 Thread Dongjoon Hyun
Thank you, Wenchen. I made a PR for the minor, document-only change: https://github.com/apache/spark/pull/22781 Bests, Dongjoon. On Fri, Oct 19, 2018 at 6:07 PM Wenchen Fan wrote: > AFAIK we haven't tested Java 9+ yet, so I'm ok to change it. > > Hi Dongjoon can you make a PR for it? We can merge

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-19 Thread Wenchen Fan
AFAIK we haven't tested Java 9+ yet, so I'm ok to change it. Hi Dongjoon can you make a PR for it? We can merge it very soon if we decide to do it. Thanks, Wenchen On Sat, Oct 20, 2018 at 5:27 AM Dongjoon Hyun wrote: > From the document, should we be more specific with 'Java 8' instead of >

Re: [MLlib] PCA Aggregator

2018-10-19 Thread Sean Owen
I think this is great info and context to put in the JIRA. On Fri, Oct 19, 2018, 6:53 PM Matt Saunders wrote: > Hi Sean, thanks for your feedback. I saw this as a missing feature in the > existing PCA implementation in MLlib. I suspect the use case is a common > one: you have data from

Re: [MLlib] PCA Aggregator

2018-10-19 Thread Matt Saunders
Hi Sean, thanks for your feedback. I saw this as a missing feature in the existing PCA implementation in MLlib. I suspect the use case is a common one: you have data from different entities (could be different users, different locations, or different products, for example) and you need to model
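The "PCA per group" idea above can be sketched without Spark at all. The following is a minimal, illustrative stand-in (plain Python, hypothetical function names) for what a Spark Aggregator would compute per group: for each entity key, center that entity's rows and extract the dominant principal component via power iteration on the covariance matrix.

```python
# Hypothetical sketch of per-group PCA, the use case described above.
# A real Spark solution would implement this inside an Aggregator/UDAF;
# here plain Python shows the per-group computation only.

def top_component(rows, iters=100):
    """Dominant eigenvector of the covariance of `rows` (list of vectors)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(x[i] * x[j] for x in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply and renormalize.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def pca_by_group(data):
    """data: list of (key, vector) pairs -> {key: top principal component}."""
    groups = {}
    for key, vec in data:
        groups.setdefault(key, []).append(vec)
    return {k: top_component(vs) for k, vs in groups.items()}
```

For points lying exactly along the direction (1, 1), the returned component is (1/√2, 1/√2), as expected.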

Re: [MLlib] PCA Aggregator

2018-10-19 Thread Sean Owen
It's OK to open a JIRA, though I generally doubt any new functionality will be added. This might be viewed as a small worthwhile enhancement; I haven't looked at it. It's always more compelling if you can sketch the use case for it and why it is more meaningful in Spark than outside it. There is

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-19 Thread Dongjoon Hyun
From the document, should we be more specific with 'Java 8' instead of 'Java 8+'? We don't build (or test) in the community with Java 9 ~ 11. https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/_site/building-spark.html > Building Spark using Maven requires Maven 3.3.9 or newer

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-19 Thread Ryan Blue
I think this is expected behavior, though not what I think is reasonable in the long term. To my knowledge, this is how the v1 sources behave, and v2 just reuses the same mechanism to instantiate sources and uses a new interface for v2 features. I think that the right approach is to use catalogs,
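The behavior described above can be illustrated schematically: the planner instantiates a fresh reader for each query, so any state cached on one reader instance never reaches the next one. This is not Spark's actual API, just a toy sketch of the instantiation pattern.

```python
# Toy illustration (hypothetical class, not Spark's DataSourceV2 API):
# the source is asked for a new reader per query plan, so per-reader
# state is discarded between queries.

class DemoSource:
    def __init__(self):
        self.readers_created = 0

    def create_reader(self):
        # Called once per query planning; returns a brand-new reader.
        self.readers_created += 1
        return {"id": self.readers_created, "cache": {}}

source = DemoSource()
r1 = source.create_reader()   # first query
r2 = source.create_reader()   # second query: r1's "cache" is gone
```

Anything written into `r1["cache"]` is invisible to `r2`, which is why state needs to live somewhere longer-lived, such as a catalog.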

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-19 Thread Koert Kuipers
I deployed 2.4.0 RC3 on our dev cluster and ran into an issue with spark shell and jline. There is probably a simple workaround, so this is not a serious issue, but I just wanted to let you know. https://issues.apache.org/jira/browse/SPARK-25783 On Mon, Oct 15, 2018 at 4:59 PM Imran Rashid wrote: > I

Re: [MLlib] PCA Aggregator

2018-10-19 Thread Matt Saunders
Thanks, Erik. I went ahead and created SPARK-25782 for this improvement, since it is a feature I and others have looked for in MLlib but it doesn't seem to exist yet. Also, while searching for PCA-related issues in JIRA I noticed that someone added grouping support for PCA to the MADlib project a

Re: [MLlib] PCA Aggregator

2018-10-19 Thread Erik Erlandson
For 3rd-party libs, I have been publishing independently, for example at isarn-sketches-spark or silex: https://github.com/isarn/isarn-sketches-spark https://github.com/radanalyticsio/silex Either of these repos provides some good working examples of publishing a Spark UDAF or ML library for JVM

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Liu
Hi Peter, Thanks for the additional information - this is really helpful (I definitely got more than I was looking for :-) Cheers, Peter On Fri, Oct 19, 2018 at 12:53 PM Peter Rudenko wrote: > Hi Peter, we're using a part of Crail - it's core library, called disni ( >

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Rudenko
Hi Peter, we're using a part of Crail - its core library, called disni ( https://github.com/zrlio/disni/). We couldn't reproduce the results from that blog post; in any case, Crail is a more platform-like approach (it comes with its own file system), while SparkRdma is a pluggable approach - it's just a

Re: [MLlib] PCA Aggregator

2018-10-19 Thread Stephen Boesch
Erik - is there a current location for approved/recommended third-party additions? The spark-packages site has been stale for years, it seems. On Fri., Oct. 19, 2018 at 07:06 Erik Erlandson < eerla...@redhat.com> wrote: > Hi Matt! > > There are a couple ways to do this. If you want to submit it

Re: [MLlib] PCA Aggregator

2018-10-19 Thread Erik Erlandson
Hi Matt! There are a couple ways to do this. If you want to submit it for inclusion in Spark, you should start by filing a JIRA for it, and then a pull request. Another possibility is to publish it as your own 3rd party library, which I have done for aggregators before. On Wed, Oct 17, 2018

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Liu
Hi Peter, thank you for the reply and detailed information! Would this be something comparable to Crail? ( http://crail.incubator.apache.org/blog/2017/11/rdmashuffle.html) I was looking more for something simple/quick to make the shuffle between the local JVMs quicker (like the idea of using local

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Rudenko
Hey Peter, in the SparkRDMA shuffle plugin ( https://github.com/Mellanox/SparkRDMA) we're using mmap of the shuffle file to do Remote Direct Memory Access. If the shuffle data is bigger than RAM, Mellanox NICs support On-Demand Paging, where the OS invalidates translations which are no longer valid due to
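The core technique mentioned here, mapping a shuffle file into memory so reads are served from page-cache pages rather than explicit copies into a user buffer, can be sketched with the standard `mmap` facility. This is a minimal, hypothetical example (the file name and contents are made up); SparkRDMA registers such mappings with the NIC for RDMA, which this sketch does not attempt.

```python
# Minimal sketch of mmap-ing a shuffle-style file, assuming a throwaway
# temp file. Illustrative only; no RDMA registration happens here.
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "shuffle_0_0.data")
with open(path, "wb") as f:
    # Pretend this is serialized shuffle output for one map task.
    f.write(b"partition-0-bytes" * 1024)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Reads are served from the mapped pages; slicing copies out bytes.
    header = mm[:17]
    mm.close()
```

On Linux the same mapped region could then be registered with the RDMA stack, which is where On-Demand Paging matters: pages can be evicted and faulted back in without invalidating the whole registration.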