Re: RDD Location

2016-12-29 Thread Sun Rui
Maybe you can create your own subclass of RDD and override getPreferredLocations() to implement the logic for dynamically changing the locations. > On Dec 30, 2016, at 12:06, Fei Hu wrote: > > Dear all, > > Is there any way to change the host location for a certain

mllib metrics vs ml evaluators and how to improve apis for users

2016-12-29 Thread Ilya Matiach
Hi ML/MLLib developers, 1. I'm trying to add a weights column to ml spark evaluators (RegressionEvaluator, BinaryClassificationEvaluator, MulticlassClassificationEvaluator) that use mllib metrics and I have a few questions (JIRA 2.

RDD Location

2016-12-29 Thread Fei Hu
Dear all, Is there any way to change the host location for a certain partition of RDD? "protected def getPreferredLocations(split: Partition)" can be used to initialize the location, but how to change it after the initialization? Thanks, Fei

Re: shapeless in spark 2.1.0

2016-12-29 Thread Koert Kuipers
we also use spray for webservices that execute on spark, and spray depends on even older (and incompatible) shapeless 1.x. to get rid of the old shapeless i would have to upgrade from spray to akka-http, which means going to java 8. this might also affect spark-job-server, which it seems uses

RE: repeated unioning of dataframes take worse than O(N^2) time

2016-12-29 Thread assaf.mendelson
Hi, I understand that doing a union creates a nested structure, however why isn’t it O(N)? If I look at the code it seems this should be a tree merge of two plans, which should occur in O(N), not O(N^2). Even running the plan should be O(N*LOG(N)) rather than O(N^2) or worse. Assaf.

Re: [ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
Hello Jacek, Actually, Reynold is still the release manager and I am just sending this message for him :) Sorry. I should have made it clear in my original email. Thanks, Yin On Thu, Dec 29, 2016 at 10:58 AM, Jacek Laskowski wrote: > Hi Yan, > > I've been surprised the first

Re: [ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Jacek Laskowski
Hi Yan, I've been surprised the first time when I noticed rxin stepped back and a new release manager stepped in. Congrats on your first ANNOUNCE! I can only expect even more great stuff coming in to Spark from the dev team after Reynold spared some time. Can't wait to read the changes...

Re: shapeless in spark 2.1.0

2016-12-29 Thread Maciej Szymkiewicz
Breeze 0.13 (RC-1 right now) bumps shapeless to 2.2.0 and 2.2.5 for Scala 2.10 and 2.11 respectively: https://github.com/scalanlp/breeze/pull/509 On 12/29/2016 07:13 PM, Ryan Williams wrote: > > Other option would presumably be for someone to make a release of > breeze with old-shapeless

Re: shapeless in spark 2.1.0

2016-12-29 Thread Ryan Williams
Other option would presumably be for someone to make a release of breeze with old-shapeless shaded... unless shapeless classes are exposed in breeze's public API, in which case you'd have to copy the relevant shapeless classes into breeze and then publish that? On Thu, Dec 29, 2016, 1:05 PM Sean

Re: repeated unioning of dataframes take worse than O(N^2) time

2016-12-29 Thread Sean Owen
Don't do that. Union them all at once with SparkContext.union On Thu, Dec 29, 2016, 17:21 assaf.mendelson wrote: > Hi, > > > > I have been playing around with doing union between a large number of > dataframes and saw that the performance of the actual union (not the >

Re: shapeless in spark 2.1.0

2016-12-29 Thread Sean Owen
It is breeze, but, what's the option? It can't be excluded. I think this falls in the category of things an app would need to shade in this situation. On Thu, Dec 29, 2016, 16:49 Koert Kuipers wrote: > i just noticed that spark 2.1.0 bring in a new transitive dependency on >

Re: shapeless in spark 2.1.0

2016-12-29 Thread Ryan Williams
`mvn dependency:tree -Dverbose -Dincludes=:shapeless_2.11` shows:
[INFO] \- org.apache.spark:spark-mllib_2.11:jar:2.1.0:provided
[INFO]    \- org.scalanlp:breeze_2.11:jar:0.12:provided
[INFO]       \- com.chuusai:shapeless_2.11:jar:2.0.0:provided
On Thu, Dec 29, 2016 at 12:11 PM Herman van

Re: repeated unioning of dataframes take worse than O(N^2) time

2016-12-29 Thread Maciej Szymkiewicz
Iterative union like this creates a deeply nested recursive structure, in a manner similar to the one described here http://stackoverflow.com/q/34461804 You can try something like this http://stackoverflow.com/a/37612978 but there is of course an overhead of conversion between Dataset and RDD. On
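
To make the nesting concrete, here is a minimal toy sketch in plain Python. It does not use Spark: a "plan" is just a nested tuple, and the names tree_union, toy_union and depth are hypothetical helpers invented for this illustration. It compares the depth produced by unioning one item at a time with the depth produced by combining items pairwise in rounds. This is not the approach from the linked answers, only a way to see how deep the chained structure gets.

```python
from functools import reduce
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def tree_union(items: Sequence[T], union: Callable[[T, T], T]) -> T:
    """Combine items pairwise in rounds, so nesting depth grows like log2(N)."""
    if not items:
        raise ValueError("need at least one item")
    level = list(items)
    while len(level) > 1:
        # Union neighbours (0,1), (2,3), ...; an odd leftover is carried over.
        nxt = [union(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]


def toy_union(a, b):
    """Toy stand-in for a union node in a query plan (assumption, not Spark)."""
    return ("union", a, b)


def depth(plan) -> int:
    """Nesting depth of a toy plan; a leaf (a plain string) has depth 0."""
    if isinstance(plan, str):
        return 0
    return 1 + max(depth(plan[1]), depth(plan[2]))


leaves = [f"df{i}" for i in range(16)]
chained = reduce(toy_union, leaves)       # df0 union df1, then union df2, ...
balanced = tree_union(leaves, toy_union)  # pairwise rounds

print(depth(chained), depth(balanced))  # 15 4
```

With PySpark DataFrames the analogous call would presumably be tree_union(dfs, lambda a, b: a.union(b)); that is untested here, and whether it actually reduces planning time in Spark is an assumption rather than something this thread confirms.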

repeated unioning of dataframes take worse than O(N^2) time

2016-12-29 Thread assaf.mendelson
Hi, I have been playing around with doing union between a large number of dataframes and saw that the performance of the actual union (not the action) is worse than O(N^2). Since a union basically defines a lineage (i.e. the current plan plus a union with the other as a child) this should be almost

Re: shapeless in spark 2.1.0

2016-12-29 Thread Herman van Hövell tot Westerflier
Which dependency pulls in shapeless? On Thu, Dec 29, 2016 at 5:49 PM, Koert Kuipers wrote: > i just noticed that spark 2.1.0 bring in a new transitive dependency on > shapeless 2.0.0 > > shapeless is a popular library for scala users, and shapeless 2.0.0 is old > (2014) and

shapeless in spark 2.1.0

2016-12-29 Thread Koert Kuipers
i just noticed that spark 2.1.0 brings in a new transitive dependency on shapeless 2.0.0. shapeless is a popular library for scala users, and shapeless 2.0.0 is old (2014) and not compatible with more current versions. so this means a spark user that uses shapeless in his own development cannot

[ANNOUNCE] Announcing Apache Spark 2.1.0

2016-12-29 Thread Yin Huai
Hi all, Apache Spark 2.1.0 is the second release of the Spark 2.x line. This release makes significant strides in the production readiness of Structured Streaming, with added support for event time watermarks