[jira] [Updated] (SPARK-18812) Clarify "Spark ML"
[ https://issues.apache.org/jira/browse/SPARK-18812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-18812:
--
    Target Version/s: 2.1.0 (was: 2.1.1, 2.2.0)

> Clarify "Spark ML"
> --
>
> Key: SPARK-18812
> URL: https://issues.apache.org/jira/browse/SPARK-18812
> Project: Spark
> Issue Type: Documentation
> Components: ML, MLlib
> Affects Versions: 2.1.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Fix For: 2.1.0
>
>
> It is useful to add an FAQ entry to explain "Spark ML" and reduce confusion.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects
[ https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-17822:
--
    Target Version/s: 2.1.0, 2.0.3 (was: 2.0.3, 2.1.1, 2.2.0)

> JVMObjectTracker.objMap may leak JVM objects
>
> Key: SPARK-17822
> URL: https://issues.apache.org/jira/browse/SPARK-17822
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Reporter: Yin Huai
> Assignee: Xiangrui Meng
> Fix For: 2.0.3, 2.1.0
>
> Attachments: screenshot-1.png
>
>
> JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we
> observed that JVM objects that are no longer used are still trapped in this
> map, which prevents them from being garbage collected.
> It seems to make sense to use weak references (like persistentRdds in
> SparkContext).
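The weak-reference idea suggested in the issue can be sketched as follows. This is a hypothetical standalone sketch, not SparkR's actual JVMObjectTracker: the class name and methods are invented for illustration. The key point is that values held through `WeakReference` no longer pin their referents, so an object reachable only through the tracker can still be collected.

```scala
import java.lang.ref.WeakReference
import scala.collection.mutable

// Hypothetical sketch (not SparkR's actual JVMObjectTracker): values are
// held through WeakReference, so an object that is reachable only via
// this map can still be garbage collected.
class WeakObjectTracker {
  private val objMap = mutable.HashMap.empty[String, WeakReference[AnyRef]]

  def put(id: String, obj: AnyRef): Unit =
    objMap(id) = new WeakReference[AnyRef](obj)

  // Returns None if the entry is absent or its referent was collected.
  def get(id: String): Option[AnyRef] =
    objMap.get(id).flatMap(ref => Option(ref.get()))

  // Drop entries whose referents have already been collected.
  def prune(): Unit =
    objMap.filterInPlace((_, ref) => ref.get() != null)
}
```

A production version would also need a `ReferenceQueue` or periodic `prune()` calls so that stale map entries themselves do not accumulate.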
[jira] [Updated] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-18924:
--
    Target Version/s: (was: 2.2.0)

> Improve collect/createDataFrame performance in SparkR
>
> Key: SPARK-18924
> URL: https://issues.apache.org/jira/browse/SPARK-18924
> Project: Spark
> Issue Type: Improvement
> Components: SparkR
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Critical
>
> SparkR has its own SerDe for data serialization between the JVM and R.
> The SerDe on the JVM side is implemented in:
> * [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
> * [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]
> The SerDe on the R side is implemented in:
> * [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
> * [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]
> Serialization between the JVM and R suffers from large storage and
> computation overhead. For example, a short round trip of 1 million doubles
> surprisingly took 3 minutes on my laptop:
> {code}
> > system.time(collect(createDataFrame(data.frame(x=runif(1e6)))))
>    user  system elapsed
>  14.224   0.582 189.135
> {code}
> Collecting a medium-sized DataFrame locally and continuing with a local R
> workflow is a use case we should pay attention to. SparkR will never be able
> to cover all existing features from CRAN packages, and it is unnecessary for
> Spark to do so because not all features need scalability.
> Several factors contribute to the serialization overhead:
> 1. The SerDe on the R side is implemented using high-level R methods.
> 2. DataFrame columns are not serialized efficiently, primitive-type columns
> in particular.
> 3. There is some overhead in the serialization protocol/implementation.
> 1) might have been discussed before, because R packages like rJava existed
> before SparkR. I'm not sure whether we would have a license issue in
> depending on those libraries. Another option is to switch to the low-level
> R C interface or Rcpp, which again might have license issues. I'm not an
> expert here. If we have to implement our own, there is still much room for
> improvement, discussed below.
> 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`,
> which collects rows locally and then constructs columns. However,
> * it ignores column types, which results in boxing/unboxing overhead
> * it collects all objects to the driver, which results in high GC pressure
> A relatively simple change is to implement a specialized column builder
> based on column types, primitive types in particular. We need to handle
> null/NA values properly. A simple data structure we can use is
> {code}
> val size: Int
> val nullIndexes: Array[Int]
> val notNullValues: Array[T] // specialized for primitive types
> {code}
> On the R side, we can use `readBin` and `writeBin` to read the entire vector
> in a single method call. The speed seems reasonable (on the order of GB/s):
> {code}
> > x <- runif(1e7) # 1e7, not 1e6
> > system.time(r <- writeBin(x, raw(0)))
>    user  system elapsed
>   0.036   0.021   0.059
>
> > system.time(y <- readBin(r, double(), 1e7))
>    user  system elapsed
>   0.015   0.007   0.024
> {code}
> This is just a proposal that needs to be discussed and formalized, but in
> general it should be feasible to obtain a 20x or greater performance gain.
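The null-tracking column layout proposed in the message can be sketched as a concrete Scala class. This is a hypothetical illustration (the class name `PrimitiveDoubleColumn` and its methods are invented, not Spark's actual implementation): non-null values stay in a dense unboxed array, and null/NA positions are recorded separately instead of boxing every cell.

```scala
// Hypothetical sketch of the proposed specialized column layout:
// primitive values are kept unboxed, and null/NA positions are
// recorded separately instead of boxing every cell.
case class PrimitiveDoubleColumn(
    size: Int,
    nullIndexes: Array[Int],     // positions that hold null/NA
    notNullValues: Array[Double] // dense, unboxed non-null values
) {
  require(nullIndexes.length + notNullValues.length == size)

  // Expand back to a boxed representation, e.g. for row assembly.
  def toBoxed: Array[java.lang.Double] = {
    val out = new Array[java.lang.Double](size)
    val nulls = nullIndexes.toSet
    var src = 0 // next index into notNullValues
    var i = 0
    while (i < size) {
      if (!nulls.contains(i)) { out(i) = notNullValues(src); src += 1 }
      i += 1
    }
    out
  }
}
```

The dense `Array[Double]` is exactly what R's `readBin`/`writeBin` can consume in one call, which is where the bulk of the proposed speedup would come from.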
[jira] [Commented] (SPARK-19337) Documentation and examples for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-19337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870512#comment-15870512 ]

Joseph K. Bradley commented on SPARK-19337:
--
@yuhao yang Do you have time to work on this? Thanks!

> Documentation and examples for LinearSVC
>
> Key: SPARK-19337
> URL: https://issues.apache.org/jira/browse/SPARK-19337
> Project: Spark
> Issue Type: Documentation
> Components: Documentation, ML
> Reporter: Joseph K. Bradley
>
> User guide + example code for LinearSVC
[jira] [Commented] (SPARK-14523) Feature parity for Statistics ML with MLlib
[ https://issues.apache.org/jira/browse/SPARK-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861789#comment-15861789 ]

Joseph K. Bradley commented on SPARK-14523:
--
I'd like to keep this open until we have linked tasks for the missing functionality.
[~hujiayin] This is for parity w.r.t. the RDD-based API, not for adding new functionality to MLlib. I think there's already a JIRA for ARIMA somewhere.

> Feature parity for Statistics ML with MLlib
>
> Key: SPARK-14523
> URL: https://issues.apache.org/jira/browse/SPARK-14523
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: yuhao yang
>
> Some statistics functions have been supported by DataFrame directly. Use this
> JIRA to discuss/design the statistics package in spark.ml and its function
> scope. Hypothesis tests and correlation computation may still need to expose
> independent interfaces.
[jira] [Reopened] (SPARK-14523) Feature parity for Statistics ML with MLlib
[ https://issues.apache.org/jira/browse/SPARK-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley reopened SPARK-14523:
--

> Feature parity for Statistics ML with MLlib
>
> Key: SPARK-14523
> URL: https://issues.apache.org/jira/browse/SPARK-14523
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: yuhao yang
>
> Some statistics functions have been supported by DataFrame directly. Use this
> JIRA to discuss/design the statistics package in spark.ml and its function
> scope. Hypothesis tests and correlation computation may still need to expose
> independent interfaces.
[jira] [Resolved] (SPARK-18613) spark.ml LDA classes should not expose spark.mllib in APIs
[ https://issues.apache.org/jira/browse/SPARK-18613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley resolved SPARK-18613.
--
    Resolution: Fixed
    Fix Version/s: 2.2.0

Issue resolved by pull request 16860
[https://github.com/apache/spark/pull/16860]

> spark.ml LDA classes should not expose spark.mllib in APIs
> --
>
> Key: SPARK-18613
> URL: https://issues.apache.org/jira/browse/SPARK-18613
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Assignee: Sue Ann Hong
> Priority: Critical
> Fix For: 2.2.0
>
>
> spark.ml.LDAModel exposes dependencies on spark.mllib in 2 methods, but it
> should not:
> * {{def oldLocalModel: OldLocalLDAModel}}
> * {{def getModel: OldLDAModel}}
> This task is to deprecate those methods. I recommend creating {{private[ml]}}
> versions of the methods which are used internally, in order to avoid
> deprecation warnings.
> Setting target for 2.2, but I'm OK with getting it into 2.1 if we have time.
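The deprecate-and-keep-an-internal-copy pattern recommended in the issue can be sketched as follows. This is a hypothetical illustration (the class and member names are invented, not the actual LDAModel code): the public accessor carries `@deprecated`, while internal callers go through a separate non-deprecated method. In Spark the internal method would be `private[ml]`; here a plain `protected` member stands in for it, since this sketch has no enclosing `ml` package.

```scala
// Hypothetical sketch of the deprecation pattern described above:
// deprecate the public accessor, and route internal callers through a
// separate method so they trigger no deprecation warnings.
class LDAModelSketch(oldModelRepr: String) {
  @deprecated("Do not expose spark.mllib types; will be removed.", "2.2.0")
  def oldLocalModel: String = internalOldLocalModel

  // Internal code calls this and sees no deprecation warning.
  // In Spark this would be private[ml] rather than protected.
  protected def internalOldLocalModel: String = oldModelRepr

  def describe: String = s"wraps: $internalOldLocalModel"
}
```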
[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861714#comment-15861714 ]

Joseph K. Bradley commented on SPARK-9478:
--
[~sethah] Thanks for researching this! +1 for not using weights during bagging and using importance weights to compensate. Intuitively, that seems like it should give better estimators for class conditional probabilities than the other option.

If you're splitting this into trees and forests, could you please target your PR against a subtask for trees?

> Add sample weights to Random Forest
> --
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 1.4.1
> Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class
> weights. Class weights are important when there is imbalanced training data
> or the evaluation metric of a classifier is imbalanced (e.g. true positive
> rate at some false positive threshold).
[jira] [Commented] (SPARK-10802) Let ALS recommend for subset of data
[ https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860110#comment-15860110 ]

Joseph K. Bradley commented on SPARK-10802:
--
Linking a related issue for feature parity in the DataFrame-based API.

> Let ALS recommend for subset of data
>
> Key: SPARK-10802
> URL: https://issues.apache.org/jira/browse/SPARK-10802
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.5.0
> Reporter: Tomasz Bartczak
> Priority: Minor
>
> Currently MatrixFactorizationModel allows getting recommendations for
> - a single user
> - a single product
> - all users
> - all products
> Recommendations for all users/products do a cartesian join internally.
> It would be useful in some cases to get recommendations for a subset of
> users/products by providing an RDD with which MatrixFactorizationModel could
> do an intersection before the cartesian join. This would make things much
> faster in situations where recommendations are needed only for a subset of
> users/products, but the subset is still too large to make it feasible to
> recommend one-by-one.
[jira] [Created] (SPARK-19535) ALSModel recommendAll analogs
Joseph K. Bradley created SPARK-19535:
--
Summary: ALSModel recommendAll analogs
Key: SPARK-19535
URL: https://issues.apache.org/jira/browse/SPARK-19535
Project: Spark
Issue Type: New Feature
Components: ML
Affects Versions: 2.2.0
Reporter: Joseph K. Bradley
Assignee: Sue Ann Hong

Add methods analogous to the spark.mllib MatrixFactorizationModel methods recommendProductsForUsers/UsersForProducts. The initial implementation should be very simple, using DataFrame joins. Future work can add optimizations.

I recommend naming them:
* recommendForAllUsers
* recommendForAllItems
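The join-then-top-K approach proposed above can be illustrated with a toy sketch. This uses plain Scala collections rather than the actual ALSModel API (the function name and `Map`-based factor "tables" are stand-ins for the DataFrames a real implementation would join): score every user/item pair from the factor tables, then keep the top k items per user.

```scala
// Toy sketch of recommendForAllUsers via a join of factor tables.
// userFactors/itemFactors stand in for the factor DataFrames in ALSModel;
// the real implementation would use DataFrame joins, as proposed.
def recommendForAllUsers(
    userFactors: Map[Int, Array[Double]],
    itemFactors: Map[Int, Array[Double]],
    k: Int): Map[Int, Seq[(Int, Double)]] = {
  // Predicted rating in ALS is the dot product of the two factor vectors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  userFactors.map { case (user, uf) =>
    // "Join": score this user against every item, then keep the top k.
    val scored = itemFactors.toSeq.map { case (item, itf) => (item, dot(uf, itf)) }
    user -> scored.sortBy(-_._2).take(k)
  }
}
```

The simple version scores all pairs, which is exactly the cartesian cost the issue accepts for the initial implementation; the "future optimizations" would avoid materializing every pair.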
[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860108#comment-15860108 ]

Joseph K. Bradley commented on SPARK-13857:
--
Hi all, I'm catching up on these ALS discussions now. The work to support evaluation and tuning for recommendation is great, but I'm worried about it not being resolved in time for 2.2. I've heard a lot of requests for the plain functionality available in spark.mllib's recommendUsers/Products, so I'd recommend we just add those methods for now as a short-term solution. Let's keep working on the evaluation/tuning plans too. I'll create a JIRA for adding basic recommendUsers/Products methods.

> Feature parity for ALS ML with MLLIB
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Nick Pentreath
> Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods
> {{recommendProducts/recommendUsers}} for recommending the top K to a given
> user/item, as well as {{recommendProductsForUsers/recommendUsersForProducts}}
> to recommend the top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML.
> Investigate whether efficiency can be improved at the same time (see SPARK-11968).
[jira] [Closed] (SPARK-10141) Number of tasks on executors still become negative after failures
[ https://issues.apache.org/jira/browse/SPARK-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-10141. - Resolution: Done > Number of tasks on executors still become negative after failures > - > > Key: SPARK-10141 > URL: https://issues.apache.org/jira/browse/SPARK-10141 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Joseph K. Bradley >Priority: Minor > Attachments: Screen Shot 2015-08-20 at 3.14.49 PM.png > > > I hit this failure when running LDA on EC2 (after I made the model size > really big). > I was using the LDAExample.scala code on an EC2 cluster with 16 workers > (r3.2xlarge), on a Wikipedia dataset: > {code} > Training set size (documents) 4534059 > Vocabulary size (terms) 1 > Training set size (tokens)895575317 > EM optimizer > 1K topics > {code} > Failure message: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in > stage 22.0 failed 4 times, most recent failure: Lost task 55.3 in stage 22.0 > (TID 2881, 10.0.202.128): java.io.IOException: Failed to connect to > /10.0.202.128:54740 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.net.ConnectException: Connection refused: /10.0.202.128:54740 > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) > at > io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) > at > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > ... 
1 more > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1267) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1255) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1254) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1254) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:684) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1480) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1431) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:554) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1805) > at org.apache.spark.SparkContext.r
[jira] [Assigned] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN
[ https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-17975: - Assignee: Tathagata Das > EMLDAOptimizer fails with ClassCastException on YARN > > > Key: SPARK-17975 > URL: https://issues.apache.org/jira/browse/SPARK-17975 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.1 > Environment: Centos 6, CDH 5.7, Java 1.7u80 >Reporter: Jeff Stein >Assignee: Tathagata Das > Fix For: 2.0.3, 2.1.1, 2.2.0 > > Attachments: docs.txt > > > I'm able to reproduce the error consistently with a 2000 record text file > with each record having 1-5 terms and checkpointing enabled. It looks like > the problem was introduced with the resolution for SPARK-13355. > The EdgeRDD class seems to be lying about it's type in a way that causes > RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an > RDD of Edge elements. > {code} > val spark = SparkSession.builder.appName("lda").getOrCreate() > spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") > val data: RDD[(Long, Vector)] = // snip > data.setName("data").cache() > val lda = new LDA > val optimizer = new EMLDAOptimizer > lda.setOptimizer(optimizer) > .setK(10) > .setMaxIterations(400) > .setAlpha(-1) > .setBeta(-1) > .setCheckpointInterval(7) > val ldaModel = lda.run(data) > {code} > {noformat} > 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID > 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be > cast to org.apache.spark.graphx.Edge > at > org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107) > at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at 
org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:722) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.a
[jira] [Resolved] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN
[ https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-17975. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 2.0.3 > EMLDAOptimizer fails with ClassCastException on YARN > > > Key: SPARK-17975 > URL: https://issues.apache.org/jira/browse/SPARK-17975 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.1 > Environment: Centos 6, CDH 5.7, Java 1.7u80 >Reporter: Jeff Stein > Fix For: 2.0.3, 2.1.1, 2.2.0 > > Attachments: docs.txt > > > I'm able to reproduce the error consistently with a 2000 record text file > with each record having 1-5 terms and checkpointing enabled. It looks like > the problem was introduced with the resolution for SPARK-13355. > The EdgeRDD class seems to be lying about it's type in a way that causes > RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an > RDD of Edge elements. > {code} > val spark = SparkSession.builder.appName("lda").getOrCreate() > spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") > val data: RDD[(Long, Vector)] = // snip > data.setName("data").cache() > val lda = new LDA > val optimizer = new EMLDAOptimizer > lda.setOptimizer(optimizer) > .setK(10) > .setMaxIterations(400) > .setAlpha(-1) > .setBeta(-1) > .setCheckpointInterval(7) > val ldaModel = lda.run(data) > {code} > {noformat} > 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID > 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be > cast to org.apache.spark.graphx.Edge > at > org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107) > at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at 
org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:722) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-ma
[jira] [Updated] (SPARK-14804) Graph vertexRDD/EdgeRDD checkpoint results ClassCastException:
[ https://issues.apache.org/jira/browse/SPARK-14804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-14804:
--
    Fix Version/s: (was: 3.0.0)
                   2.2.0

> Graph vertexRDD/EdgeRDD checkpoint results in ClassCastException
> --
>
> Key: SPARK-14804
> URL: https://issues.apache.org/jira/browse/SPARK-14804
> Project: Spark
> Issue Type: Bug
> Components: GraphX
> Affects Versions: 1.6.1
> Reporter: SuYan
> Assignee: Tathagata Das
> Priority: Minor
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> {code}
> graph3.vertices.checkpoint()
> graph3.vertices.count()
> graph3.vertices.map(_._2).count()
> {code}
> 16/04/21 21:04:43 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0
> (TID 13, localhost): java.lang.ClassCastException:
> org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast to
> scala.Tuple2
> at com.xiaomi.infra.codelab.spark.Graph2$$anonfun$main$1.apply(Graph2.scala:80)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597)
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
> at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:91)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:219)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Look at the code:
> {code}
> private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] = {
>   if (isCheckpointedAndMaterialized) {
>     firstParent[T].iterator(split, context)
>   } else {
>     compute(split, context)
>   }
> }
>
> private[spark] def isCheckpointedAndMaterialized: Boolean = isCheckpointed
>
> override def isCheckpointed: Boolean = {
>   firstParent[(PartitionID, EdgePartition[ED, VD])].isCheckpointed
> }
> {code}
> For a VertexRDD or EdgeRDD, the first parent is its partition RDD:
> RDD[ShippableVertexPartition[VD]] / RDD[(PartitionID, EdgePartition[ED, VD])].
> 1. When we call vertexRDD.checkpoint, its partitionRDD is checkpointed, so
> VertexRDD.isCheckpointedAndMaterialized = true.
> 2. Then we call vertexRDD.iterator; because isCheckpointed = true, it calls
> firstParent.iterator (which is not a CheckpointRDD; it is actually the
> partitionRDD).
> So the returned iterator is an Iterator[ShippableVertexPartition], not the
> expected Iterator[(VertexId, VD)].
[jira] [Commented] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN
[ https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859984#comment-15859984 ] Joseph K. Bradley commented on SPARK-17975: --- Will do, thanks! > EMLDAOptimizer fails with ClassCastException on YARN > > > Key: SPARK-17975 > URL: https://issues.apache.org/jira/browse/SPARK-17975 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.1 > Environment: Centos 6, CDH 5.7, Java 1.7u80 >Reporter: Jeff Stein > Attachments: docs.txt > > > I'm able to reproduce the error consistently with a 2000 record text file > with each record having 1-5 terms and checkpointing enabled. It looks like > the problem was introduced with the resolution for SPARK-13355. > The EdgeRDD class seems to be lying about it's type in a way that causes > RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an > RDD of Edge elements. > {code} > val spark = SparkSession.builder.appName("lda").getOrCreate() > spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") > val data: RDD[(Long, Vector)] = // snip > data.setName("data").cache() > val lda = new LDA > val optimizer = new EMLDAOptimizer > lda.setOptimizer(optimizer) > .setK(10) > .setMaxIterations(400) > .setAlpha(-1) > .setBeta(-1) > .setCheckpointInterval(7) > val ldaModel = lda.run(data) > {code} > {noformat} > 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID > 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be > cast to org.apache.spark.graphx.Edge > at > org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107) > at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at 
org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:722) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
[jira] [Commented] (SPARK-10141) Number of tasks on executors still become negative after failures
[ https://issues.apache.org/jira/browse/SPARK-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859982#comment-15859982 ] Joseph K. Bradley commented on SPARK-10141: --- I'll close this if no one has seen it in Spark 2.0 or 2.1. Thanks all! > Number of tasks on executors still become negative after failures > - > > Key: SPARK-10141 > URL: https://issues.apache.org/jira/browse/SPARK-10141 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Joseph K. Bradley >Priority: Minor > Attachments: Screen Shot 2015-08-20 at 3.14.49 PM.png > > > I hit this failure when running LDA on EC2 (after I made the model size > really big). > I was using the LDAExample.scala code on an EC2 cluster with 16 workers > (r3.2xlarge), on a Wikipedia dataset: > {code} > Training set size (documents) 4534059 > Vocabulary size (terms) 1 > Training set size (tokens)895575317 > EM optimizer > 1K topics > {code} > Failure message: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in > stage 22.0 failed 4 times, most recent failure: Lost task 55.3 in stage 22.0 > (TID 2881, 10.0.202.128): java.io.IOException: Failed to connect to > /10.0.202.128:54740 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at 
java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.net.ConnectException: Connection refused: /10.0.202.128:54740 > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) > at > io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) > at > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > ... 
1 more > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1267) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1255) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1254) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1254) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:684) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1480) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1431) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:554) >
[jira] [Assigned] (SPARK-18613) spark.ml LDA classes should not expose spark.mllib in APIs
[ https://issues.apache.org/jira/browse/SPARK-18613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-18613: - Assignee: Sue Ann Hong > spark.ml LDA classes should not expose spark.mllib in APIs > -- > > Key: SPARK-18613 > URL: https://issues.apache.org/jira/browse/SPARK-18613 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Sue Ann Hong >Priority: Critical > > spark.ml.LDAModel exposes dependencies on spark.mllib in 2 methods, but it > should not: > * {{def oldLocalModel: OldLocalLDAModel}} > * {{def getModel: OldLDAModel}} > This task is to deprecate those methods. I recommend creating > {{private[ml]}} versions of the methods which are used internally in order to > avoid deprecation warnings. > Setting target for 2.2, but I'm OK with getting it into 2.1 if we have time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858373#comment-15858373 ] Joseph K. Bradley commented on SPARK-17139: --- @sethah Yep, that looks like what I had in mind. Thanks for writing it out clearly. What do you think? > Add model summary for MultinomialLogisticRegression > --- > > Key: SPARK-17139 > URL: https://issues.apache.org/jira/browse/SPARK-17139 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Add model summary to multinomial logistic regression using same interface as > in other ML models. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19400) GLM fails for intercept only model
[ https://issues.apache.org/jira/browse/SPARK-19400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-19400. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16740 [https://github.com/apache/spark/pull/16740] > GLM fails for intercept only model > -- > > Key: SPARK-19400 > URL: https://issues.apache.org/jira/browse/SPARK-19400 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Fix For: 2.2.0 > > > Intercept-only GLM fails for non-Gaussian family because of reducing an empty > array in IWLS. > {code} > val dataset = Seq( > (1.0, 1.0, 2.0, 0.0, 5.0), > (0.5, 2.0, 1.0, 1.0, 2.0), > (1.0, 3.0, 0.5, 2.0, 1.0), > (2.0, 4.0, 1.5, 3.0, 3.0) > ).toDF("y", "w", "off", "x1", "x2") > val formula = new RFormula().setFormula("y ~ 1") > val output = formula.fit(dataset).transform(dataset) > val glr = new GeneralizedLinearRegression().setFamily("poisson") > val model = glr.fit(output) > java.lang.UnsupportedOperationException: empty.reduceLeft > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19400) GLM fails for intercept only model
[ https://issues.apache.org/jira/browse/SPARK-19400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-19400: - Assignee: Wayne Zhang > GLM fails for intercept only model > -- > > Key: SPARK-19400 > URL: https://issues.apache.org/jira/browse/SPARK-19400 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Fix For: 2.2.0 > > > Intercept-only GLM fails for non-Gaussian family because of reducing an empty > array in IWLS. > {code} > val dataset = Seq( > (1.0, 1.0, 2.0, 0.0, 5.0), > (0.5, 2.0, 1.0, 1.0, 2.0), > (1.0, 3.0, 0.5, 2.0, 1.0), > (2.0, 4.0, 1.5, 3.0, 3.0) > ).toDF("y", "w", "off", "x1", "x2") > val formula = new RFormula().setFormula("y ~ 1") > val output = formula.fit(dataset).transform(dataset) > val glr = new GeneralizedLinearRegression().setFamily("poisson") > val model = glr.fit(output) > java.lang.UnsupportedOperationException: empty.reduceLeft > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
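The {{empty.reduceLeft}} failure above is the generic hazard of reducing a collection that may be empty: an intercept-only model leaves no coefficient terms to combine. A minimal Python sketch of the guard pattern (the Scala analogue would be {{reduceOption(op).getOrElse(default)}}); the helper name is invented here, and this is illustrative only, not the actual IWLS fix merged for this ticket:

```python
from functools import reduce

def safe_reduce(values, op, default):
    """Reduce that tolerates an empty input instead of raising.

    Mirrors Scala's `seq.reduceOption(op).getOrElse(default)`; the real
    fix in Spark may differ -- this only illustrates the failure mode.
    """
    it = iter(values)
    try:
        acc = next(it)
    except StopIteration:
        return default
    return reduce(op, it, acc)

# An intercept-only model leaves nothing to combine:
coefficient_updates = []

# A plain reduce on an empty input raises, like empty.reduceLeft in Scala:
try:
    reduce(lambda a, b: a + b, coefficient_updates)
except TypeError as e:
    print("reduce failed:", e)

# The guarded version falls back to a neutral element instead:
print(safe_reduce(coefficient_updates, lambda a, b: a + b, 0.0))  # 0.0
```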
[jira] [Assigned] (SPARK-17629) Add local version of Word2Vec findSynonyms for spark.ml
[ https://issues.apache.org/jira/browse/SPARK-17629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-17629: - Shepherd: Joseph K. Bradley Assignee: Asher Krim Affects Version/s: 2.2.0 Target Version/s: 2.2.0 Component/s: ML Issue Type: New Feature (was: Question) Summary: Add local version of Word2Vec findSynonyms for spark.ml (was: Should ml Word2Vec findSynonyms match the mllib implementation?) > Add local version of Word2Vec findSynonyms for spark.ml > --- > > Key: SPARK-17629 > URL: https://issues.apache.org/jira/browse/SPARK-17629 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Asher Krim >Assignee: Asher Krim >Priority: Minor > > ml Word2Vec's findSynonyms methods depart from mllib in that they return > distributed results, rather than the results directly: > {code} > def findSynonyms(word: String, num: Int): DataFrame = { > val spark = SparkSession.builder().getOrCreate() > spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", > "similarity") > } > {code} > What was the reason for this decision? I would think that most users would > request a reasonably small number of results back, and want to use them > directly on the driver, similar to the _take_ method on dataframes. Returning > parallelized results creates a costly round trip for the data that doesn't > seem necessary. > The original PR: https://github.com/apache/spark/pull/7263 > [~MechCoder] - do you perhaps recall the reason? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
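For context, the driver-side behavior being requested can be sketched in a few lines: score cosine similarity against a local map of word vectors and return a small list directly, with no parallelized round trip. The function and toy data below (`find_synonyms_local`, `vectors`) are invented for illustration and are not Spark's implementation:

```python
import heapq
import math

def find_synonyms_local(word, num, word_vectors):
    """Return the `num` nearest words to `word` by cosine similarity.

    `word_vectors` is a plain dict of word -> list[float]; a local sketch
    of what a driver-side findSynonyms could return, not Spark's code.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    target = word_vectors[word]
    scored = ((w, cosine(target, v))
              for w, v in word_vectors.items() if w != word)
    # Results stay on the driver as a small list -- no round trip needed.
    return heapq.nlargest(num, scored, key=lambda t: t[1])

vectors = {
    "king":  [0.9, 0.1, 0.4],
    "queen": [0.8, 0.2, 0.5],
    "apple": [0.1, 0.9, 0.0],
}
print(find_synonyms_local("king", 1, vectors))  # [('queen', ...)]
```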
[jira] [Commented] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'
[ https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856918#comment-15856918 ] Joseph K. Bradley commented on SPARK-17498: --- Linking related issue for QuantileDiscretizer where we provide a handleInvalid option for putting invalid values in a special bucket. > StringIndexer.setHandleInvalid should have another option 'new' > --- > > Key: SPARK-17498 > URL: https://issues.apache.org/jira/browse/SPARK-17498 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Miroslav Balaz > > That will map unseen labels to maximum known label + 1; IndexToString would map > that back to "" or NA, if there is something like that in Spark. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'
[ https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-17498: -- Summary: StringIndexer.setHandleInvalid should have another option 'new' (was: StringIndexer.setHandleInvalid sohuld have another option 'new') > StringIndexer.setHandleInvalid should have another option 'new' > --- > > Key: SPARK-17498 > URL: https://issues.apache.org/jira/browse/SPARK-17498 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Miroslav Balaz > > That will map unseen labels to maximum known label + 1; IndexToString would map > that back to "" or NA, if there is something like that in Spark. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
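A toy sketch of the proposed behavior: unseen labels get index max-known + 1, and the inverse mapping returns a sentinel string. `ToyStringIndexer` and the sentinel value are hypothetical; the eventual Spark API may differ:

```python
from collections import Counter

class ToyStringIndexer:
    """Minimal sketch of the proposed handleInvalid='new' behavior.

    Labels seen during fit get indices 0..k-1 (most frequent first, as
    StringIndexer does); any unseen label maps to k. Illustrative only.
    """
    def fit(self, labels):
        ordered = [w for w, _ in Counter(labels).most_common()]
        self.label_to_index = {w: i for i, w in enumerate(ordered)}
        self.unseen_index = len(ordered)  # maximum known label + 1
        return self

    def transform(self, labels):
        return [self.label_to_index.get(w, self.unseen_index) for w in labels]

    def inverse_transform(self, indices):
        # IndexToString analogue: the unseen bucket maps back to a sentinel.
        by_index = {i: w for w, i in self.label_to_index.items()}
        return [by_index.get(i, "__unseen__") for i in indices]

idx = ToyStringIndexer().fit(["a", "b", "a", "c"])
print(idx.transform(["a", "c", "zzz"]))  # unseen "zzz" gets index 3
print(idx.inverse_transform([3]))        # ['__unseen__']
```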
[jira] [Comment Edited] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856612#comment-15856612 ] Joseph K. Bradley edited comment on SPARK-17139 at 2/7/17 7:25 PM: --- I'll offer a few thoughts first: * A "ClassificationSummary" could be the same as a "MulticlassClassificationSummary" because binary is a special type of multiclass. * Following the structure of abstractions for Prediction is reasonable. * Separating binary and multiclass is reasonable; the separation is more significant for evaluation than for the Prediction abstractions. * Abstract classes have been a pain in the case of Prediction abstractions, so I'd prefer we use traits. The 2 alternatives I see are: 1. BinaryClassificationSummary inherits from ClassificationSummary. No separate MulticlassClassificationSummary. 2. BinaryClassificationSummary and MulticlassClassificationSummary inherit from ClassificationSummary. Both alternatives are semantically reasonable. However, since ClassificationSummary = MulticlassClassificationSummary in terms of functionality, and since the Prediction abstractions combine binary and multiclass, I prefer option 1. If we go with option 1, then we need 4 concrete classes: * LogisticRegressionSummary * LogisticRegressionTrainingSummary * BinaryLogisticRegressionSummary * BinaryLogisticRegressionTrainingSummary We would definitely want binary summaries to inherit from their multiclass counterparts, and for training summaries to inherit from their general counterparts: * LogisticRegressionSummary * LogisticRegressionTrainingSummary: LogisticRegressionSummary * BinaryLogisticRegressionSummary: LogisticRegressionSummary * BinaryLogisticRegressionTrainingSummary: LogisticRegressionTrainingSummary, BinaryLogisticRegressionSummary Of course, this is a problem. But we could solve it by having all of these be traits, with concrete classes inheriting. 
I.e., {{LogisticRegressionModel.summary}} could return {{trait LogisticRegressionTrainingSummary}}, which could be of concrete type {{LogisticRegressionTrainingSummaryImpl}} (multiclass) or {{BinaryLogisticRegressionTrainingSummaryImpl}} (binary). I suspect MiMa will complain about this, but IIRC it's safe since all of these summaries have private constructors and can't be extended outside of Spark. Btw, we could introduce a set of abstractions matching the Prediction ones, but that should probably happen under a separate JIRA. What do you think? was (Author: josephkb): I'll offer a few thoughts first: * A "ClassificationSummary" could be the same as a "MulticlassClassificationSummary" because binary is a special type of multiclass. * Following the structure of abstractions for Prediction is reasonable. * Separating binary and multiclass is reasonable; the separation is more significant for evaluation than for the Prediction abstractions. * Abstract classes have been a pain in the case of Prediction abstractions, so I'd prefer we use traits. The 2 alternatives I see are: 1. BinaryClassificationSummary inherits from ClassificationSummary. No separate MulticlassClassificationSummary. 2. BinaryClassificationSummary and MulticlassClassificationSummary inherit from ClassificationSummary. Both alternatives are semantically reasonable. However, since ClassificationSummary = MulticlassClassificationSummary in terms of functionality, and since the Prediction abstractions combine binary and multiclass, I prefer option 1. 
If we go with option 1, then we need 4 concrete classes: * LogisticRegressionSummary * LogisticRegressionTrainingSummary * BinaryLogisticRegressionSummary * BinaryLogisticRegressionTrainingSummary We would definitely want binary summaries to inherit from their multiclass counterparts, and for training summaries to inherit from their general counterparts: * LogisticRegressionSummary * LogisticRegressionTrainingSummary: LogisticRegressionSummary * BinaryLogisticRegressionSummary: LogisticRegressionSummary * BinaryLogisticRegressionTrainingSummary: LogisticRegressionTrainingSummary, BinaryLogisticRegressionSummary Of course, this is a problem. But we could solve it by having all of these be traits, with concrete classes inheriting. I.e., {{LogisticRegressionModel.summary}} could return {{trait LogisticRegressionTrainingSummary}}, which could be of concrete type {{LogisticRegressionTrainingSummaryImpl}} (multiclass) or {{BinaryLogisticRegressionTrainingSummaryImpl}} (binary). I suspect MiMa will complain about this, but IIRC it's safe since all of these summaries have private constructors and can't be extended outside of Spark. What do you think? > Add model summary for MultinomialLogisticRegression > --- > > Key: SPARK-17139 > URL: https://issues.apache.org/jira/browse/SPARK-17139 >
[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856612#comment-15856612 ] Joseph K. Bradley commented on SPARK-17139: --- I'll offer a few thoughts first: * A "ClassificationSummary" could be the same as a "MulticlassClassificationSummary" because binary is a special type of multiclass. * Following the structure of abstractions for Prediction is reasonable. * Separating binary and multiclass is reasonable; the separation is more significant for evaluation than for the Prediction abstractions. * Abstract classes have been a pain in the case of Prediction abstractions, so I'd prefer we use traits. The 2 alternatives I see are: 1. BinaryClassificationSummary inherits from ClassificationSummary. No separate MulticlassClassificationSummary. 2. BinaryClassificationSummary and MulticlassClassificationSummary inherit from ClassificationSummary. Both alternatives are semantically reasonable. However, since ClassificationSummary = MulticlassClassificationSummary in terms of functionality, and since the Prediction abstractions combine binary and multiclass, I prefer option 1. If we go with option 1, then we need 4 concrete classes: * LogisticRegressionSummary * LogisticRegressionTrainingSummary * BinaryLogisticRegressionSummary * BinaryLogisticRegressionTrainingSummary We would definitely want binary summaries to inherit from their multiclass counterparts, and for training summaries to inherit from their general counterparts: * LogisticRegressionSummary * LogisticRegressionTrainingSummary: LogisticRegressionSummary * BinaryLogisticRegressionSummary: LogisticRegressionSummary * BinaryLogisticRegressionTrainingSummary: LogisticRegressionTrainingSummary, BinaryLogisticRegressionSummary Of course, this is a problem. But we could solve it by having all of these be traits, with concrete classes inheriting. 
I.e., {{LogisticRegressionModel.summary}} could return {{trait LogisticRegressionTrainingSummary}}, which could be of concrete type {{LogisticRegressionTrainingSummaryImpl}} (multiclass) or {{BinaryLogisticRegressionTrainingSummaryImpl}} (binary). I suspect MiMa will complain about this, but IIRC it's safe since all of these summaries have private constructors and can't be extended outside of Spark. What do you think? > Add model summary for MultinomialLogisticRegression > --- > > Key: SPARK-17139 > URL: https://issues.apache.org/jira/browse/SPARK-17139 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Add model summary to multinomial logistic regression using same interface as > in other ML models. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
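The inheritance diamond described in the comment can be sketched with Python classes standing in for the Scala traits. Names follow the comment; the concrete `Impl` class at the bottom is hypothetical:

```python
class LogisticRegressionSummary:
    """Multiclass summary (option 1: doubles as ClassificationSummary)."""

class LogisticRegressionTrainingSummary(LogisticRegressionSummary):
    """Adds training-only information such as the objective history."""

class BinaryLogisticRegressionSummary(LogisticRegressionSummary):
    """Adds binary-only metrics such as ROC."""

class BinaryLogisticRegressionTrainingSummary(
        LogisticRegressionTrainingSummary, BinaryLogisticRegressionSummary):
    """Tip of the diamond: both a training summary and a binary summary."""

# Concrete "Impl" classes would be the only instantiable types, so the
# public surface stays trait-like and constructors stay private to Spark:
class BinaryLogisticRegressionTrainingSummaryImpl(
        BinaryLogisticRegressionTrainingSummary):
    pass

s = BinaryLogisticRegressionTrainingSummaryImpl()
print(isinstance(s, LogisticRegressionTrainingSummary))  # True
print(isinstance(s, BinaryLogisticRegressionSummary))    # True
```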
[jira] [Updated] (SPARK-10817) ML abstraction umbrella
[ https://issues.apache.org/jira/browse/SPARK-10817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10817: -- Priority: Major (was: Critical) > ML abstraction umbrella > --- > > Key: SPARK-10817 > URL: https://issues.apache.org/jira/browse/SPARK-10817 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > This is an umbrella for discussing and creating ML abstractions. This was > originally handled under [SPARK-1856] and [SPARK-3702], under which we > created the Pipelines API and some Developer APIs for classification and > regression. > This umbrella is for future work, including: > * Stabilizing the classification and regression APIs > * Discussing traits vs. abstract classes for abstraction APIs > * Creating other abstractions not yet covered (clustering, multilabel > prediction, etc.) > Note that [SPARK-3702] still has useful discussion and design docs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries
[ https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19498: -- Description: Per the recent discussion on the dev list, this JIRA is for discussing how we can make MLlib DataFrame-based APIs more extensible, especially for the purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs (for custom Transformers, Estimators, etc.). * For people who have written such libraries, what issues have you run into? * What APIs are not public or extensible enough? Do they require changes before being made more public? * Are APIs for non-Scala languages such as Java and Python friendly or extensive enough? The easy answer is to make everything public, but that would be terrible of course in the long-term. Let's discuss what is needed and how we can present stable, sufficient, and easy-to-use APIs for 3rd-party developers. > Discussion: Making MLlib APIs extensible for 3rd party libraries > > > Key: SPARK-19498 > URL: https://issues.apache.org/jira/browse/SPARK-19498 > Project: Spark > Issue Type: Brainstorming > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Priority: Critical > > Per the recent discussion on the dev list, this JIRA is for discussing how we > can make MLlib DataFrame-based APIs more extensible, especially for the > purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs > (for custom Transformers, Estimators, etc.). > * For people who have written such libraries, what issues have you run into? > * What APIs are not public or extensible enough? Do they require changes > before being made more public? > * Are APIs for non-Scala languages such as Java and Python friendly or > extensive enough? > The easy answer is to make everything public, but that would be terrible of > course in the long-term. 
Let's discuss what is needed and how we can present > stable, sufficient, and easy-to-use APIs for 3rd-party developers. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries
Joseph K. Bradley created SPARK-19498: - Summary: Discussion: Making MLlib APIs extensible for 3rd party libraries Key: SPARK-19498 URL: https://issues.apache.org/jira/browse/SPARK-19498 Project: Spark Issue Type: Brainstorming Components: ML Affects Versions: 2.2.0 Reporter: Joseph K. Bradley Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7258) spark.ml API taking Graph instead of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-7258. Resolution: Won't Fix > spark.ml API taking Graph instead of DataFrame > -- > > Key: SPARK-7258 > URL: https://issues.apache.org/jira/browse/SPARK-7258 > Project: Spark > Issue Type: New Feature > Components: GraphX, ML >Reporter: Joseph K. Bradley > > It would be useful to have an API in ML Pipelines for working with Graphs, > not just DataFrames. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7258) spark.ml API taking Graph instead of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856318#comment-15856318 ] Joseph K. Bradley commented on SPARK-7258: -- This made more sense in the past...will close for now. > spark.ml API taking Graph instead of DataFrame > -- > > Key: SPARK-7258 > URL: https://issues.apache.org/jira/browse/SPARK-7258 > Project: Spark > Issue Type: New Feature > Components: GraphX, ML >Reporter: Joseph K. Bradley > > It would be useful to have an API in ML Pipelines for working with Graphs, > not just DataFrames. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19467) PySpark ML shouldn't use circular imports
[ https://issues.apache.org/jira/browse/SPARK-19467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-19467: - Assignee: Maciej Szymkiewicz > PySpark ML shouldn't use circular imports > - > > Key: SPARK-19467 > URL: https://issues.apache.org/jira/browse/SPARK-19467 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > Fix For: 2.2.0 > > > {{pyspark.ml}} and {{pyspark.ml.pipeline}} contain circular imports with the > [former > one|https://github.com/apache/spark/blob/39f328ba3519b01940a7d1cdee851ba4e75ef31f/python/pyspark/ml/__init__.py#L23]: > {code} > from pyspark.ml.pipeline import Pipeline, PipelineModel > {code} > and the [latter > one|https://github.com/apache/spark/blob/39f328ba3519b01940a7d1cdee851ba4e75ef31f/python/pyspark/ml/pipeline.py#L24]: > {code} > from pyspark.ml import Estimator, Model, Transformer > {code} > This is unnecessary and can cause failures when working with external tools. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19467) PySpark ML shouldn't use circular imports
[ https://issues.apache.org/jira/browse/SPARK-19467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-19467. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16814 [https://github.com/apache/spark/pull/16814] > PySpark ML shouldn't use circular imports > - > > Key: SPARK-19467 > URL: https://issues.apache.org/jira/browse/SPARK-19467 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > Fix For: 2.2.0 > > > {{pyspark.ml}} and {{pyspark.ml.pipeline}} contain circular imports with the > [former > one|https://github.com/apache/spark/blob/39f328ba3519b01940a7d1cdee851ba4e75ef31f/python/pyspark/ml/__init__.py#L23]: > {code} > from pyspark.ml.pipeline import Pipeline, PipelineModel > {code} > and the [latter > one|https://github.com/apache/spark/blob/39f328ba3519b01940a7d1cdee851ba4e75ef31f/python/pyspark/ml/pipeline.py#L24]: > {code} > from pyspark.ml import Estimator, Model, Transformer > {code} > This is unnecessary and can cause failures when working with external tools. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
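The fix pattern here, moving shared abstractions into a leaf module so that neither the package root nor `pipeline` imports the other, can be demonstrated with a throwaway package. All module and class names below are invented for the demo and are not pyspark's real layout:

```python
import os
import sys
import tempfile
import textwrap

# Build a disposable package on disk to show the restructuring.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "mlpkg")
os.makedirs(pkg)

def write(name, body):
    with open(os.path.join(pkg, name), "w") as f:
        f.write(textwrap.dedent(body))

# Leaf module: depends on nothing else inside the package.
write("base.py", """\
    class Estimator: pass
    class Transformer: pass
    """)

# pipeline.py pulls abstractions from the leaf, not from the package root
# (the cycle came from importing them back out of the package __init__).
write("pipeline.py", """\
    from mlpkg.base import Estimator

    class Pipeline(Estimator): pass
    """)

# The root can now re-export everything without forming a cycle.
write("__init__.py", """\
    from mlpkg.base import Estimator, Transformer
    from mlpkg.pipeline import Pipeline
    """)

sys.path.insert(0, root)
import mlpkg

print(issubclass(mlpkg.Pipeline, mlpkg.Estimator))  # True
```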
[jira] [Commented] (SPARK-15573) Backwards-compatible persistence for spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15855097#comment-15855097 ] Joseph K. Bradley commented on SPARK-15573: --- It's a good point that we can't make updates to older Spark releases for persistence. However, I doubt that we would backport many such fixes for non-bugs. The issue you reference is arguably a scalability limit, not a bug. Still, adding an internal ML persistence version is a good idea; I'd be OK with it. > Backwards-compatible persistence for spark.ml > - > > Key: SPARK-15573 > URL: https://issues.apache.org/jira/browse/SPARK-15573 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This JIRA is for imposing backwards-compatible persistence for the > DataFrames-based API for MLlib. I.e., we want to be able to load models > saved in previous versions of Spark. We will not require loading models > saved in later versions of Spark. > This requires: > * Putting unit tests in place to check loading models from previous versions > * Notifying all committers active on MLlib to be aware of this requirement in > the future > The unit tests could be written as in spark.mllib, where we essentially > copied and pasted the save() code every time it changed. This happens > rarely, so it should be acceptable, though other designs are fine. > Subtasks of this JIRA should cover checking and adding tests for existing > cases, such as KMeansModel (whose format changed between 1.6 and 2.0). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
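The internal ML persistence version suggested in this comment could look roughly like the following: a version field written with the metadata, with the loader dispatching on it so files from older releases keep loading. The JSON layout and key names are hypothetical, not Spark's actual format:

```python
import json

FORMAT_VERSION = 2  # bump whenever the on-disk layout changes

def save_model(centers):
    """Serialize model data with an explicit format version."""
    return json.dumps({"version": FORMAT_VERSION, "clusterCenters": centers})

def load_model(blob):
    """Dispatch on the stored version so old files keep loading."""
    data = json.loads(blob)
    version = data.get("version", 1)  # version-less files predate the field
    if version == 1:
        # Hypothetical older layout that stored centers under another key.
        return data["centers"]
    if version == 2:
        return data["clusterCenters"]
    raise ValueError(f"saved by a newer version than this loader: {version}")

old_blob = json.dumps({"centers": [[0.0, 1.0]]})  # pre-versioning file
new_blob = save_model([[0.0, 1.0], [2.0, 3.0]])
print(load_model(old_blob))  # [[0.0, 1.0]]
print(load_model(new_blob))  # [[0.0, 1.0], [2.0, 3.0]]
```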
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15855090#comment-15855090 ] Joseph K. Bradley commented on SPARK-19208: --- You're right that sharing intermediate results will be necessary. I'm happy with [~mlnick]'s VectorSummarizer API. I also think that, if we wanted to use the API I suggested above, the version returning a single struct col would work: {{df.select(VectorSummary.summary("features", "weights"))}}. The new column could be constructed from intermediate columns which would not show up in the final output. (Is this essentially the "private UDAF" [~podongfeng] is mentioning above?) I'm OK either way. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. 
> For example: > env: --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748,401 instances and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After the modification in the PR, the above example runs successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
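The memory argument can be seen in a pure-Python sketch: for MaxAbsScaler, a streaming aggregator only needs one running array (max |x| per feature), instead of the full set of arrays the summarizer maintains. This is an illustration of the idea, not the Scala implementation.

```python
def update_max_abs(state, vector):
    # Maintain only the single per-feature array MaxAbsScaler needs:
    # the running maximum of absolute values.
    if state is None:
        return [abs(v) for v in vector]
    return [max(s, abs(v)) for s, v in zip(state, vector)]

state = None
for row in [[1.0, -5.0, 2.0], [-3.0, 4.0, 0.5]]:
    state = update_max_abs(state, row)
print(state)  # [3.0, 5.0, 2.0]
```

With 29.9M features, each extra `Array[Double]` costs roughly 240 MB, so dropping the unused statistics is the difference between fitting in a 1 GB executor and an OOM.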
[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15855060#comment-15855060 ] Joseph K. Bradley commented on SPARK-12157: --- I don't know of any Python UDF perf tests. Ad hoc tests could suffice for now... > Support numpy types as return values of Python UDFs > --- > > Key: SPARK-12157 > URL: https://issues.apache.org/jira/browse/SPARK-12157 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.2 >Reporter: Justin Uang > > Currently, if I have a python UDF > {code} > import pyspark.sql.types as T > import pyspark.sql.functions as F > from pyspark.sql import Row > import numpy as np > argmax = F.udf(lambda x: np.argmax(x), T.IntegerType()) > df = sqlContext.createDataFrame([Row(array=[1,2,3])]) > df.select(argmax("array")).count() > {code} > I get an exception that is fairly opaque: > {code} > Caused by: net.razorvine.pickle.PickleException: expected zero arguments for > construction of ClassDict (for numpy.dtype) > at > net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:85) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403) > {code} > Numpy types like np.int and np.float64 should automatically be cast to the > proper dtypes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
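Until numpy return types are cast automatically, the usual workaround is to convert numpy scalars to builtin Python types before they cross the pickle boundary. The sketch below shows that pattern with a made-up wrapper name (`udf_returning_builtin`) and a pure-Python stand-in for `np.argmax`, so it runs without numpy; numpy scalars expose `.item()`, which returns a plain `int`/`float`.

```python
def udf_returning_builtin(fn):
    # Hypothetical wrapper: cast numpy scalar results to builtin types
    # before the serializer (pickle) sees them.
    def wrapped(*args):
        out = fn(*args)
        if hasattr(out, "item"):  # numpy scalars (np.int64, np.float64, ...)
            out = out.item()
        return out
    return wrapped

# Stand-in for np.argmax so this sketch runs without numpy installed.
argmax = udf_returning_builtin(lambda xs: max(range(len(xs)), key=xs.__getitem__))
print(argmax([1, 9, 3]))  # 1
```

In real PySpark code the equivalent one-liner is wrapping the UDF body, e.g. `lambda x: int(np.argmax(x))`, which avoids the opaque `ClassDict (for numpy.dtype)` pickling error quoted above.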
[jira] [Commented] (SPARK-16824) Add API docs for VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15855057#comment-15855057 ] Joseph K. Bradley commented on SPARK-16824: --- I think we didn't document it since the future of UDTs becoming public APIs was uncertain. VectorUDT is private in spark.ml. Still, adding docs for the public spark.mllib VectorUDT sounds good to me. > Add API docs for VectorUDT > -- > > Key: SPARK-16824 > URL: https://issues.apache.org/jira/browse/SPARK-16824 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following on the [discussion > here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153], > it appears that {{VectorUDT}} is missing documentation, at least in PySpark. > I'm not sure if this is intentional or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16824) Add API docs for VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-16824: -- Issue Type: Documentation (was: Improvement) > Add API docs for VectorUDT > -- > > Key: SPARK-16824 > URL: https://issues.apache.org/jira/browse/SPARK-16824 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following on the [discussion > here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153], > it appears that {{VectorUDT}} is missing documentation, at least in PySpark. > I'm not sure if this is intentional or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16824) Add API docs for VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-16824: -- Component/s: MLlib > Add API docs for VectorUDT > -- > > Key: SPARK-16824 > URL: https://issues.apache.org/jira/browse/SPARK-16824 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following on the [discussion > here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153], > it appears that {{VectorUDT}} is missing documentation, at least in PySpark. > I'm not sure if this is intentional or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19247) Improve ml word2vec save/load scalability
[ https://issues.apache.org/jira/browse/SPARK-19247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-19247. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16607 [https://github.com/apache/spark/pull/16607] > Improve ml word2vec save/load scalability > - > > Key: SPARK-19247 > URL: https://issues.apache.org/jira/browse/SPARK-19247 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Asher Krim >Assignee: Asher Krim > Fix For: 2.2.0 > > > ml word2vec models can be somewhat large (~4gb is not uncommon). The current > save implementation saves the model as a single large datum, which can cause > rpc issues and fail to save the model. > On the loading side, there are issues with loading this large datum as well. > This was already solved for mllib word2vec in > https://issues.apache.org/jira/browse/SPARK-11994, but the change was never > ported to the ml word2vec implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
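The SPARK-11994-style fix amounts to splitting the word-vector table into bounded-size chunks so that no single record or RPC payload holds the whole model. The heuristic below is illustrative (the function name and the 64 MiB target are invented), but it shows the shape of the calculation.

```python
def num_chunks(num_words, vector_size, bytes_per_value=4, target_bytes=64 << 20):
    # Ceiling division: how many chunks keep each chunk under target_bytes,
    # assuming float32 vectors (4 bytes per value).
    total = num_words * vector_size * bytes_per_value
    return max(1, (total + target_bytes - 1) // target_bytes)

# ~3M words x 300 dims x 4 bytes is ~3.6 GB: dozens of chunks, not one datum.
print(num_chunks(3_000_000, 300))
```

Loading then reassembles the chunks, so neither side ever materializes a single multi-gigabyte serialized blob.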
[jira] [Assigned] (SPARK-19247) Improve ml word2vec save/load scalability
[ https://issues.apache.org/jira/browse/SPARK-19247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-19247: - Assignee: Asher Krim > Improve ml word2vec save/load scalability > - > > Key: SPARK-19247 > URL: https://issues.apache.org/jira/browse/SPARK-19247 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Asher Krim >Assignee: Asher Krim > > ml word2vec models can be somewhat large (~4gb is not uncommon). The current > save implementation saves the model as a single large datum, which can cause > rpc issues and fail to save the model. > On the loading side, there are issues with loading this large datum as well. > This was already solved for mllib word2vec in > https://issues.apache.org/jira/browse/SPARK-11994, but the change was never > ported to the ml word2vec implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19247) Improve ml word2vec save/load scalability
[ https://issues.apache.org/jira/browse/SPARK-19247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19247: -- Shepherd: Joseph K. Bradley Affects Version/s: 2.2.0 Target Version/s: 2.2.0 > Improve ml word2vec save/load scalability > - > > Key: SPARK-19247 > URL: https://issues.apache.org/jira/browse/SPARK-19247 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Asher Krim > > ml word2vec models can be somewhat large (~4gb is not uncommon). The current > save implementation saves the model as a single large datum, which can cause > rpc issues and fail to save the model. > On the loading side, there are issues with loading this large datum as well. > This was already solved for mllib word2vec in > https://issues.apache.org/jira/browse/SPARK-11994, but the change was never > ported to the ml word2vec implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14503) spark.ml Scala API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850707#comment-15850707 ] Joseph K. Bradley commented on SPARK-14503: --- Sounds good, thank you! > spark.ml Scala API for FPGrowth > --- > > Key: SPARK-14503 > URL: https://issues.apache.org/jira/browse/SPARK-14503 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > This task is the first port of spark.mllib.fpm functionality to spark.ml > (Scala). > This will require a brief design doc to confirm a reasonable DataFrame-based > API, with details for this class. The doc could also look ahead to the other > fpm classes, especially if their API decisions will affect FPGrowth. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19389) Minor doc fixes, including Since tags in Python Params
[ https://issues.apache.org/jira/browse/SPARK-19389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-19389. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16723 [https://github.com/apache/spark/pull/16723] > Minor doc fixes, including Since tags in Python Params > -- > > Key: SPARK-19389 > URL: https://issues.apache.org/jira/browse/SPARK-19389 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > Fix For: 2.2.0 > > > I spotted some doc issues, mainly in Python, when reviewing [SPARK-19336] > which were not related to that PR. This PR fixes them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14503) spark.ml Scala API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848747#comment-15848747 ] Joseph K. Bradley commented on SPARK-14503: --- There are a couple of design issues which have been mentioned in either the design doc or the PR, but which should probably be discussed in more detail in JIRA: * Item type: It looks like this currently assumes every item is represented as a String. I'd like us to support any Catalyst type. If that's hard to do (until we port the implementation over to DataFrames), then just supporting String is OK as long as it's clearly documented. * FPGrowth vs AssociationRules: The APIs are a bit fuzzy right now. I’ve listed them out below. The problem is that AssociationRules is tightly tied to FPGrowth. While I like the idea of being able to use AssociationRules to analyze the output of multiple FPM algorithms, I don’t think it’s applicable to PrefixSpan since it does not take the ordering of the itemsets into account. I’d propose we provide a single API under the name "FPGrowth." ** Q: Have you heard of anyone needing the AssociationRules API without going through FPGrowth first? If so, then we could expose the AssociationRules algorithm as a @DeveloperApi static method. What does everyone think? 
Current APIs:
* FPGrowth
** Input to fit() and transform(): Seq(items)
** Output
*** transform(): -> same as AssociationRules
*** getFreqItems: {code}DataFrame["items", "freq"]{code}
* AssociationRules
** Input
*** fit(): (output of FPGrowth)
*** transform(): Seq(items) -> Not good that fit/transform take different inputs
** Output
*** transform(): predicted items for each Seq(items)
*** associationRules: {code}DataFrame["antecedent", "consequent", "confidence"]{code}
Proposal: Combine under FPGrowth
* FPGrowth
** Input to fit() and transform(): Seq(items)
** Output
*** transform(): predicted items for each Seq(items)
*** getFreqItems: {code}DataFrame["items", "freq"]{code}
*** associationRules: {code}DataFrame["antecedent", "consequent", "confidence"]{code}
> spark.ml Scala API for FPGrowth > --- > > Key: SPARK-14503 > URL: https://issues.apache.org/jira/browse/SPARK-14503 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > This task is the first port of spark.mllib.fpm functionality to spark.ml > (Scala). > This will require a brief design doc to confirm a reasonable DataFrame-based > API, with details for this class. The doc could also look ahead to the other > fpm classes, especially if their API decisions will affect FPGrowth. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
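The rule-generation step under discussion can be sketched in pure Python: confidence(A -> B) = support(A and B together) / support(A), computed from a frequent-itemset table shaped like FPGrowth's ["items", "freq"] output. Data and function names here are illustrative, not the spark.ml API.

```python
# Toy frequent-itemset table, keyed by itemset with its support count.
freq = {
    frozenset(["a"]): 4,
    frozenset(["b"]): 3,
    frozenset(["a", "b"]): 2,
}

def association_rules(freq, min_confidence=0.4):
    # For each frequent itemset of size >= 2, emit single-consequent rules
    # whose confidence clears the threshold.
    rules = []
    for itemset, support in freq.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            antecedent = itemset - {item}
            conf = support / freq[antecedent]
            if conf >= min_confidence:
                rules.append((sorted(antecedent), item, conf))
    return rules

for antecedent, consequent, conf in association_rules(freq):
    print(antecedent, "->", consequent, round(conf, 3))
```

Because this step consumes only the frequent-itemset table, folding it into FPGrowth (as proposed) costs nothing API-wise: {{associationRules}} becomes a derived output of the fitted model.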
[jira] [Comment Edited] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()
[ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847786#comment-15847786 ] Joseph K. Bradley edited comment on SPARK-12965 at 2/1/17 12:43 AM: I'd say this is both a SQL and MLlib issue, where the MLlib issue is blocked by the SQL one. * SQL: {{schema}} handles periods/quotes inconsistently relative to the rest of the Dataset API * ML: StringIndexer could avoid using schema.fieldNames and instead use an API provided by StructType for checking for the existence of a field. That said, that API needs to be added to StructType... I'm going to update this issue to be for ML only to handle fixing StringIndexer and link to a separate JIRA for the SQL issue. was (Author: josephkb): I'd say this is both a SQL and MLlib issue. I'm going to update this issue to be for ML only to handle fixing StringIndexer and link to a separate JIRA for the SQL issue. > Indexer setInputCol() doesn't resolve column names like DataFrame.col() > --- > > Key: SPARK-12965 > URL: https://issues.apache.org/jira/browse/SPARK-12965 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Joshua Taylor > Attachments: SparkMLDotColumn.java > > > The setInputCol() method doesn't seem to resolve column names in the same way > that other methods do. E.g., given a DataFrame df, {{df.col("`a.b`")}} will > return a column. On a StringIndexer indexer, > {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting > and transforming seem to have no effect. Running the following code produces: > {noformat} > +---+---+--------+ > |a.b|a_b|a_bIndex| > +---+---+--------+ > |foo|foo|     0.0| > |bar|bar|     1.0| > +---+---+--------+ > {noformat} > but I think it should have another column, {{abIndex}}, with the same contents > as a_bIndex. 
> {code} > public class SparkMLDotColumn { > public static void main(String[] args) { > // Get the contexts > SparkConf conf = new SparkConf() > .setMaster("local[*]") > .setAppName("test") > .set("spark.ui.enabled", "false"); > JavaSparkContext sparkContext = new JavaSparkContext(conf); > SQLContext sqlContext = new SQLContext(sparkContext); > > // Create a schema with a single string column named "a.b" > StructType schema = new StructType(new StructField[] { > DataTypes.createStructField("a.b", > DataTypes.StringType, false) > }); > // Create an RDD and a DataFrame with two rows > List<Row> rows = Arrays.asList(RowFactory.create("foo"), > RowFactory.create("bar")); > JavaRDD<Row> rdd = sparkContext.parallelize(rows); > DataFrame df = sqlContext.createDataFrame(rdd, schema); > > df = df.withColumn("a_b", df.col("`a.b`")); > > StringIndexer indexer0 = new StringIndexer(); > indexer0.setInputCol("a_b"); > indexer0.setOutputCol("a_bIndex"); > df = indexer0.fit(df).transform(df); > > StringIndexer indexer1 = new StringIndexer(); > indexer1.setInputCol("`a.b`"); > indexer1.setOutputCol("abIndex"); > df = indexer1.fit(df).transform(df); > > df.show(); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()
[ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847786#comment-15847786 ] Joseph K. Bradley commented on SPARK-12965: --- I'd say this is both a SQL and MLlib issue. I'm going to update this issue to be for ML only to handle fixing StringIndexer and link to a separate JIRA for the SQL issue. > Indexer setInputCol() doesn't resolve column names like DataFrame.col() > --- > > Key: SPARK-12965 > URL: https://issues.apache.org/jira/browse/SPARK-12965 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Joshua Taylor > Attachments: SparkMLDotColumn.java > > > The setInputCol() method doesn't seem to resolve column names in the same way > that other methods do. E.g., given a DataFrame df, {{df.col("`a.b`")}} will > return a column. On a StringIndexer indexer, > {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting > and transforming seem to have no effect. Running the following code produces: > {noformat} > +---+---+--------+ > |a.b|a_b|a_bIndex| > +---+---+--------+ > |foo|foo|     0.0| > |bar|bar|     1.0| > +---+---+--------+ > {noformat} > but I think it should have another column, {{abIndex}}, with the same contents > as a_bIndex. 
> {code} > public class SparkMLDotColumn { > public static void main(String[] args) { > // Get the contexts > SparkConf conf = new SparkConf() > .setMaster("local[*]") > .setAppName("test") > .set("spark.ui.enabled", "false"); > JavaSparkContext sparkContext = new JavaSparkContext(conf); > SQLContext sqlContext = new SQLContext(sparkContext); > > // Create a schema with a single string column named "a.b" > StructType schema = new StructType(new StructField[] { > DataTypes.createStructField("a.b", > DataTypes.StringType, false) > }); > // Create an RDD and a DataFrame with two rows > List<Row> rows = Arrays.asList(RowFactory.create("foo"), > RowFactory.create("bar")); > JavaRDD<Row> rdd = sparkContext.parallelize(rows); > DataFrame df = sqlContext.createDataFrame(rdd, schema); > > df = df.withColumn("a_b", df.col("`a.b`")); > > StringIndexer indexer0 = new StringIndexer(); > indexer0.setInputCol("a_b"); > indexer0.setOutputCol("a_bIndex"); > df = indexer0.fit(df).transform(df); > > StringIndexer indexer1 = new StringIndexer(); > indexer1.setInputCol("`a.b`"); > indexer1.setOutputCol("abIndex"); > df = indexer1.fit(df).transform(df); > > df.show(); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()
[ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12965: -- Affects Version/s: (was: 1.6.0) 2.2.0 1.6.3 2.0.2 2.1.0 > Indexer setInputCol() doesn't resolve column names like DataFrame.col() > --- > > Key: SPARK-12965 > URL: https://issues.apache.org/jira/browse/SPARK-12965 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Joshua Taylor > Attachments: SparkMLDotColumn.java > > > The setInputCol() method doesn't seem to resolve column names in the same way > that other methods do. E.g., given a DataFrame df, {{df.col("`a.b`")}} will > return a column. On a StringIndexer indexer, > {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting > and transforming seem to have no effect. Running the following code produces: > {noformat} > +---+---+--------+ > |a.b|a_b|a_bIndex| > +---+---+--------+ > |foo|foo|     0.0| > |bar|bar|     1.0| > +---+---+--------+ > {noformat} > but I think it should have another column, {{abIndex}}, with the same contents > as a_bIndex. 
> {code} > public class SparkMLDotColumn { > public static void main(String[] args) { > // Get the contexts > SparkConf conf = new SparkConf() > .setMaster("local[*]") > .setAppName("test") > .set("spark.ui.enabled", "false"); > JavaSparkContext sparkContext = new JavaSparkContext(conf); > SQLContext sqlContext = new SQLContext(sparkContext); > > // Create a schema with a single string column named "a.b" > StructType schema = new StructType(new StructField[] { > DataTypes.createStructField("a.b", > DataTypes.StringType, false) > }); > // Create an RDD and a DataFrame with two rows > List<Row> rows = Arrays.asList(RowFactory.create("foo"), > RowFactory.create("bar")); > JavaRDD<Row> rdd = sparkContext.parallelize(rows); > DataFrame df = sqlContext.createDataFrame(rdd, schema); > > df = df.withColumn("a_b", df.col("`a.b`")); > > StringIndexer indexer0 = new StringIndexer(); > indexer0.setInputCol("a_b"); > indexer0.setOutputCol("a_bIndex"); > df = indexer0.fit(df).transform(df); > > StringIndexer indexer1 = new StringIndexer(); > indexer1.setInputCol("`a.b`"); > indexer1.setOutputCol("abIndex"); > df = indexer1.fit(df).transform(df); > > df.show(); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()
[ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12965: -- Component/s: (was: Spark Core) > Indexer setInputCol() doesn't resolve column names like DataFrame.col() > --- > > Key: SPARK-12965 > URL: https://issues.apache.org/jira/browse/SPARK-12965 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Joshua Taylor > Attachments: SparkMLDotColumn.java > > > The setInputCol() method doesn't seem to resolve column names in the same way > that other methods do. E.g., given a DataFrame df, {{df.col("`a.b`")}} will > return a column. On a StringIndexer indexer, > {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting > and transforming seem to have no effect. Running the following code produces: > {noformat} > +---+---+--------+ > |a.b|a_b|a_bIndex| > +---+---+--------+ > |foo|foo|     0.0| > |bar|bar|     1.0| > +---+---+--------+ > {noformat} > but I think it should have another column, {{abIndex}}, with the same contents > as a_bIndex. 
> {code} > public class SparkMLDotColumn { > public static void main(String[] args) { > // Get the contexts > SparkConf conf = new SparkConf() > .setMaster("local[*]") > .setAppName("test") > .set("spark.ui.enabled", "false"); > JavaSparkContext sparkContext = new JavaSparkContext(conf); > SQLContext sqlContext = new SQLContext(sparkContext); > > // Create a schema with a single string column named "a.b" > StructType schema = new StructType(new StructField[] { > DataTypes.createStructField("a.b", > DataTypes.StringType, false) > }); > // Create an RDD and a DataFrame with two rows > List<Row> rows = Arrays.asList(RowFactory.create("foo"), > RowFactory.create("bar")); > JavaRDD<Row> rdd = sparkContext.parallelize(rows); > DataFrame df = sqlContext.createDataFrame(rdd, schema); > > df = df.withColumn("a_b", df.col("`a.b`")); > > StringIndexer indexer0 = new StringIndexer(); > indexer0.setInputCol("a_b"); > indexer0.setOutputCol("a_bIndex"); > df = indexer0.fit(df).transform(df); > > StringIndexer indexer1 = new StringIndexer(); > indexer1.setInputCol("`a.b`"); > indexer1.setOutputCol("abIndex"); > df = indexer1.fit(df).transform(df); > > df.show(); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19416) Dataset.schema is inconsistent with Dataset in handling columns with periods
Joseph K. Bradley created SPARK-19416: - Summary: Dataset.schema is inconsistent with Dataset in handling columns with periods Key: SPARK-19416 URL: https://issues.apache.org/jira/browse/SPARK-19416 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.2, 1.6.3, 2.2.0 Reporter: Joseph K. Bradley Priority: Minor When you have a DataFrame with a column with a period in its name, the API is inconsistent about how to quote the column name. Here's a reproduction: {code} import org.apache.spark.sql.functions.col val rows = Seq( ("foo", 1), ("bar", 2) ) val df = spark.createDataFrame(rows).toDF("a.b", "id") {code} These methods are all consistent: {code} df.select("a.b") // fails df.select("`a.b`") // succeeds df.select(col("a.b")) // fails df.select(col("`a.b`")) // succeeds df("a.b") // fails df("`a.b`") // succeeds {code} But {{schema}} is inconsistent: {code} df.schema("a.b") // succeeds df.schema("`a.b`") // fails {code} "fails" produces error messages like: {code} org.apache.spark.sql.AnalysisException: cannot resolve '`a.b`' given input columns: [a.b, id];; 'Project ['a.b] +- Project [_1#1511 AS a.b#1516, _2#1512 AS id#1517] +- LocalRelation [_1#1511, _2#1512] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822) at 
org.apache.spark.sql.Dataset.select(Dataset.scala:1121) at org.apache.spark.sql.Dataset.select(Dataset.scala:1139) at line9667c6d14e79417280e5882aa52e0de727.$read$$iw$$iw$$iw$$iw.<init>(<console>:34) at line9667c6d14e79417280e5882aa52e0de727.$read$$iw$$iw$$iw.<init>(<console>:41) at line9667c6d14e79417280e5882aa52e0de727.$read$$iw$$iw.<init>(<console>:43) at line9667c6d14e79417280e5882aa52e0de727.$read$$iw.<init>(<console>:45) at line9667c6d14e79417280e5882aa52e0de727.$eval$.$print$lzycompute(<console>:7) at line9667c6d14e79417280e5882aa52e0de727.$eval$.$print(<console>:6) {code} "succeeds" produces: {code} org.apache.spark.sql.DataFrame = [a.b: string
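The consistency being asked for can be sketched as one shared lookup helper that treats a backquoted name as a literal field and resolves it the same way everywhere (both `schema(...)` and `select(...)` would go through it). The helper below is purely illustrative; it is not Spark's resolver.

```python
def resolve_field(field_names, name):
    # Hypothetical unified resolution: strip backtick quoting, then look up
    # the remaining name literally (a dot inside backticks is part of the
    # column name, not a nested-field path).
    if name.startswith("`") and name.endswith("`") and len(name) >= 2:
        name = name[1:-1]
    if name in field_names:
        return name
    raise KeyError("cannot resolve %r given fields %s" % (name, field_names))

fields = ["a.b", "id"]
print(resolve_field(fields, "`a.b`"))  # a.b
print(resolve_field(fields, "a.b"))    # a.b
```

A real fix would also have to decide how an unquoted `a.b` interacts with genuinely nested columns, which is why the inconsistency exists in the first place.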
[jira] [Updated] (SPARK-19247) Improve ml word2vec save/load scalability
[ https://issues.apache.org/jira/browse/SPARK-19247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19247: -- Component/s: ML > Improve ml word2vec save/load scalability > - > > Key: SPARK-19247 > URL: https://issues.apache.org/jira/browse/SPARK-19247 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Asher Krim > > ml word2vec models can be somewhat large (~4gb is not uncommon). The current > save implementation saves the model as a single large datum, which can cause > rpc issues and fail to save the model. > On the loading side, there are issues with loading this large datum as well. > This was already solved for mllib word2vec in > https://issues.apache.org/jira/browse/SPARK-11994, but the change was never > ported to the ml word2vec implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19294) improve LocalLDAModel save/load scaling for large models
[ https://issues.apache.org/jira/browse/SPARK-19294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19294: -- Component/s: ML > improve LocalLDAModel save/load scaling for large models > > > Key: SPARK-19294 > URL: https://issues.apache.org/jira/browse/SPARK-19294 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Asher Krim > > The LDA model in ml has some of the same problems addressed by > https://issues.apache.org/jira/browse/SPARK-19247 for word2vec. > An LDA model is on order of `vocabSize` * `k`, which can easily reach 3gb for > k=1000 and vocabSize=3m. It's currently saved as a single datum in 1 > partition. > Instead, we should represent the matrix as a list, and use the logic from > https://issues.apache.org/jira/browse/SPARK-11994 to pick a reasonable number > of partitions. > cc [~josephkb] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
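The SPARK-11994-style partitioning logic referenced above can be sketched in a few lines. This is an illustrative, Spark-free sketch, not the actual Spark implementation; the 128 MB per-partition target and 8 bytes per value are assumed numbers.

```scala
// Hypothetical sketch of a SPARK-11994-style save heuristic: instead of
// writing a vocabSize x k matrix as a single datum in one partition,
// choose enough partitions that each stays under a target byte size.
// The 128 MB target and 8 bytes per Double are illustrative assumptions.
def numPartitionsForSave(
    numRows: Long,                                       // e.g. vocabSize
    numCols: Long,                                       // e.g. k topics
    bytesPerValue: Long = 8L,                            // one Double
    targetBytesPerPartition: Long = 128L * 1024 * 1024): Int = {
  val totalBytes = numRows * numCols * bytesPerValue
  math.max(1L, math.ceil(totalBytes.toDouble / targetBytesPerPartition).toLong).toInt
}
```

For the vocabSize=3m, k=1000 example above, this yields on the order of 180 partitions rather than 1, so no single task has to serialize a multi-GB datum.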
[jira] [Updated] (SPARK-19247) Improve ml word2vec save/load scalability
[ https://issues.apache.org/jira/browse/SPARK-19247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19247: -- Summary: Improve ml word2vec save/load scalability (was: improve ml word2vec save/load) > Improve ml word2vec save/load scalability > - > > Key: SPARK-19247 > URL: https://issues.apache.org/jira/browse/SPARK-19247 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Asher Krim > > ml word2vec models can be somewhat large (~4gb is not uncommon). The current > save implementation saves the model as a single large datum, which can cause > rpc issues and fail to save the model. > On the loading side, there are issues with loading this large datum as well. > This was already solved for mllib word2vec in > https://issues.apache.org/jira/browse/SPARK-11994, but the change was never > ported to the ml word2vec implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19247) Improve ml word2vec save/load scalability
[ https://issues.apache.org/jira/browse/SPARK-19247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19247: -- Issue Type: Improvement (was: Bug) > Improve ml word2vec save/load scalability > - > > Key: SPARK-19247 > URL: https://issues.apache.org/jira/browse/SPARK-19247 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Asher Krim > > ml word2vec models can be somewhat large (~4gb is not uncommon). The current > save implementation saves the model as a single large datum, which can cause > rpc issues and fail to save the model. > On the loading side, there are issues with loading this large datum as well. > This was already solved for mllib word2vec in > https://issues.apache.org/jira/browse/SPARK-11994, but the change was never > ported to the ml word2vec implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19294) improve LocalLDAModel save/load scaling for large models
[ https://issues.apache.org/jira/browse/SPARK-19294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19294: -- Summary: improve LocalLDAModel save/load scaling for large models (was: improve ml LDA save/load) > improve LocalLDAModel save/load scaling for large models > > > Key: SPARK-19294 > URL: https://issues.apache.org/jira/browse/SPARK-19294 > Project: Spark > Issue Type: Bug >Reporter: Asher Krim > > The LDA model in ml has some of the same problems addressed by > https://issues.apache.org/jira/browse/SPARK-19247 for word2vec. > An LDA model is on order of `vocabSize` * `k`, which can easily reach 3gb for > k=1000 and vocabSize=3m. It's currently saved as a single datum in 1 > partition. > Instead, we should represent the matrix as a list, and use the logic from > https://issues.apache.org/jira/browse/SPARK-11994 to pick a reasonable number > of partitions. > cc [~josephkb] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19294) improve LocalLDAModel save/load scaling for large models
[ https://issues.apache.org/jira/browse/SPARK-19294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19294: -- Issue Type: Improvement (was: Bug) > improve LocalLDAModel save/load scaling for large models > > > Key: SPARK-19294 > URL: https://issues.apache.org/jira/browse/SPARK-19294 > Project: Spark > Issue Type: Improvement >Reporter: Asher Krim > > The LDA model in ml has some of the same problems addressed by > https://issues.apache.org/jira/browse/SPARK-19247 for word2vec. > An LDA model is on order of `vocabSize` * `k`, which can easily reach 3gb for > k=1000 and vocabSize=3m. It's currently saved as a single datum in 1 > partition. > Instead, we should represent the matrix as a list, and use the logic from > https://issues.apache.org/jira/browse/SPARK-11994 to pick a reasonable number > of partitions. > cc [~josephkb] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19389) Minor doc fixes, including Since tags in Python Params
Joseph K. Bradley created SPARK-19389: - Summary: Minor doc fixes, including Since tags in Python Params Key: SPARK-19389 URL: https://issues.apache.org/jira/browse/SPARK-19389 Project: Spark Issue Type: Documentation Components: Documentation, ML, PySpark Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor I spotted some doc issues, mainly in Python, when reviewing [SPARK-19336] which were not related to that PR. This PR fixes them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19336) LinearSVC Python API
[ https://issues.apache.org/jira/browse/SPARK-19336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-19336. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16694 [https://github.com/apache/spark/pull/16694] > LinearSVC Python API > > > Key: SPARK-19336 > URL: https://issues.apache.org/jira/browse/SPARK-19336 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Miao Wang > Fix For: 2.2.0 > > > Create a Python wrapper for spark.ml.classification.LinearSVC -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843670#comment-15843670 ] Joseph K. Bradley commented on SPARK-19208: --- Thanks for writing out your ideas. Here are my thoughts about the API: *Reference API: Double column stats* When working with Double columns (not Vectors), one would expect to write things like {{myDataFrame.select(min("x"), max("x"))}} to select 2 stats, min and max. Here, min and max are functions provided by Spark SQL which return columns. *Analogy* We should probably provide an analogous API. Here's what I imagine: {code} import org.apache.spark.ml.stat.VectorSummary val df: DataFrame = ... val results: DataFrame = df.select(VectorSummary.min("features"), VectorSummary.mean("features")) val weightedResults: DataFrame = df.select(VectorSummary.min("features"), VectorSummary.mean("features", "weight")) // Both of these result DataFrames contain 2 Vector columns. {code} I.e., we provide vectorized versions of stats functions. If we want to put everything into a single function, we could also have VectorSummary provide a function "summary" which returns a struct type with every stat available: {code} val results = df.select(VectorSummary.summary("features", "weights")) // results DataFrame contains 1 struct column, which has a Vector field for every statistic we provide. {code} Note: I removed "online" from the name since the user does not need to know that it does online aggregation. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However, {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. 
This slows down the task and makes it more prone to OOM. > For example: > env: --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748,401 instances and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM. > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (the max of absolute values). > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz). > After the modification in the PR, the above example runs successfully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
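To make the memory argument above concrete, here is a minimal, Spark-free sketch of a summarizer that tracks only the one statistic MaxAbsScaler needs (the per-feature maximum absolute value) instead of the eight buffers listed in the report. The class and method names are illustrative, not Spark's actual API.

```scala
// Illustrative sketch: a summarizer keeping a single Array[Double]
// (per-feature max absolute value) instead of eight parallel buffers.
// add() consumes one sparse row; merge() combines partial summaries,
// as a combOp would in treeAggregate.
class MaxAbsSummarizer(numFeatures: Int) extends Serializable {
  private val currMaxAbs = Array.fill(numFeatures)(0.0)

  // Add one (sparse) row given as parallel (index, value) arrays.
  def add(indices: Array[Int], values: Array[Double]): this.type = {
    var i = 0
    while (i < indices.length) {
      val abs = math.abs(values(i))
      if (abs > currMaxAbs(indices(i))) currMaxAbs(indices(i)) = abs
      i += 1
    }
    this
  }

  // Merge another partial summary into this one.
  def merge(other: MaxAbsSummarizer): this.type = {
    var i = 0
    while (i < numFeatures) {
      if (other.currMaxAbs(i) > currMaxAbs(i)) currMaxAbs(i) = other.currMaxAbs(i)
      i += 1
    }
    this
  }

  def maxAbs: Array[Double] = currMaxAbs.clone()
}
```

With ~30 million features, one such array is ~240 MB of Doubles, while eight parallel buffers of that width are what pushes a 1 GB executor into OOM.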
[jira] [Updated] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19208: -- Summary: MultivariateOnlineSummarizer performance optimization (was: MultivariateOnlineSummarizer perfermence optimization) > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However, {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. This slows down the task and makes it more prone to OOM. > For example: > env: --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748,401 instances and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM. > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (the max of absolute values). > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz). > After the modification in the PR, the above example runs successfully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19382) Test sparse vectors in LinearSVCSuite
Joseph K. Bradley created SPARK-19382: - Summary: Test sparse vectors in LinearSVCSuite Key: SPARK-19382 URL: https://issues.apache.org/jira/browse/SPARK-19382 Project: Spark Issue Type: Test Components: ML Reporter: Joseph K. Bradley Priority: Minor Currently, LinearSVCSuite does not test sparse vectors. We should. I recommend that generateSVMInput be modified to create a mix of dense and sparse vectors, rather than adding an additional test. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18218) Optimize BlockMatrix multiplication, which may cause OOM and low parallelism usage problem in several cases
[ https://issues.apache.org/jira/browse/SPARK-18218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18218: -- Shepherd: Burak Yavuz (was: Yanbo Liang) > Optimize BlockMatrix multiplication, which may cause OOM and low parallelism > usage problem in several cases > --- > > Key: SPARK-18218 > URL: https://issues.apache.org/jira/browse/SPARK-18218 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > After taking a deep look into the `BlockMatrix.multiply` implementation, I found > that the current implementation may cause problems in special cases. > Let me use an extreme case to illustrate: > Suppose we have two BlockMatrix instances A and B. > A has 1 block, numRowBlocks = 1, numColBlocks = 1 > B also has 1 block, numRowBlocks = 1, numColBlocks = 1 > Now if we call A.multiply(B), no matter how A and B are partitioned, > the resultPartitioner will always contain only one partition, > so this multiplication implementation will shuffle 1 * 1 blocks into one > reducer, which reduces the parallelism to 1. > What's worse, because `RDD.cogroup` loads the whole group into > memory, at the reducer side 1 * 1 blocks will be loaded into memory, > because they are all shuffled into the same group. This can easily cause > executor OOM. > The above case is a little extreme, but in other cases, such as an M*N > matrix A multiplied by an N*P matrix B where N is much larger than M and > P, we hit a similar problem. > The multiplication implementation does not handle task partitioning properly, > which causes: > 1. when the middle dimension N is too large, reducer OOM; > 2. even if OOM does not occur, parallelism that is too low; > 3. when N is much larger than M and P, and matrices A and B have many > partitions, too many partitions on the M and P dimensions, which leads to a > much larger shuffled data size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries
[ https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840711#comment-15840711 ] Joseph K. Bradley commented on SPARK-4638: -- Commenting here b/c of the recent dev list thread: Non-linear kernels for SVMs in Spark would be great to have. The main barriers are: * Kernelized SVM training is hard to distribute. Naive methods require a lot of communication. To get this feature into Spark, we'd need to do proper background research and write up a good design. * Other ML algorithms are arguably more in demand and still need improvements (as of the date of this comment). Tree ensembles are first-and-foremost in my mind. > Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to > find non linear boundaries > --- > > Key: SPARK-4638 > URL: https://issues.apache.org/jira/browse/SPARK-4638 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: madankumar s > Labels: Gaussian, Kernels, SVM > Attachments: kernels-1.3.patch > > > SPARK MLlib Classification Module: > Add Kernel functionalities to SVM Classifier to find non linear patterns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18080) Locality Sensitive Hashing (LSH) Python API
[ https://issues.apache.org/jira/browse/SPARK-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18080: -- Assignee: Yun Ni (was: Yanbo Liang) > Locality Sensitive Hashing (LSH) Python API > --- > > Key: SPARK-18080 > URL: https://issues.apache.org/jira/browse/SPARK-18080 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Yun Ni > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14804) Graph vertexRDD/EdgeRDD checkpoint results ClassCastException:
[ https://issues.apache.org/jira/browse/SPARK-14804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14804: -- Assignee: Tathagata Das > Graph vertexRDD/EdgeRDD checkpoint results ClassCastException: > --- > > Key: SPARK-14804 > URL: https://issues.apache.org/jira/browse/SPARK-14804 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.6.1 >Reporter: SuYan >Assignee: Tathagata Das >Priority: Minor > Fix For: 2.0.3, 2.1.1, 3.0.0 > > > {code} > graph3.vertices.checkpoint() > graph3.vertices.count() > graph3.vertices.map(_._2).count() > {code} > 16/04/21 21:04:43 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 > (TID 13, localhost): java.lang.ClassCastException: > org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast to > scala.Tuple2 > at > com.xiaomi.infra.codelab.spark.Graph2$$anonfun$main$1.apply(Graph2.scala:80) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:91) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:219) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > look at the code: > {code} > private[spark] def computeOrReadCheckpoint(split: Partition, context: > TaskContext): Iterator[T] = > { > if (isCheckpointedAndMaterialized) { > firstParent[T].iterator(split, 
context) > } else { > compute(split, context) > } > } > private[spark] def isCheckpointedAndMaterialized: Boolean = isCheckpointed > override def isCheckpointed: Boolean = { > firstParent[(PartitionID, EdgePartition[ED, VD])].isCheckpointed > } > {code} > For VertexRDD or EdgeRDD, the first parent is its partitionRDD, > RDD[ShippableVertexPartition[VD]] / RDD[(PartitionID, EdgePartition[ED, VD])]. > 1. When we call vertexRDD.checkpoint, its partitionRDD will be checkpointed, so > VertexRDD.isCheckpointedAndMaterialized=true. > 2. Then, when we call vertexRDD.iterator, because isCheckpointed=true it calls > firstParent.iterator (which is not a CheckpointRDD, but actually the partitionRDD). > So the returned iterator is an Iterator[ShippableVertexPartition], not the expected > Iterator[(VertexId, VD)]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
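The delegation problem described in this report can be reproduced in miniature without Spark. In the following illustrative sketch (all names are made up, not Spark's actual classes), a wrapper RDD reports its parent's checkpoint status, so once the parent is "checkpointed" the wrapper's iterator short-circuits to the parent's element type:

```scala
// Illustrative, Spark-free miniature of the bug: the wrapper's
// isCheckpointed delegates to its parent, so iterator() reads the
// parent's elements (whole partitions) cast to the wrapper's element type.
abstract class MiniRDD[T] {
  def firstParent: MiniRDD[_] = throw new UnsupportedOperationException("no parent")
  def isCheckpointed: Boolean = false
  def compute(): Iterator[T]
  // Mirrors RDD.computeOrReadCheckpoint: if checkpointed, read from the
  // first parent instead of computing.
  def iterator(): Iterator[T] =
    if (isCheckpointed) firstParent.compute().asInstanceOf[Iterator[T]]
    else compute()
}

// Plays the role of the partitionRDD: elements are whole partitions.
class PartitionMiniRDD(parts: Seq[Seq[(Long, String)]])
    extends MiniRDD[Seq[(Long, String)]] {
  var checkpointed = false
  override def isCheckpointed: Boolean = checkpointed
  def compute(): Iterator[Seq[(Long, String)]] = parts.iterator
}

// Plays the role of VertexRDD: elements are (id, value) pairs, but
// isCheckpointed (the bug) delegates to the parent.
class VertexMiniRDD(p: PartitionMiniRDD) extends MiniRDD[(Long, String)] {
  override def firstParent: MiniRDD[_] = p
  override def isCheckpointed: Boolean = p.isCheckpointed // the bug
  def compute(): Iterator[(Long, String)] = p.compute().flatMap(_.iterator)
}
```

Before the "checkpoint", iterator() yields tuples; afterwards the same call yields partition objects masquerading as tuples, and the first access as a tuple throws ClassCastException, matching the stack trace in the report.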
[jira] [Commented] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN
[ https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838995#comment-15838995 ] Joseph K. Bradley commented on SPARK-17975: --- [SPARK-14804] was just fixed. [~jvstein], do you have time to test master with your code to see if the bug you hit is fixed? Thanks! > EMLDAOptimizer fails with ClassCastException on YARN > > > Key: SPARK-17975 > URL: https://issues.apache.org/jira/browse/SPARK-17975 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.1 > Environment: Centos 6, CDH 5.7, Java 1.7u80 >Reporter: Jeff Stein > Attachments: docs.txt > > > I'm able to reproduce the error consistently with a 2000-record text file > with each record having 1-5 terms and checkpointing enabled. It looks like > the problem was introduced with the resolution for SPARK-13355. > The EdgeRDD class seems to be lying about its type in a way that causes > the RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an > RDD of Edge elements. 
> {code} > val spark = SparkSession.builder.appName("lda").getOrCreate() > spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") > val data: RDD[(Long, Vector)] = // snip > data.setName("data").cache() > val lda = new LDA > val optimizer = new EMLDAOptimizer > lda.setOptimizer(optimizer) > .setK(10) > .setMaxIterations(400) > .setAlpha(-1) > .setBeta(-1) > .setCheckpointInterval(7) > val ldaModel = lda.run(data) > {code} > {noformat} > 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID > 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be > cast to org.apache.spark.graphx.Edge > at > org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107) > at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:722) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-17265) EdgeRDD Difference throws an exception
[ https://issues.apache.org/jira/browse/SPARK-17265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838990#comment-15838990 ] Joseph K. Bradley commented on SPARK-17265: --- [SPARK-14804] was just fixed. [~shishir167], are you able to test with the master branch to see if this issue is fixed? > EdgeRDD Difference throws an exception > -- > > Key: SPARK-17265 > URL: https://issues.apache.org/jira/browse/SPARK-17265 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: windows, ubuntu >Reporter: Shishir Kharel > > Subtracting two edge RDDs throws an exception: > val difference = graph1.edges.subtract(graph2.edges) > gives > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: > Lost task 1.0 in stage 0.0 (TID 1, localhost): java.lang.ClassCastException: > org.apache.spark.graphx.Edge cannot be cast to scala.Tuple2 > at > org.apache.spark.rdd.RDD$$anonfun$subtract$3$$anon$3.getPartition(RDD.scala:968) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:152) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17265) EdgeRDD Difference throws an exception
[ https://issues.apache.org/jira/browse/SPARK-17265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838990#comment-15838990 ] Joseph K. Bradley edited comment on SPARK-17265 at 1/26/17 1:41 AM: [SPARK-14804] was just fixed. [~shishir167], are you able to test with the master branch to see if the bug you hit is fixed? was (Author: josephkb): [SPARK-14804] was just fixed. [~shishir167], are you able to test with the master branch to see if this issue is fixed? > EdgeRDD Difference throws an exception > -- > > Key: SPARK-17265 > URL: https://issues.apache.org/jira/browse/SPARK-17265 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: windows, ubuntu >Reporter: Shishir Kharel > > Subtracting two edge RDDs throws an exception: > val difference = graph1.edges.subtract(graph2.edges) > gives > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: > Lost task 1.0 in stage 0.0 (TID 1, localhost): java.lang.ClassCastException: > org.apache.spark.graphx.Edge cannot be cast to scala.Tuple2 > at > org.apache.spark.rdd.RDD$$anonfun$subtract$3$$anon$3.getPartition(RDD.scala:968) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:152) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17877) Can not checkpoint connectedComponents resulting graph
[ https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838993#comment-15838993 ] Joseph K. Bradley commented on SPARK-17877: --- [SPARK-14804] was just fixed. [~apivovarov], are you able to test with the master branch to see if the bug you hit is fixed? > Can not checkpoint connectedComponents resulting graph > -- > > Key: SPARK-17877 > URL: https://issues.apache.org/jira/browse/SPARK-17877 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.5.2, 1.6.2, 2.0.1 >Reporter: Alexander Pivovarov >Priority: Minor > > The following code demonstrates the issue > {code} > import org.apache.spark.graphx._ > val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L > -> "kelly")) > val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, > "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) > sc.setCheckpointDir("/tmp/check") > val g = Graph(users, rel) > g.checkpoint // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears > val gg = g.connectedComponents > gg.checkpoint > gg.vertices.collect > gg.edges.collect > gg.isCheckpointed // res5: Boolean = false, /tmp/check still contains only > 1 folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f > {code} > I think the last line should return true instead of false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18036) Decision Trees do not handle edge cases
[ https://issues.apache.org/jira/browse/SPARK-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-18036. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16377 [https://github.com/apache/spark/pull/16377] > Decision Trees do not handle edge cases > --- > > Key: SPARK-18036 > URL: https://issues.apache.org/jira/browse/SPARK-18036 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Seth Hendrickson >Assignee: Ilya Matiach >Priority: Minor > Fix For: 2.2.0 > > > Decision trees/GBT/RF do not handle edge cases such as constant features or > empty features. For example: > {code} > val dt = new DecisionTreeRegressor() > val data = Seq(LabeledPoint(1.0, Vectors.dense(Array.empty[Double]))).toDF() > dt.fit(data) > java.lang.UnsupportedOperationException: empty.max > at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229) > at scala.collection.mutable.ArrayOps$ofInt.max(ArrayOps.scala:234) > at > org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:207) > at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105) > at > org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:93) > at > org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:46) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > ... 
52 elided > {code} > as well as > {code} > val dt = new DecisionTreeRegressor() > val data = Seq(LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0))).toDF() > dt.fit(data) > java.lang.UnsupportedOperationException: empty.maxBy > at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236) > at > scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37) > at > org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:846) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18036) Decision Trees do not handle edge cases
[ https://issues.apache.org/jira/browse/SPARK-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18036: -- Assignee: Ilya Matiach > Decision Trees do not handle edge cases > --- > > Key: SPARK-18036 > URL: https://issues.apache.org/jira/browse/SPARK-18036 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Seth Hendrickson >Assignee: Ilya Matiach >Priority: Minor > > Decision trees/GBT/RF do not handle edge cases such as constant features or > empty features. For example: > {code} > val dt = new DecisionTreeRegressor() > val data = Seq(LabeledPoint(1.0, Vectors.dense(Array.empty[Double]))).toDF() > dt.fit(data) > java.lang.UnsupportedOperationException: empty.max > at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229) > at scala.collection.mutable.ArrayOps$ofInt.max(ArrayOps.scala:234) > at > org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:207) > at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105) > at > org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:93) > at > org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:46) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > ... 52 elided > {code} > as well as > {code} > val dt = new DecisionTreeRegressor() > val data = Seq(LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0))).toDF() > dt.fit(data) > java.lang.UnsupportedOperationException: empty.maxBy > at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236) > at > scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37) > at > org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:846) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829014#comment-15829014 ] Joseph K. Bradley commented on SPARK-19208: --- +1 for [~mlnick]'s suggestion. If we're optimizing performance for summarizers, let's use DataFrame ops. With the right design, we could have all current statistics represented as columns in the result, but only materialize the ones we need. > MaxAbsScaler and MinMaxScaler are very inefficient > -- > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task; moreover, it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748,401 instances and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After the modification in the PR, the above example runs successfully. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
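The point that {{MaxAbsScaler}} needs only one array can be made concrete. The sketch below is my own illustration, not the code from the PR: it shows the seqOp/combOp pair a single aggregation pass would use over sparse vectors, accumulating just the per-feature maximum absolute value. Local reductions stand in for the distributed pass; in Spark the two functions would be handed to `RDD.treeAggregate`.

```scala
// Illustrative sketch (not the actual PR): MaxAbsScaler only needs one array,
// the per-feature max of |value|. seqOp folds one sparse vector into the
// accumulator; combOp merges two partition accumulators.
type SparseVec = (Array[Int], Array[Double]) // (indices, values)

def seqOp(maxAbs: Array[Double], v: SparseVec): Array[Double] = {
  val (indices, values) = v
  var k = 0
  while (k < indices.length) {
    maxAbs(indices(k)) = math.max(maxAbs(indices(k)), math.abs(values(k)))
    k += 1
  }
  maxAbs
}

def combOp(a: Array[Double], b: Array[Double]): Array[Double] = {
  var i = 0
  while (i < a.length) { a(i) = math.max(a(i), b(i)); i += 1 }
  a
}

val numFeatures = 4
// Two "partitions" of one sparse row each, reduced locally here.
val part1 = seqOp(new Array[Double](numFeatures), (Array(0, 2), Array(-3.0, 1.5)))
val part2 = seqOp(new Array[Double](numFeatures), (Array(1, 2), Array(2.0, -4.0)))
val maxAbs = combOp(part1, part2) // [3.0, 2.0, 4.0, 0.0]
```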
[jira] [Updated] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19208: -- Assignee: (was: Apache Spark) > MaxAbsScaler and MinMaxScaler are very inefficient > -- > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task; moreover, it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748,401 instances and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After the modification in the PR, the above example runs successfully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14975) Predicted Probability per training instance for Gradient Boosted Trees
[ https://issues.apache.org/jira/browse/SPARK-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14975: -- Summary: Predicted Probability per training instance for Gradient Boosted Trees (was: Predicted Probability per training instance for Gradient Boosted Trees in mllib. ) > Predicted Probability per training instance for Gradient Boosted Trees > -- > > Key: SPARK-14975 > URL: https://issues.apache.org/jira/browse/SPARK-14975 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Partha Talukder >Assignee: Ilya Matiach >Priority: Minor > Labels: mllib > Fix For: 2.2.0 > > > This function is available for Logistic Regression, SVM, etc. > (model.setThreshold()) but not for GBT. The "gbm" package in R, by comparison, > lets us specify the distribution and get predicted probabilities or classes. I > understand that this algorithm works with "Classification" and "Regression" > algorithms. Is there any way in GBT to get predicted probabilities or to > provide thresholds to the model? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14975) Predicted Probability per training instance for Gradient Boosted Trees
[ https://issues.apache.org/jira/browse/SPARK-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-14975. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16441 [https://github.com/apache/spark/pull/16441] > Predicted Probability per training instance for Gradient Boosted Trees > -- > > Key: SPARK-14975 > URL: https://issues.apache.org/jira/browse/SPARK-14975 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Partha Talukder >Assignee: Ilya Matiach >Priority: Minor > Labels: mllib > Fix For: 2.2.0 > > > This function is available for Logistic Regression, SVM, etc. > (model.setThreshold()) but not for GBT. The "gbm" package in R, by comparison, > lets us specify the distribution and get predicted probabilities or classes. I > understand that this algorithm works with "Classification" and "Regression" > algorithms. Is there any way in GBT to get predicted probabilities or to > provide thresholds to the model? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
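A boosted ensemble produces a real-valued margin rather than a probability, which is why thresholds and probabilities were not directly exposed before this fix. As a hedged illustration (my sketch, not a quote of the spark.ml implementation; the factor of 2 reflects the +1/-1 label encoding used with GBT log loss and is an assumption of this sketch):

```scala
// Sketch: recovering a positive-class probability from a GBT margin under
// logistic (log) loss. The margin is the weighted sum of per-tree
// predictions; the 2x factor matches the +1/-1 label convention and is an
// assumption here, not the verified spark.ml code.
def margin(treeWeights: Seq[Double], treePredictions: Seq[Double]): Double =
  treeWeights.zip(treePredictions).map { case (w, p) => w * p }.sum

def probability(m: Double): Double = 1.0 / (1.0 + math.exp(-2.0 * m))

val m = margin(Seq(1.0, 0.5), Seq(0.2, -0.4)) // 0.2 - 0.2 = 0.0
val p = probability(m)                        // 0.5 at zero margin
```

With a probability in hand, thresholding as in LogisticRegression (`p > t` for the positive class) follows immediately, which is the feature the reporter asked for.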
[jira] [Closed] (SPARK-8855) Python API for Association Rules
[ https://issues.apache.org/jira/browse/SPARK-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-8855. Resolution: Won't Fix > Python API for Association Rules > > > Key: SPARK-8855 > URL: https://issues.apache.org/jira/browse/SPARK-8855 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Feynman Liang >Priority: Minor > > A simple Python wrapper and doctests need to be written for Association > Rules. The relevant method is {{FPGrowthModel.generateAssociationRules}}. The > code will likely live in {{fpm.py}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19281) spark.ml Python API for FPGrowth
Joseph K. Bradley created SPARK-19281: - Summary: spark.ml Python API for FPGrowth Key: SPARK-19281 URL: https://issues.apache.org/jira/browse/SPARK-19281 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Joseph K. Bradley See parent issue. This is for a Python API *after* the Scala API has been designed and implemented. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8855) Python API for Association Rules
[ https://issues.apache.org/jira/browse/SPARK-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828808#comment-15828808 ] Joseph K. Bradley commented on SPARK-8855: -- I'm going to close this issue in favor of the DataFrame-based API in [SPARK-14501] > Python API for Association Rules > > > Key: SPARK-8855 > URL: https://issues.apache.org/jira/browse/SPARK-8855 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Feynman Liang >Priority: Minor > > A simple Python wrapper and doctests need to be written for Association > Rules. The relevant method is {{FPGrowthModel.generateAssociationRules}}. The > code will likely live in {{fpm.py}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14503) spark.ml Scala API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14503: -- Summary: spark.ml Scala API for FPGrowth (was: spark.ml API for FPGrowth) > spark.ml Scala API for FPGrowth > --- > > Key: SPARK-14503 > URL: https://issues.apache.org/jira/browse/SPARK-14503 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > This task is the first port of spark.mllib.fpm functionality to spark.ml > (Scala). > This will require a brief design doc to confirm a reasonable DataFrame-based > API, with details for this class. The doc could also look ahead to the other > fpm classes, especially if their API decisions will affect FPGrowth. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828803#comment-15828803 ] Joseph K. Bradley commented on SPARK-17136: --- CC [~avulanov], who has thought a lot about these issues too > Design optimizer interface for ML algorithms > > > Key: SPARK-17136 > URL: https://issues.apache.org/jira/browse/SPARK-17136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > We should consider designing an interface that allows users to use their own > optimizers in some of the ML algorithms, similar to MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5256: - Description: *Goal*: Improve APIs for optimization *Motivation*: There have been several disjoint mentions of improving the optimization APIs to make them more pluggable, extensible, etc. This JIRA is a place to discuss what API changes are necessary for the long term, and to provide links to other relevant JIRAs. Eventually, I hope this leads to a design doc outlining: * current issues * requirements such as supporting many types of objective functions, optimization algorithms, and parameters to those algorithms * ideal API * breakdown of smaller JIRAs needed to achieve that API was: *Goal*: Improve APIs for optimization *Motivation*: There have been several disjoint mentions of improving the optimization APIs to make them more pluggable, extensible, etc. This JIRA is a place to discuss what API changes are necessary for the long term, and to provide links to other relevant JIRAs. Eventually, I hope this leads to a design doc outlining: * current issues * requirements such as supporting many types of objective functions, optimization algorithms, and parameters to those algorithms * ideal API * breakdown of smaller JIRAs needed to achieve that API I will soon create an initial design doc, and I will try to watch this JIRA and include ideas from JIRA comments. > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. 
> Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13610) Create a Transformer to disassemble vectors in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-13610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828756#comment-15828756 ] Joseph K. Bradley commented on SPARK-13610: --- One more: Would these selected subsets of elements be chosen by hand, or would they need to be chosen automatically via code? > Create a Transformer to disassemble vectors in DataFrames > - > > Key: SPARK-13610 > URL: https://issues.apache.org/jira/browse/SPARK-13610 > Project: Spark > Issue Type: New Feature > Components: ML, SQL >Affects Versions: 1.6.0 >Reporter: Andrew MacKinlay >Priority: Minor > > It is possible to convert a standalone numeric field into a single-item > Vector, using VectorAssembler. However the inverse operation of retrieving a > single item from a vector and translating it into a field doesn't appear to > be possible. The workaround I've found is to leave the raw field value in the > DF, but I have found no other ways to get a field out of a vector (e.g. to > perform arithmetic on it). Happy to be proved wrong though. Creating a > user-defined function doesn't work (in Python at least; it gets a > PickleException). This seems like a simple operation which should be supported > for various use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
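For the record, on the Scala side the usual workaround is a small UDF that indexes into the vector (the pickle failure the reporter hit is Python-specific). A sketch follows; the Spark wrapping appears only in comments, as the standard but here-unverified incantation, and a plain Array[Double] stands in for the ml Vector so the sketch is self-contained.

```scala
// Sketch of the usual Scala workaround for pulling one element out of a
// feature vector. In Spark this function would be wrapped as a UDF, roughly:
//   val vectorElement = udf((v: org.apache.spark.ml.linalg.Vector, i: Int) => v(i))
//   df.withColumn("f0", vectorElement(col("features"), lit(0)))
def vectorElement(v: Array[Double], i: Int): Double = v(i)

val features = Array(4.0, 5.0, 6.0)
val f1 = vectorElement(features, 1) // 5.0
```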
[jira] [Commented] (SPARK-19053) Supporting multiple evaluation metrics in DataFrame-based API: discussion
[ https://issues.apache.org/jira/browse/SPARK-19053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828755#comment-15828755 ] Joseph K. Bradley commented on SPARK-19053: --- After thinking about this more and hearing your thoughts, my top pick would be to define Metrics classes similar to the spark.mllib ones, with the current model summaries inheriting from these metrics classes. (We should sketch out APIs before implementation, though, to make sure this setup will not break any APIs or force awkward APIs.) Metrics classes could offer both aggregate and per-row metrics. > Supporting multiple evaluation metrics in DataFrame-based API: discussion > - > > Key: SPARK-19053 > URL: https://issues.apache.org/jira/browse/SPARK-19053 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > This JIRA is to discuss supporting the computation of multiple evaluation > metrics efficiently in the DataFrame-based API for MLlib. > In the RDD-based API, RegressionMetrics and other *Metrics classes support > efficient computation of multiple metrics. > In the DataFrame-based API, there are a few options: > * model/result summaries (e.g., LogisticRegressionSummary): These currently > provide the desired functionality, but they require a model and do not let > users compute metrics manually from DataFrames of predictions and true labels. > * Evaluator classes (e.g., RegressionEvaluator): These only support computing > a single metric in one pass over the data, but they do not require a model. > * new class analogous to Metrics: We could introduce a class analogous to > Metrics. Model/result summaries could use this internally as a replacement > for spark.mllib Metrics classes, or they could (maybe) inherit from these > classes. > Thoughts? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
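The one-pass property the RDD-based *Metrics classes provide, and which a new DataFrame-based Metrics class would need to preserve, can be sketched as a single accumulator carrying several statistics at once. The names below are my own illustration, not a proposed API.

```scala
// Illustrative accumulator: MAE and RMSE from one pass over
// (prediction, label) pairs, the way RegressionMetrics aggregates internally.
// add folds in one row; merge combines partial accumulators (the combOp of a
// distributed aggregation).
final case class RegAcc(n: Long, absErr: Double, sqErr: Double) {
  def add(prediction: Double, label: Double): RegAcc = {
    val d = prediction - label
    RegAcc(n + 1, absErr + math.abs(d), sqErr + d * d)
  }
  def merge(o: RegAcc): RegAcc = RegAcc(n + o.n, absErr + o.absErr, sqErr + o.sqErr)
  def mae: Double = absErr / n
  def rmse: Double = math.sqrt(sqErr / n)
}

val rows = Seq((1.0, 1.0), (2.0, 4.0), (3.0, 2.0)) // (prediction, label)
val acc = rows.foldLeft(RegAcc(0L, 0.0, 0.0)) { case (a, (p, l)) => a.add(p, l) }
// acc.mae == 1.0; acc.rmse == sqrt(5/3)
```

Because each statistic is just another field of the accumulator, adding a metric costs one field rather than one extra pass, which is the efficiency argument made above.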
[jira] [Updated] (SPARK-14975) Predicted Probability per training instance for Gradient Boosted Trees in mllib.
[ https://issues.apache.org/jira/browse/SPARK-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14975: -- Shepherd: Joseph K. Bradley > Predicted Probability per training instance for Gradient Boosted Trees in > mllib. > - > > Key: SPARK-14975 > URL: https://issues.apache.org/jira/browse/SPARK-14975 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Partha Talukder >Assignee: Ilya Matiach >Priority: Minor > Labels: mllib > > This function is available for Logistic Regression, SVM, etc. > (model.setThreshold()) but not for GBT. The "gbm" package in R, by comparison, > lets us specify the distribution and get predicted probabilities or classes. I > understand that this algorithm works with "Classification" and "Regression" > algorithms. Is there any way in GBT to get predicted probabilities or to > provide thresholds to the model? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14975) Predicted Probability per training instance for Gradient Boosted Trees in mllib.
[ https://issues.apache.org/jira/browse/SPARK-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14975: -- Assignee: Ilya Matiach > Predicted Probability per training instance for Gradient Boosted Trees in > mllib. > - > > Key: SPARK-14975 > URL: https://issues.apache.org/jira/browse/SPARK-14975 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Partha Talukder >Assignee: Ilya Matiach >Priority: Minor > Labels: mllib > > This function is available for Logistic Regression, SVM, etc. > (model.setThreshold()) but not for GBT. The "gbm" package in R, by comparison, > lets us specify the distribution and get predicted probabilities or classes. I > understand that this algorithm works with "Classification" and "Regression" > algorithms. Is there any way in GBT to get predicted probabilities or to > provide thresholds to the model? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17747) WeightCol support non-double datatypes
[ https://issues.apache.org/jira/browse/SPARK-17747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-17747: -- Shepherd: Joseph K. Bradley Assignee: zhengruifeng Target Version/s: 2.2.0 > WeightCol support non-double datatypes > -- > > Key: SPARK-17747 > URL: https://issues.apache.org/jira/browse/SPARK-17747 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > WeightCol only supports double type now; it should also accept other numeric > types, such as Int. > {code} > scala> df3.show(5) > +-----+--------------------+------+ > |label|            features|weight| > +-----+--------------------+------+ > |  0.0|(692,[127,128,129...|     1| > |  1.0|(692,[158,159,160...|     1| > |  1.0|(692,[124,125,126...|     1| > |  1.0|(692,[152,153,154...|     1| > |  1.0|(692,[151,152,153...|     1| > +-----+--------------------+------+ > only showing top 5 rows > scala> val lr = new > LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8) > lr: org.apache.spark.ml.classification.LogisticRegression = > logreg_ee0308a72919 > scala> val lrm = lr.fit(df3) > 16/09/20 15:46:12 WARN LogisticRegression: LogisticRegression training > finished but the result is not converged because: max iterations reached > lrm: org.apache.spark.ml.classification.LogisticRegressionModel = > logreg_ee0308a72919 > scala> val lr = new > LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setWeightCol("weight") > lr: org.apache.spark.ml.classification.LogisticRegression = > logreg_ced7579d5680 > scala> val lrm = lr.fit(df3) > 16/09/20 15:46:27 WARN BlockManager: Putting block rdd_211_0 failed > 16/09/20 15:46:27 ERROR Executor: Exception in task 0.0 in stage 89.0 (TID 92) > scala.MatchError: > 
[0.0,1,(692,[127,128,129,130,131,154,155,156,157,158,159,181,182,183,184,185,186,187,188,189,207,208,209,210,211,212,213,214,215,216,217,235,236,237,238,239,240,241,242,243,244,245,262,263,264,265,266,267,268,269,270,271,272,273,289,290,291,292,293,294,295,296,297,300,301,302,316,317,318,319,320,321,328,329,330,343,344,345,346,347,348,349,356,357,358,371,372,373,374,384,385,386,399,400,401,412,413,414,426,427,428,429,440,441,442,454,455,456,457,466,467,468,469,470,482,483,484,493,494,495,496,497,510,511,512,520,521,522,523,538,539,540,547,548,549,550,566,567,568,569,570,571,572,573,574,575,576,577,578,594,595,596,597,598,599,600,601,602,603,604,622,623,624,625,626,627,628,629,630,651,652,653,654,655,656,657],[51.0,159.0,253.0,159.0,50.0,48.0,238.0,252.0,252.0,252.0,237.0,54.0,227.0,253.0,252.0,239.0,233.0,252.0,57.0,6.0,10.0,60.0,224.0,252.0,253.0,252.0,202.0,84.0,252.0,253.0,122.0,163.0,252.0,252.0,252.0,253.0,252.0,252.0,96.0,189.0,253.0,167.0,51.0,238.0,253.0,253.0,190.0,114.0,253.0,228.0,47.0,79.0,255.0,168.0,48.0,238.0,252.0,252.0,179.0,12.0,75.0,121.0,21.0,253.0,243.0,50.0,38.0,165.0,253.0,233.0,208.0,84.0,253.0,252.0,165.0,7.0,178.0,252.0,240.0,71.0,19.0,28.0,253.0,252.0,195.0,57.0,252.0,252.0,63.0,253.0,252.0,195.0,198.0,253.0,190.0,255.0,253.0,196.0,76.0,246.0,252.0,112.0,253.0,252.0,148.0,85.0,252.0,230.0,25.0,7.0,135.0,253.0,186.0,12.0,85.0,252.0,223.0,7.0,131.0,252.0,225.0,71.0,85.0,252.0,145.0,48.0,165.0,252.0,173.0,86.0,253.0,225.0,114.0,238.0,253.0,162.0,85.0,252.0,249.0,146.0,48.0,29.0,85.0,178.0,225.0,253.0,223.0,167.0,56.0,85.0,252.0,252.0,252.0,229.0,215.0,252.0,252.0,252.0,196.0,130.0,28.0,199.0,252.0,252.0,253.0,252.0,252.0,233.0,145.0,25.0,128.0,252.0,253.0,252.0,141.0,37.0])] > (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema) > at > org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266) > at > 
org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.co
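The scala.MatchError above is raised while extracting rows because the weight is pattern-matched as a Double while the column holds an Int. Two usual remedies, sketched here as my own illustration rather than the eventual Spark patch: cast the column on the user side (roughly `df3.withColumn("weight", col("weight").cast(DoubleType))` before fitting), or extract the weight tolerantly inside the algorithm, as below.

```scala
// Sketch of a tolerant numeric extraction: accept any boxed numeric weight
// instead of pattern-matching it as Double, which is what throws MatchError
// for an Int weight column. This is an illustration, not the actual fix.
def toWeight(w: Any): Double = w match {
  case d: Double           => d
  case n: java.lang.Number => n.doubleValue() // Int, Long, Float, ...
  case other => throw new IllegalArgumentException(s"non-numeric weight: $other")
}

val fromInt    = toWeight(1)   // the Int weight from the df3 example above
val fromDouble = toWeight(2.5)
```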
[jira] [Resolved] (SPARK-14567) Add instrumentation logs to MLlib training algorithms
[ https://issues.apache.org/jira/browse/SPARK-14567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-14567. --- Resolution: Fixed Fix Version/s: 2.2.0 > Add instrumentation logs to MLlib training algorithms > - > > Key: SPARK-14567 > URL: https://issues.apache.org/jira/browse/SPARK-14567 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Timothy Hunter >Assignee: Timothy Hunter > Fix For: 2.2.0 > > > In order to debug performance issues when training mllib algorithms, > it is useful to log some metrics about the training dataset, the training > parameters, etc. > This ticket is an umbrella to add some simple logging messages to the most > common MLlib estimators. There should be no performance impact on the current > implementation, and the output is simply printed in the logs. > Here are some values that are of interest when debugging training tasks: > * number of features > * number of instances > * number of partitions > * number of classes > * input RDD/DF cache level > * hyper-parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18206) Log instrumentation in MPC, NB, LDA, AFT, GLR, Isotonic, LinReg
[ https://issues.apache.org/jira/browse/SPARK-18206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-18206. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15671 [https://github.com/apache/spark/pull/15671] > Log instrumentation in MPC, NB, LDA, AFT, GLR, Isotonic, LinReg > --- > > Key: SPARK-18206 > URL: https://issues.apache.org/jira/browse/SPARK-18206 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: zhengruifeng >Priority: Minor > Fix For: 2.2.0 > > > See parent JIRA -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7146) Should ML sharedParams be a public API?
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7146: - Target Version/s: (was: 2.2.0) > Should ML sharedParams be a public API? > --- > > Key: SPARK-7146 > URL: https://issues.apache.org/jira/browse/SPARK-7146 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > Proposal: Make most of the Param traits in sharedParams.scala public. Mark > them as DeveloperApi. > Pros: > * Sharing the Param traits helps to encourage standardized Param names and > documentation. > Cons: > * Users have to be careful since parameters can have different meanings for > different algorithms. > * If the shared Params are public, then implementations could test for the > traits. It is unclear if we want users to rely on these traits, which are > somewhat experimental. > Currently, the shared params are private. > h3. UPDATED proposal > * Some Params are clearly safe to make public. We will do so. > * Some Params could be made public but may require caveats in the trait doc. > * Some Params have turned out not to be shared in practice. We can move > those Params to the classes which use them. 
> *Public shared params*: > * I/O column params > ** HasFeaturesCol > ** HasInputCol > ** HasInputCols > ** HasLabelCol > ** HasOutputCol > ** HasPredictionCol > ** HasProbabilityCol > ** HasRawPredictionCol > ** HasVarianceCol > ** HasWeightCol > * Algorithm settings > ** HasCheckpointInterval > ** HasElasticNetParam > ** HasFitIntercept > ** HasMaxIter > ** HasRegParam > ** HasSeed > ** HasStandardization (less common) > ** HasStepSize > ** HasTol > *Questionable params*: > * HasHandleInvalid (only used in StringIndexer, but might be more widely used > later on) > * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but > same meaning as Optimizer in LDA) > *Params to be removed from sharedParams*: > * HasThreshold (only used in LogisticRegression) > * HasThresholds (only used in ProbabilisticClassifier) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
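The shared-Param traits listed above follow a mixin pattern. As a minimal, hypothetical Python sketch of that pattern (the real counterparts are the Scala traits in sharedParams.scala and the mixins in pyspark.ml.param.shared; all names below are illustrative assumptions, not the actual API):

```python
class Params:
    """Base class holding a param-name -> value map."""
    def __init__(self):
        self._param_map = {}

    def set(self, name, value):
        self._param_map[name] = value
        return self  # allow chaining

    def get(self, name, default=None):
        return self._param_map.get(name, default)


class HasInputCol(Params):
    """Shared mixin: stages that read a single input column."""
    def set_input_col(self, value):
        return self.set("inputCol", value)

    def get_input_col(self):
        return self.get("inputCol", "input")


class HasOutputCol(Params):
    """Shared mixin: stages that write a single output column."""
    def set_output_col(self, value):
        return self.set("outputCol", value)

    def get_output_col(self):
        return self.get("outputCol", "output")


class Tokenizer(HasInputCol, HasOutputCol):
    """A stage picks up standardized param names just by mixing in traits."""
    pass


tok = Tokenizer().set_input_col("text").set_output_col("words")
print(tok.get_input_col(), tok.get_output_col())  # text words
```

This illustrates the "pro" above (standardized names and docs come for free) and the "con": nothing stops two algorithms from giving the same shared param different meanings.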
[jira] [Updated] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10931: -- Shepherd: (was: Joseph K. Bradley) > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
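The fix suggested in the description (modify Estimator.fit to copy Param values onto the model) can be sketched as follows; this is an illustration only, with stand-in class names, not the actual pyspark.ml implementation:

```python
class Model:
    """Stand-in for a PySpark model wrapping a Java object."""
    def __init__(self):
        self.param_map = {}


class Estimator:
    """Stand-in estimator; real PySpark estimators delegate to the JVM."""
    def __init__(self, **params):
        self.param_map = dict(params)

    def _fit(self, dataset):
        # Subclasses would call into the underlying Java estimator here.
        return Model()

    def fit(self, dataset):
        model = self._fit(dataset)
        # The proposed change: propagate the estimator's Param values to
        # the model, so it is introspectable without reaching into the JVM.
        model.param_map.update(self.param_map)
        return model


est = Estimator(maxIter=10, regParam=0.1)
model = est.fit(dataset=None)
print(model.param_map)  # {'maxIter': 10, 'regParam': 0.1}
```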
[jira] [Updated] (SPARK-7424) spark.ml classification, regression abstractions should add metadata to output column
[ https://issues.apache.org/jira/browse/SPARK-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7424: - Target Version/s: (was: 2.2.0) > spark.ml classification, regression abstractions should add metadata to > output column > - > > Key: SPARK-7424 > URL: https://issues.apache.org/jira/browse/SPARK-7424 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Update ClassificationModel, ProbabilisticClassificationModel prediction to > include numClasses in output column metadata. > Update RegressionModel to specify output column metadata as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14501) spark.ml parity for fpm - frequent items
[ https://issues.apache.org/jira/browse/SPARK-14501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827052#comment-15827052 ] Joseph K. Bradley commented on SPARK-14501: --- I'm doing a general pass to encourage the Shepherd requirement from the roadmap process in [SPARK-18813]. Is someone interested in shepherding this issue, or shall I remove the target version? I'd really like to get this into 2.2 but don't have time to review it right now. Could another committer take it? > spark.ml parity for fpm - frequent items > > > Key: SPARK-14501 > URL: https://issues.apache.org/jira/browse/SPARK-14501 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Joseph K. Bradley > > This is an umbrella for porting the spark.mllib.fpm subpackage to spark.ml. > I am initially creating a single subtask, which will require a brief design > doc for the DataFrame-based API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10931: -- Target Version/s: (was: 2.2.0) > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15571) Pipeline unit test improvements for 2.2
[ https://issues.apache.org/jira/browse/SPARK-15571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15571: -- Shepherd: Joseph K. Bradley > Pipeline unit test improvements for 2.2 > --- > > Key: SPARK-15571 > URL: https://issues.apache.org/jira/browse/SPARK-15571 > Project: Spark > Issue Type: Test > Components: ML, PySpark >Reporter: Joseph K. Bradley > > Issue: > * There are several pieces of standard functionality shared by all > algorithms: Params, UIDs, fit/transform/save/load, etc. Currently, these > pieces are generally tested in ad hoc tests for each algorithm. > * This has led to inconsistent coverage, especially within the Python API. > Goal: > * Standardize unit tests for Scala and Python to improve and consolidate test > coverage for Params, persistence, and other common functionality. > * Eliminate duplicate code. Improve test coverage. Simplify adding these > standard unit tests for future algorithms and APIs. > This will require several subtasks. If you identify an issue, please create > a subtask, or comment below if the issue needs to be discussed first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14659) OneHotEncoder support drop first category alphabetically in the encoded vector
[ https://issues.apache.org/jira/browse/SPARK-14659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827041#comment-15827041 ] Joseph K. Bradley commented on SPARK-14659: --- I'm doing a general pass to encourage the Shepherd requirement from the roadmap process in [SPARK-18813]. Is someone interested in shepherding this issue, or shall I remove the target version? > OneHotEncoder support drop first category alphabetically in the encoded > vector > --- > > Key: SPARK-14659 > URL: https://issues.apache.org/jira/browse/SPARK-14659 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang > > R's formula interface drops the first category alphabetically when encoding a > string/category feature. Spark's RFormula uses OneHotEncoder to encode > string/category features into vectors, but it only supports "dropLast" by > string/category frequencies. This causes SparkR to produce different models > compared with native R. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
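The mismatch described in SPARK-14659 is easy to see in a toy encoder. This sketch is purely illustrative (not the Spark or R API): R drops the *first* category of the alphabetically ordered levels, while Spark's OneHotEncoder drops only the *last* category of the frequency-ordered levels, so the same value maps to different vectors:

```python
def one_hot(value, categories, drop="last"):
    """Dummy-encode `value` against an ordered category list, dropping one level."""
    kept = categories[1:] if drop == "first" else categories[:-1]
    return [1.0 if value == c else 0.0 for c in kept]

# R-style: order levels alphabetically, drop the first.
levels_r = sorted(["small", "medium", "large"])      # ['large', 'medium', 'small']
print(one_hot("small", levels_r, drop="first"))      # [0.0, 1.0]

# Spark-style: order levels (e.g. by descending frequency), drop the last.
levels_spark = ["small", "medium", "large"]
print(one_hot("small", levels_spark, drop="last"))   # [1.0, 0.0]
```

Because the two encodings span different reference categories, fitted coefficients differ even though the models are statistically equivalent, which is the discrepancy SparkR users observe.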
[jira] [Updated] (SPARK-14706) Python ML persistence integration test
[ https://issues.apache.org/jira/browse/SPARK-14706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14706: -- Target Version/s: (was: 2.2.0) > Python ML persistence integration test > -- > > Key: SPARK-14706 > URL: https://issues.apache.org/jira/browse/SPARK-14706 > Project: Spark > Issue Type: Test > Components: ML, PySpark >Reporter: Joseph K. Bradley > > Goal: extend integration test in {{ml/tests.py}}. > In the {{PersistenceTest}} suite, there is a method {{_compare_pipelines}}. > This issue includes: > * Extending {{_compare_pipelines}} to handle CrossValidator, > TrainValidationSplit, and OneVsRest > * Adding an integration test in PersistenceTest which includes nested > meta-algorithms. E.g.: {{Pipeline[ CrossValidator[ TrainValidationSplit[ > OneVsRest[ LogisticRegression ] ] ] ]}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15799) Release SparkR on CRAN
[ https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827038#comment-15827038 ] Joseph K. Bradley commented on SPARK-15799: --- I'm doing a general pass to encourage the Shepherd requirement from the roadmap process in [SPARK-18813]. Is someone interested in shepherding this issue, or shall I remove the target version? > Release SparkR on CRAN > -- > > Key: SPARK-15799 > URL: https://issues.apache.org/jira/browse/SPARK-15799 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Xiangrui Meng > > Story: "As an R user, I would like to see SparkR released on CRAN, so I can > use SparkR easily in an existing R environment and have other packages built > on top of SparkR." > I made this JIRA with the following questions in mind: > * Are there known issues that prevent us releasing SparkR on CRAN? > * Do we want to package Spark jars in the SparkR release? > * Are there license issues? > * How does it fit into Spark's release process? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend
[ https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827036#comment-15827036 ] Joseph K. Bradley commented on SPARK-16578: --- I'm doing a general pass to encourage the Shepherd requirement from the roadmap process in [SPARK-18813]. Is someone interested in shepherding this issue, or shall I remove the target version? > Configurable hostname for RBackend > -- > > Key: SPARK-16578 > URL: https://issues.apache.org/jira/browse/SPARK-16578 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Junyang Qian > > One of the requirements that comes up with SparkR being a standalone package > is that users can now install just the R package on the client side and > connect to a remote machine which runs the RBackend class. > We should check if we can support this mode of execution and what are the > pros / cons of it -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs
[ https://issues.apache.org/jira/browse/SPARK-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-17455: -- Shepherd: Joseph K. Bradley > IsotonicRegression takes non-polynomial time for some inputs > > > Key: SPARK-17455 > URL: https://issues.apache.org/jira/browse/SPARK-17455 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0 >Reporter: Nic Eggert > > The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently > in MLlib can take O(N!) time for certain inputs, when it should have > worst-case complexity of O(N^2). > To reproduce this, I pulled the private method poolAdjacentViolators out of > mllib.regression.IsotonicRegression and into a benchmarking harness. > Given this input > {code} > val x = (1 to length).toArray.map(_.toDouble) > val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 > else yi} > val w = Array.fill(length)(1d) > val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, > x), w) => (y, x, w)} > {code} > I vary the length of the input to get these timings: > || Input Length || Time (us) || > | 100 | 1.35 | > | 200 | 3.14 | > | 400 | 116.10 | > | 800 | 2134225.90 | > (tests were performed using > https://github.com/sirthias/scala-benchmarking-template) > I can also confirm that I run into this issue on a real dataset I'm working > on when trying to calibrate random forest probability output. Some partitions > take > 12 hours to run. This isn't a skew issue, since the largest partitions > finish in minutes. I can only assume that some partitions cause something > approaching this worst-case complexity. > I'm working on a patch that borrows the implementation that is used in > scikit-learn and the R "iso" package, both of which handle this particular > input in linear time and are quadratic in the worst case. 
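The stack-based Pool Adjacent Violators formulation referenced above (the approach used by scikit-learn and the R package mentioned) can be sketched in a few lines. This is an illustration of the technique, not the actual patch: each point is pushed once and each pool is merged at most once, so the loop is linear even on the alternating worst-case input from the benchmark:

```python
def isotonic_fit(y, w=None):
    """Return the isotonic (non-decreasing) weighted least-squares fit of y."""
    w = w or [1.0] * len(y)
    # Stack of pools; each pool is [weighted mean, total weight, point count].
    pools = []
    for yi, wi in zip(y, w):
        pools.append([yi, wi, 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(pools) > 1 and pools[-2][0] > pools[-1][0]:
            m2, w2, c2 = pools.pop()
            m1, w1, c1 = pools.pop()
            tot = w1 + w2
            pools.append([(m1 * w1 + m2 * w2) / tot, tot, c1 + c2])
    # Expand pools back to one fitted value per input point.
    fit = []
    for mean, _, count in pools:
        fit.extend([mean] * count)
    return fit


print(isotonic_fit([1.0, 3.0, 2.0, 4.0]))  # [1.0, 2.5, 2.5, 4.0]
```

Each element enters the stack exactly once and leaves at most once, giving the O(N) behavior on this input (and O(N^2) worst case overall) that the issue asks for.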
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18348) Improve tree ensemble model summary
[ https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827023#comment-15827023 ] Joseph K. Bradley commented on SPARK-18348: --- I'm doing a general pass to enforce the Shepherd requirement from the roadmap process in [SPARK-18813]. Note that the process is a new proposal and can be updated as needed. Is someone interested in shepherding this issue, or shall I remove the target version? > Improve tree ensemble model summary > --- > > Key: SPARK-18348 > URL: https://issues.apache.org/jira/browse/SPARK-18348 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Affects Versions: 2.0.0, 2.1.0 >Reporter: Felix Cheung > > During work on R APIs for tree ensemble models (eg. Random Forest, GBT) it is > discovered and discussed that > - we don't have a good summary on nodes or trees for their observations, > loss, probability and so on > - we don't have a shared API with nicely formatted output > We believe this could be a shared API that benefits multiple language > bindings, including R, when available. 
> For example, here is what R {code}rpart{code} shows for model summary: > {code} > Call: > rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis, > method = "class") > n= 81 > CP nsplit rel error xerror xstd > 1 0.17647059 0 1.000 1.000 0.2155872 > 2 0.01960784 1 0.8235294 0.9411765 0.2107780 > 3 0.0100 4 0.7647059 1.0588235 0.2200975 > Variable importance > Start Age Number > 64 24 12 > Node number 1: 81 observations, complexity param=0.1764706 > predicted class=absent expected loss=0.2098765 P(node) =1 > class counts: 64 17 > probabilities: 0.790 0.210 > left son=2 (62 obs) right son=3 (19 obs) > Primary splits: > Start < 8.5 to the right, improve=6.762330, (0 missing) > Number < 5.5 to the left, improve=2.866795, (0 missing) > Age < 39.5 to the left, improve=2.250212, (0 missing) > Surrogate splits: > Number < 6.5 to the left, agree=0.802, adj=0.158, (0 split) > Node number 2: 62 observations, complexity param=0.01960784 > predicted class=absent expected loss=0.09677419 P(node) =0.7654321 > class counts: 56 6 > probabilities: 0.903 0.097 > left son=4 (29 obs) right son=5 (33 obs) > Primary splits: > Start < 14.5 to the right, improve=1.0205280, (0 missing) > Age < 55 to the left, improve=0.6848635, (0 missing) > Number < 4.5 to the left, improve=0.2975332, (0 missing) > Surrogate splits: > Number < 3.5 to the left, agree=0.645, adj=0.241, (0 split) > Age < 16 to the left, agree=0.597, adj=0.138, (0 split) > Node number 3: 19 observations > predicted class=present expected loss=0.4210526 P(node) =0.2345679 > class counts: 8 11 > probabilities: 0.421 0.579 > Node number 4: 29 observations > predicted class=absent expected loss=0 P(node) =0.3580247 > class counts: 29 0 > probabilities: 1.000 0.000 > Node number 5: 33 observations, complexity param=0.01960784 > predicted class=absent expected loss=0.1818182 P(node) =0.4074074 > class counts: 27 6 > probabilities: 0.818 0.182 > left son=10 (12 obs) right son=11 (21 obs) > Primary splits: > Age < 55 to the left, improve=1.2467530, (0 missing) > Start < 12.5 to the right, improve=0.2887701, (0 missing) > Number < 3.5 to the right, improve=0.1753247, (0 missing) > Surrogate splits: > Start < 9.5 to the left, agree=0.758, adj=0.333, (0 split) > Number < 5.5 to the right, agree=0.697, adj=0.167, (0 split) > Node number 10: 12 observations > predicted class=absent expected loss=0 P(node) =0.1481481 > class counts: 12 0 > probabilities: 1.000 0.000 > Node number 11: 21 observations, complexity param=0.01960784 > predicted class=absent expected loss=0.2857143 P(node) =0.2592593 > class counts: 15 6 > probabilities: 0.714 0.286 > left son=22 (14 obs) right son=23 (7 obs) > Primary splits: > Age < 111 to the right, improve=1.71428600, (0 missing) > Start < 12.5 to the right, improve=0.79365080, (0 missing) > Number < 3.5 to the right, improve=0.07142857, (0 missing) > Node number 22: 14 observations > predicted class=absent expected loss=0.1428571 P(node) =0.1728395 > class counts: 12 2 > probabilities: 0.857 0.143 > Node number 23: 7 observations > predicted class=present expected loss=0.4285714 P(node) =0.08641975 > class counts: 3 4 > probabilities: 0.429 0.571 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)