[jira] [Updated] (SPARK-18812) Clarify "Spark ML"

2017-02-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18812:
--
Target Version/s: 2.1.0  (was: 2.1.1, 2.2.0)

> Clarify "Spark ML"
> --
>
> Key: SPARK-18812
> URL: https://issues.apache.org/jira/browse/SPARK-18812
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 2.1.0
>
>
> It is useful to add an FAQ entry to explain "Spark ML" and reduce confusion.






[jira] [Updated] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects

2017-02-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17822:
--
Target Version/s: 2.1.0, 2.0.3  (was: 2.0.3, 2.1.1, 2.2.0)

> JVMObjectTracker.objMap may leak JVM objects
> 
>
> Key: SPARK-17822
> URL: https://issues.apache.org/jira/browse/SPARK-17822
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
>Assignee: Xiangrui Meng
> Fix For: 2.0.3, 2.1.0
>
> Attachments: screenshot-1.png
>
>
> JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we 
> observed that JVM objects that are no longer used remain trapped in this map, 
> which prevents them from being garbage collected. 
> It seems to make sense to use weak references (like persistentRdds in 
> SparkContext). 
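> A minimal sketch of the weak-reference idea (hypothetical names, not the actual 
> JVMObjectTracker API):
> {code}
> import java.lang.ref.WeakReference
> import scala.collection.concurrent.TrieMap
> 
> // Hypothetical sketch: hold tracked objects via weak references so that entries
> // whose objects are no longer referenced elsewhere become eligible for GC.
> object WeakObjectTracker {
>   private val objMap = TrieMap[String, WeakReference[Object]]()
> 
>   def put(id: String, obj: Object): Unit = objMap.put(id, new WeakReference(obj))
> 
>   def get(id: String): Option[Object] =
>     objMap.get(id).flatMap(ref => Option(ref.get()))
> 
>   // Drop entries whose referents have already been garbage collected.
>   def prune(): Unit = objMap.foreach { case (id, ref) =>
>     if (ref.get() == null) objMap.remove(id)
>   }
> }
> {code}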






[jira] [Updated] (SPARK-18924) Improve collect/createDataFrame performance in SparkR

2017-02-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18924:
--
Target Version/s:   (was: 2.2.0)

> Improve collect/createDataFrame performance in SparkR
> -
>
> Key: SPARK-18924
> URL: https://issues.apache.org/jira/browse/SPARK-18924
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> SparkR has its own SerDe for data serialization between JVM and R.
> The SerDe on the JVM side is implemented in:
> * 
> [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala]
> * 
> [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala]
> The SerDe on the R side is implemented in:
> * 
> [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R]
> * 
> [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R]
> The serialization between JVM and R suffers from huge storage and computation 
> overhead. For example, a short round trip of 1 million doubles surprisingly 
> took 3 minutes on my laptop:
> {code}
> > system.time(collect(createDataFrame(data.frame(x=runif(1e6)))))
>user  system elapsed
>  14.224   0.582 189.135
> {code}
> Collecting a medium-sized DataFrame to local and continuing with a local R 
> workflow is a use case we should pay attention to. SparkR will never be able 
> to cover all existing features from CRAN packages, and it is unnecessary for 
> Spark to do so because not all features need scalability. 
> Several factors contribute to the serialization overhead:
> 1. The SerDe on the R side is implemented using high-level R methods.
> 2. DataFrame columns are not efficiently serialized, primitive-type columns 
> in particular.
> 3. Some overhead in the serialization protocol/implementation.
> 1) may have been discussed before, because R packages like rJava existed before 
> SparkR. I'm not sure whether depending on those libraries raises a license 
> issue. Another option is to switch to R's low-level C interface or Rcpp, 
> which again might have license issues. I'm not an expert here. If we have to 
> implement our own, there is still much room for improvement, discussed 
> below.
> 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, 
> which collects rows to local and then constructs columns. However,
> * it ignores column types, which results in boxing/unboxing overhead
> * it collects all objects to the driver, which results in high GC pressure
> A relatively simple change is to implement specialized column builders based 
> on column types, primitive types in particular. We need to handle null/NA 
> values properly. A simple data structure we can use is
> {code}
> val size: Int
> val nullIndexes: Array[Int]
> val notNullValues: Array[T] // specialized for primitive types
> {code}
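> A hypothetical sketch of such a builder for a nullable double column (not the 
> actual implementation; names are assumptions):
> {code}
> import scala.collection.mutable.ArrayBuffer
> 
> // Primitive values are stored unboxed; NA positions are tracked separately.
> class DoubleColumnBuilder {
>   private var size = 0
>   private val nullIndexes = ArrayBuffer[Int]()
>   private val notNullValues = ArrayBuffer[Double]()
> 
>   def append(value: java.lang.Double): Unit = {
>     if (value == null) nullIndexes += size else notNullValues += value.doubleValue()
>     size += 1
>   }
> 
>   // Reassemble the column, re-inserting NAs (represented here as Double.NaN).
>   def build(): Array[Double] = {
>     val out = Array.fill(size)(Double.NaN)
>     val nulls = nullIndexes.toSet
>     var i = 0
>     var j = 0
>     while (i < size) {
>       if (!nulls.contains(i)) { out(i) = notNullValues(j); j += 1 }
>       i += 1
>     }
>     out
>   }
> }
> {code}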
> On the R side, we can use `readBin` and `writeBin` to read the entire vector 
> in a single method call. The speed seems reasonable (on the order of GB/s):
> {code}
> > x <- runif(1e7) # 1e7, not 1e6
> > system.time(r <- writeBin(x, raw(0)))
>    user  system elapsed
>   0.036   0.021   0.059
> > system.time(y <- readBin(r, double(), 1e7))
>    user  system elapsed
>   0.015   0.007   0.024
> {code}
> This is just a proposal that needs to be discussed and formalized. But in 
> general, it should be feasible to obtain a 20x or greater performance gain.






[jira] [Commented] (SPARK-19337) Documentation and examples for LinearSVC

2017-02-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870512#comment-15870512
 ] 

Joseph K. Bradley commented on SPARK-19337:
---

@yuhao yang  Do you have time to work on this?  Thanks!

> Documentation and examples for LinearSVC
> 
>
> Key: SPARK-19337
> URL: https://issues.apache.org/jira/browse/SPARK-19337
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> User guide + example code for LinearSVC






[jira] [Commented] (SPARK-14523) Feature parity for Statistics ML with MLlib

2017-02-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861789#comment-15861789
 ] 

Joseph K. Bradley commented on SPARK-14523:
---

I'd like to keep this open until we have linked tasks for the missing 
functionality.

[~hujiayin] This is for parity w.r.t. the RDD-based API, not for adding new 
functionality to MLlib.  I think there's already a JIRA for ARIMA somewhere.

> Feature parity for Statistics ML with MLlib
> ---
>
> Key: SPARK-14523
> URL: https://issues.apache.org/jira/browse/SPARK-14523
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: yuhao yang
>
> Some statistics functions are already supported directly by DataFrames. Use this 
> JIRA to discuss/design the statistics package in spark.ml and its functional 
> scope. Hypothesis testing and correlation computation may still need to expose 
> independent interfaces.






[jira] [Reopened] (SPARK-14523) Feature parity for Statistics ML with MLlib

2017-02-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reopened SPARK-14523:
---

> Feature parity for Statistics ML with MLlib
> ---
>
> Key: SPARK-14523
> URL: https://issues.apache.org/jira/browse/SPARK-14523
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: yuhao yang
>
> Some statistics functions are already supported directly by DataFrames. Use this 
> JIRA to discuss/design the statistics package in spark.ml and its functional 
> scope. Hypothesis testing and correlation computation may still need to expose 
> independent interfaces.






[jira] [Resolved] (SPARK-18613) spark.ml LDA classes should not expose spark.mllib in APIs

2017-02-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18613.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16860
[https://github.com/apache/spark/pull/16860]

> spark.ml LDA classes should not expose spark.mllib in APIs
> --
>
> Key: SPARK-18613
> URL: https://issues.apache.org/jira/browse/SPARK-18613
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Sue Ann Hong
>Priority: Critical
> Fix For: 2.2.0
>
>
> spark.ml.LDAModel exposes dependencies on spark.mllib in 2 methods, but it 
> should not:
> * {{def oldLocalModel: OldLocalLDAModel}}
> * {{def getModel: OldLDAModel}}
> This task is to deprecate those methods.  I recommend creating 
> {{private[ml]}} versions of the methods which are used internally in order to 
> avoid deprecation warnings.
> Setting target for 2.2, but I'm OK with getting it into 2.1 if we have time.
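> A rough sketch of the deprecation pattern (hypothetical class, not the actual 
> LDAModel code):
> {code}
> package org.apache.spark.ml.clustering
> 
> import org.apache.spark.mllib.clustering.{LocalLDAModel => OldLocalLDAModel}
> 
> // Keep a private[ml] accessor for internal callers and deprecate the public
> // method that exposes the spark.mllib type.
> class ExampleLDAModel private[ml] (private val oldModel: OldLocalLDAModel) {
>   private[ml] def getOldLocalModel: OldLocalLDAModel = oldModel
> 
>   @deprecated("Do not expose spark.mllib types in the spark.ml API.", "2.2.0")
>   def oldLocalModel: OldLocalLDAModel = getOldLocalModel
> }
> {code}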






[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest

2017-02-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861714#comment-15861714
 ] 

Joseph K. Bradley commented on SPARK-9478:
--

[~sethah] Thanks for researching this!  +1 for not using weights during bagging 
and using importance weights to compensate.  Intuitively, that seems like it 
should give better estimators for class conditional probabilities than the 
other option.

If you're splitting this into trees and forests, could you please target your 
PR against a subtask for trees?

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class 
> weights. Class weights are important when training data is imbalanced or 
> when the evaluation metric of a classifier is imbalanced (e.g., true positive 
> rate at some false positive threshold). 






[jira] [Commented] (SPARK-10802) Let ALS recommend for subset of data

2017-02-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860110#comment-15860110
 ] 

Joseph K. Bradley commented on SPARK-10802:
---

Linking related issue for feature parity in DataFrame-based API.


> Let ALS recommend for subset of data
> 
>
> Key: SPARK-10802
> URL: https://issues.apache.org/jira/browse/SPARK-10802
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> Currently MatrixFactorizationModel allows getting recommendations for
> - a single user 
> - a single product 
> - all users
> - all products
> Recommendations for all users/products perform a cartesian join internally.
> It would be useful in some cases to get recommendations for a subset of 
> users/products by providing an RDD with which MatrixFactorizationModel could 
> do an intersection before doing the cartesian join. This would make it much 
> faster in situations where recommendations are needed only for a subset of 
> users/products, but the subset is still too large to make it feasible to 
> recommend one by one.
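> A rough sketch of the idea (hypothetical method and types, not the 
> MatrixFactorizationModel API):
> {code}
> import org.apache.spark.rdd.RDD
> 
> // Restrict the user-factor RDD to the requested subset via a join before the
> // cartesian join, so only the needed scores are computed.
> def recommendForUserSubset(
>     userFeatures: RDD[(Int, Array[Double])],
>     productFeatures: RDD[(Int, Array[Double])],
>     users: RDD[Int],
>     num: Int): RDD[(Int, Array[(Int, Double)])] = {
>   val subset = users.map((_, ())).join(userFeatures).mapValues(_._2)
>   subset.cartesian(productFeatures)
>     .map { case ((u, uf), (p, pf)) =>
>       val score = uf.iterator.zip(pf.iterator).map { case (a, b) => a * b }.sum
>       (u, (p, score))
>     }
>     .groupByKey()
>     .mapValues(_.toArray.sortBy(_._2)(Ordering[Double].reverse).take(num))
> }
> {code}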






[jira] [Created] (SPARK-19535) ALSModel recommendAll analogs

2017-02-09 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-19535:
-

 Summary: ALSModel recommendAll analogs
 Key: SPARK-19535
 URL: https://issues.apache.org/jira/browse/SPARK-19535
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: Joseph K. Bradley
Assignee: Sue Ann Hong


Add methods analogous to the spark.mllib MatrixFactorizationModel methods 
recommendProductsForUsers/UsersForProducts.

The initial implementation should be very simple, using DataFrame joins.  
Future work can add optimizations.

I recommend naming them:
* recommendForAllUsers
* recommendForAllItems
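
A rough sketch of what the simple DataFrame-join implementation could look like 
(column names and the dot-product UDF are assumptions, not the final API):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Cross join user and item factors, score each pair by dot product, and keep
// the top k items per user.
def recommendForAllUsers(userFactors: DataFrame,   // (id, features)
                         itemFactors: DataFrame,   // (id, features)
                         k: Int): DataFrame = {
  val dot = udf { (u: Seq[Float], v: Seq[Float]) =>
    u.iterator.zip(v.iterator).map { case (a, b) => a * b }.sum
  }
  val scored = userFactors.toDF("userId", "userFeatures")
    .crossJoin(itemFactors.toDF("itemId", "itemFeatures"))
    .select(col("userId"), col("itemId"),
      dot(col("userFeatures"), col("itemFeatures")).as("rating"))
  val ranked = scored.withColumn("rank",
    row_number().over(Window.partitionBy("userId").orderBy(desc("rating"))))
  ranked.where(col("rank") <= k).drop("rank")
}
{code}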







[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-02-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860108#comment-15860108
 ] 

Joseph K. Bradley commented on SPARK-13857:
---

Hi all, catching up on these ALS discussions now.  This work to support 
evaluation and tuning for recommendation is great, but I'm worried about it not 
being resolved in time for 2.2.  I've heard a lot of requests for the plain 
functionality available in spark.mllib for recommendUsers/Products, so I'd 
recommend we just add those methods for now as a short-term solution.  Let's 
keep working on the evaluation/tuning plans too.  I'll create a JIRA for adding 
basic recommendUsers/Products methods.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).






[jira] [Closed] (SPARK-10141) Number of tasks on executors still become negative after failures

2017-02-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-10141.
-
Resolution: Done

> Number of tasks on executors still become negative after failures
> -
>
> Key: SPARK-10141
> URL: https://issues.apache.org/jira/browse/SPARK-10141
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Priority: Minor
> Attachments: Screen Shot 2015-08-20 at 3.14.49 PM.png
>
>
> I hit this failure when running LDA on EC2 (after I made the model size 
> really big).
> I was using the LDAExample.scala code on an EC2 cluster with 16 workers 
> (r3.2xlarge), on a Wikipedia dataset:
> {code}
> Training set size (documents) 4534059
> Vocabulary size (terms)   1
> Training set size (tokens)895575317
> EM optimizer
> 1K topics
> {code}
> Failure message:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in 
> stage 22.0 failed 4 times, most recent failure: Lost task 55.3 in stage 22.0 
> (TID 2881, 10.0.202.128): java.io.IOException: Failed to connect to 
> /10.0.202.128:54740
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused: /10.0.202.128:54740
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>   at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   ... 1 more
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1267)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1255)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1254)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1254)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:684)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1480)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1442)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1431)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:554)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1805)
>   at org.apache.spark.SparkContext.r

[jira] [Assigned] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2017-02-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-17975:
-

Assignee: Tathagata Das

> EMLDAOptimizer fails with ClassCastException on YARN
> 
>
> Key: SPARK-17975
> URL: https://issues.apache.org/jira/browse/SPARK-17975
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Centos 6, CDH 5.7, Java 1.7u80
>Reporter: Jeff Stein
>Assignee: Tathagata Das
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
> Attachments: docs.txt
>
>
> I'm able to reproduce the error consistently with a 2000 record text file 
> with each record having 1-5 terms and checkpointing enabled. It looks like 
> the problem was introduced with the resolution for SPARK-13355.
> The EdgeRDD class seems to be lying about its type in a way that causes the 
> RDD.mapPartitionsWithIndex method to be unusable when the RDD is referenced 
> as an RDD of Edge elements.
> {code}
> val spark = SparkSession.builder.appName("lda").getOrCreate()
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
> val data: RDD[(Long, Vector)] = // snip
> data.setName("data").cache()
> val lda = new LDA
> val optimizer = new EMLDAOptimizer
> lda.setOptimizer(optimizer)
>   .setK(10)
>   .setMaxIterations(400)
>   .setAlpha(-1)
>   .setBeta(-1)
>   .setCheckpointInterval(7)
> val ldaModel = lda.run(data)
> {code}
> {noformat}
> 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID 
> 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be 
> cast to org.apache.spark.graphx.Edge
>   at 
> org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> {noformat}




[jira] [Resolved] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2017-02-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-17975.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1
   2.0.3

> EMLDAOptimizer fails with ClassCastException on YARN
> 
>
> Key: SPARK-17975
> URL: https://issues.apache.org/jira/browse/SPARK-17975
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Centos 6, CDH 5.7, Java 1.7u80
>Reporter: Jeff Stein
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
> Attachments: docs.txt
>
>
> I'm able to reproduce the error consistently with a 2000 record text file 
> with each record having 1-5 terms and checkpointing enabled. It looks like 
> the problem was introduced with the resolution for SPARK-13355.
> The EdgeRDD class seems to be lying about its type in a way that causes the 
> RDD.mapPartitionsWithIndex method to be unusable when the RDD is referenced 
> as an RDD of Edge elements.
> {code}
> val spark = SparkSession.builder.appName("lda").getOrCreate()
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
> val data: RDD[(Long, Vector)] = // snip
> data.setName("data").cache()
> val lda = new LDA
> val optimizer = new EMLDAOptimizer
> lda.setOptimizer(optimizer)
>   .setK(10)
>   .setMaxIterations(400)
>   .setAlpha(-1)
>   .setBeta(-1)
>   .setCheckpointInterval(7)
> val ldaModel = lda.run(data)
> {code}
> {noformat}
> 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID 
> 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be 
> cast to org.apache.spark.graphx.Edge
>   at 
> org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> {noformat}




[jira] [Updated] (SPARK-14804) Graph vertexRDD/EdgeRDD checkpoint results ClassCastException:

2017-02-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14804:
--
Fix Version/s: (was: 3.0.0)
   2.2.0

> Graph vertexRDD/EdgeRDD checkpoint results ClassCastException: 
> ---
>
> Key: SPARK-14804
> URL: https://issues.apache.org/jira/browse/SPARK-14804
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.1
>Reporter: SuYan
>Assignee: Tathagata Das
>Priority: Minor
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> {code}
> graph3.vertices.checkpoint()
> graph3.vertices.count()
> graph3.vertices.map(_._2).count()
> {code}
> 16/04/21 21:04:43 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 
> (TID 13, localhost): java.lang.ClassCastException: 
> org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast to 
> scala.Tuple2
>   at 
> com.xiaomi.infra.codelab.spark.Graph2$$anonfun$main$1.apply(Graph2.scala:80)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:91)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:219)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> look at the code:
> {code}
>   private[spark] def computeOrReadCheckpoint(split: Partition, context: 
> TaskContext): Iterator[T] =
>   {
> if (isCheckpointedAndMaterialized) {
>   firstParent[T].iterator(split, context)
> } else {
>   compute(split, context)
> }
>   }
>  private[spark] def isCheckpointedAndMaterialized: Boolean = isCheckpointed
>  override def isCheckpointed: Boolean = {
>firstParent[(PartitionID, EdgePartition[ED, VD])].isCheckpointed
>  }
> {code}
> For VertexRDD or EdgeRDD, the first parent is its partitionRDD: 
> RDD[ShippableVertexPartition[VD]] / RDD[(PartitionID, EdgePartition[ED, VD])].
> 1. We call vertexRDD.checkpoint; its partitionRDD is checkpointed, so 
> VertexRDD.isCheckpointedAndMaterialized = true.
> 2. Then we call vertexRDD.iterator; because isCheckpointed = true, it calls 
> firstParent.iterator (which is not a CheckpointRDD, but actually the 
> partitionRDD). 
> So the returned iterator is an Iterator[ShippableVertexPartition], not the 
> expected Iterator[(VertexId, VD)].






[jira] [Commented] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2017-02-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859984#comment-15859984
 ] 

Joseph K. Bradley commented on SPARK-17975:
---

Will do, thanks!

> EMLDAOptimizer fails with ClassCastException on YARN
> 
>
> Key: SPARK-17975
> URL: https://issues.apache.org/jira/browse/SPARK-17975
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Centos 6, CDH 5.7, Java 1.7u80
>Reporter: Jeff Stein
> Attachments: docs.txt
>
>
> I'm able to reproduce the error consistently with a 2000 record text file 
> with each record having 1-5 terms and checkpointing enabled. It looks like 
> the problem was introduced with the resolution for SPARK-13355.
> The EdgeRDD class seems to be lying about its type in a way that causes the 
> RDD.mapPartitionsWithIndex method to be unusable when the RDD is referenced 
> as an RDD of Edge elements.
> {code}
> val spark = SparkSession.builder.appName("lda").getOrCreate()
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
> val data: RDD[(Long, Vector)] = // snip
> data.setName("data").cache()
> val lda = new LDA
> val optimizer = new EMLDAOptimizer
> lda.setOptimizer(optimizer)
>   .setK(10)
>   .setMaxIterations(400)
>   .setAlpha(-1)
>   .setBeta(-1)
>   .setCheckpointInterval(7)
> val ldaModel = lda.run(data)
> {code}
> {noformat}
> 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID 
> 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be 
> cast to org.apache.spark.graphx.Edge
>   at 
> org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> {noformat}




[jira] [Commented] (SPARK-10141) Number of tasks on executors still become negative after failures

2017-02-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859982#comment-15859982
 ] 

Joseph K. Bradley commented on SPARK-10141:
---

I'll close this if no one has seen it in Spark 2.0 or 2.1.  Thanks all!

> Number of tasks on executors still become negative after failures
> -
>
> Key: SPARK-10141
> URL: https://issues.apache.org/jira/browse/SPARK-10141
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Priority: Minor
> Attachments: Screen Shot 2015-08-20 at 3.14.49 PM.png
>
>
> I hit this failure when running LDA on EC2 (after I made the model size 
> really big).
> I was using the LDAExample.scala code on an EC2 cluster with 16 workers 
> (r3.2xlarge), on a Wikipedia dataset:
> {code}
> Training set size (documents) 4534059
> Vocabulary size (terms)   1
> Training set size (tokens)895575317
> EM optimizer
> 1K topics
> {code}
> Failure message:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in 
> stage 22.0 failed 4 times, most recent failure: Lost task 55.3 in stage 22.0 
> (TID 2881, 10.0.202.128): java.io.IOException: Failed to connect to 
> /10.0.202.128:54740
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused: /10.0.202.128:54740
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>   at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   ... 1 more
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1267)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1255)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1254)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1254)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:684)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1480)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1442)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1431)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:554)
>   

[jira] [Assigned] (SPARK-18613) spark.ml LDA classes should not expose spark.mllib in APIs

2017-02-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-18613:
-

Assignee: Sue Ann Hong

> spark.ml LDA classes should not expose spark.mllib in APIs
> --
>
> Key: SPARK-18613
> URL: https://issues.apache.org/jira/browse/SPARK-18613
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Sue Ann Hong
>Priority: Critical
>
> spark.ml.LDAModel exposes dependencies on spark.mllib in 2 methods, but it 
> should not:
> * {{def oldLocalModel: OldLocalLDAModel}}
> * {{def getModel: OldLDAModel}}
> This task is to deprecate those methods.  I recommend creating 
> {{private[ml]}} versions of the methods which are used internally in order to 
> avoid deprecation warnings.
> Setting target for 2.2, but I'm OK with getting it into 2.1 if we have time.






[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2017-02-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858373#comment-15858373
 ] 

Joseph K. Bradley commented on SPARK-17139:
---

@sethah  Yep, that looks like what I had in mind.  Thanks for writing it out 
clearly.  What do you think?

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Add model summary to multinomial logistic regression using same interface as 
> in other ML models.






[jira] [Resolved] (SPARK-19400) GLM fails for intercept only model

2017-02-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-19400.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16740
[https://github.com/apache/spark/pull/16740]

> GLM fails for intercept only model
> --
>
> Key: SPARK-19400
> URL: https://issues.apache.org/jira/browse/SPARK-19400
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
> Fix For: 2.2.0
>
>
> Intercept-only GLM fails for non-Gaussian families because IWLS reduces an 
> empty array. 
> {code}
> val dataset = Seq(
>   (1.0, 1.0, 2.0, 0.0, 5.0),
>   (0.5, 2.0, 1.0, 1.0, 2.0),
>   (1.0, 3.0, 0.5, 2.0, 1.0),
>   (2.0, 4.0, 1.5, 3.0, 3.0)
> ).toDF("y", "w", "off", "x1", "x2")
> val formula = new RFormula().setFormula("y ~ 1")
> val output = formula.fit(dataset).transform(dataset)
> val glr = new GeneralizedLinearRegression().setFamily("poisson")
> val model = glr.fit(output)
> java.lang.UnsupportedOperationException: empty.reduceLeft
> {code}
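> A minimal illustration of the failure mode (not the actual IWLS code): reducing 
> an empty collection throws, while guarding on isEmpty (or folding with an 
> initial value) does not.
> {code}
> val empty = Array.empty[Double]
> // empty.reduceLeft(_ max _)  // throws java.lang.UnsupportedOperationException: empty.reduceLeft
> val safe = if (empty.isEmpty) 0.0 else empty.reduceLeft(_ max _)
> {code}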






[jira] [Assigned] (SPARK-19400) GLM fails for intercept only model

2017-02-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-19400:
-

Assignee: Wayne Zhang

> GLM fails for intercept only model
> --
>
> Key: SPARK-19400
> URL: https://issues.apache.org/jira/browse/SPARK-19400
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
> Fix For: 2.2.0
>
>
> Intercept-only GLM fails for non-Gaussian families because IWLS reduces an 
> empty array. 
> {code}
> val dataset = Seq(
>   (1.0, 1.0, 2.0, 0.0, 5.0),
>   (0.5, 2.0, 1.0, 1.0, 2.0),
>   (1.0, 3.0, 0.5, 2.0, 1.0),
>   (2.0, 4.0, 1.5, 3.0, 3.0)
> ).toDF("y", "w", "off", "x1", "x2")
> val formula = new RFormula().setFormula("y ~ 1")
> val output = formula.fit(dataset).transform(dataset)
> val glr = new GeneralizedLinearRegression().setFamily("poisson")
> val model = glr.fit(output)
> java.lang.UnsupportedOperationException: empty.reduceLeft
> {code}






[jira] [Assigned] (SPARK-17629) Add local version of Word2Vec findSynonyms for spark.ml

2017-02-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-17629:
-

 Shepherd: Joseph K. Bradley
 Assignee: Asher Krim
Affects Version/s: 2.2.0
 Target Version/s: 2.2.0
  Component/s: ML
   Issue Type: New Feature  (was: Question)
  Summary: Add local version of Word2Vec findSynonyms for spark.ml  
(was: Should ml Word2Vec findSynonyms match the mllib implementation?)

> Add local version of Word2Vec findSynonyms for spark.ml
> ---
>
> Key: SPARK-17629
> URL: https://issues.apache.org/jira/browse/SPARK-17629
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Asher Krim
>Assignee: Asher Krim
>Priority: Minor
>
> ml Word2Vec's findSynonyms methods depart from mllib in that they return 
> distributed results, rather than the results directly:
> {code}
>   def findSynonyms(word: String, num: Int): DataFrame = {
>     val spark = SparkSession.builder().getOrCreate()
>     spark.createDataFrame(wordVectors.findSynonyms(word, num))
>       .toDF("word", "similarity")
>   }
> {code}
> What was the reason for this decision? I would think that most users would 
> request a reasonably small number of results back, and want to use them 
> directly on the driver, similar to the _take_ method on dataframes. Returning 
> parallelized results creates a costly round trip for the data that doesn't 
> seem necessary.
> The original PR: https://github.com/apache/spark/pull/7263
> [~MechCoder] - do you perhaps recall the reason?
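> A possible local variant, sketched against the same private {{wordVectors}} 
> field used above (method name is an assumption):
> {code}
>   // Return the synonyms directly to the driver instead of wrapping them in a
>   // DataFrame; mllib's findSynonyms already returns Array[(String, Double)].
>   def findSynonymsArray(word: String, num: Int): Array[(String, Double)] =
>     wordVectors.findSynonyms(word, num)
> {code}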






[jira] [Commented] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'

2017-02-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856918#comment-15856918
 ] 

Joseph K. Bradley commented on SPARK-17498:
---

Linking related issue for QuantileDiscretizer where we provide a handleInvalid 
option for putting invalid values in a special bucket.

> StringIndexer.setHandleInvalid should have another option 'new'
> ---
>
> Key: SPARK-17498
> URL: https://issues.apache.org/jira/browse/SPARK-17498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Miroslav Balaz
>
> That would map an unseen label to the maximum known label + 1; IndexToString 
> would map it back to "" or NA, if there is something like that in Spark.
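> A tiny illustration of the proposed mapping (hypothetical code, not the 
> StringIndexer implementation):
> {code}
> // Labels seen during fit get indices 0..n-1; an unseen label maps to n,
> // i.e. the maximum known index + 1.
> val labelToIndex = Map("a" -> 0.0, "b" -> 1.0, "c" -> 2.0)
> def indexOf(label: String): Double =
>   labelToIndex.getOrElse(label, labelToIndex.size.toDouble)  // unseen -> 3.0
> {code}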






[jira] [Updated] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'

2017-02-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17498:
--
Summary: StringIndexer.setHandleInvalid should have another option 'new'  
(was: StringIndexer.setHandleInvalid sohuld have another option 'new')

> StringIndexer.setHandleInvalid should have another option 'new'
> ---
>
> Key: SPARK-17498
> URL: https://issues.apache.org/jira/browse/SPARK-17498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Miroslav Balaz
>
> That would map an unseen label to the maximum known label + 1; IndexToString 
> would map it back to "" or NA, if there is something like that in Spark.






[jira] [Comment Edited] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2017-02-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856612#comment-15856612
 ] 

Joseph K. Bradley edited comment on SPARK-17139 at 2/7/17 7:25 PM:
---

I'll offer a few thoughts first:
* A "ClassificationSummary" could be the same as a 
"MulticlassClassificationSummary" because binary is a special type of 
multiclass.
* Following the structure of abstractions for Prediction is reasonable.
* Separating binary and multiclass is reasonable; the separation is more 
significant for evaluation than for the Prediction abstractions.
* Abstract classes have been a pain in the case of Prediction abstractions, so 
I'd prefer we use traits.

The 2 alternatives I see are:
1. BinaryClassificationSummary inherits from ClassificationSummary.  No 
separate MulticlassClassificationSummary.
2. BinaryClassificationSummary and MulticlassClassificationSummary inherit from 
ClassificationSummary.

Both alternatives are semantically reasonable.  However, since 
ClassificationSummary = MulticlassClassificationSummary in terms of 
functionality, and since the Prediction abstractions combine binary and 
multiclass, I prefer option 1.

If we go with option 1, then we need 4 concrete classes:
* LogisticRegressionSummary
* LogisticRegressionTrainingSummary
* BinaryLogisticRegressionSummary
* BinaryLogisticRegressionTrainingSummary

We would definitely want binary summaries to inherit from their multiclass 
counterparts, and for training summaries to inherit from their general 
counterparts:
* LogisticRegressionSummary
* LogisticRegressionTrainingSummary: LogisticRegressionSummary
* BinaryLogisticRegressionSummary: LogisticRegressionSummary
* BinaryLogisticRegressionTrainingSummary: LogisticRegressionTrainingSummary, 
BinaryLogisticRegressionSummary

Of course, this is a problem.  But we could solve it by having all of these be 
traits, with concrete classes inheriting.  I.e., 
{{LogisticRegressionModel.summary}} could return {{trait 
LogisticRegressionTrainingSummary}}, which could be of concrete type 
{{LogisticRegressionTrainingSummaryImpl}} (multiclass) or 
{{BinaryLogisticRegressionTrainingSummaryImpl}} (binary).

I suspect MiMa will complain about this, but IIRC it's safe since all of these 
summaries have private constructors and can't be extended outside of Spark.

Btw, we could introduce a set of abstractions matching the Prediction ones, but 
that should probably happen under a separate JIRA.

What do you think?



> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
>

[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2017-02-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856612#comment-15856612
 ] 

Joseph K. Bradley commented on SPARK-17139:
---

I'll offer a few thoughts first:
* A "ClassificationSummary" could be the same as a 
"MulticlassClassificationSummary" because binary is a special type of 
multiclass.
* Following the structure of abstractions for Prediction is reasonable.
* Separating binary and multiclass is reasonable; the separation is more 
significant for evaluation than for the Prediction abstractions.
* Abstract classes have been a pain in the case of Prediction abstractions, so 
I'd prefer we use traits.

The 2 alternatives I see are:
1. BinaryClassificationSummary inherits from ClassificationSummary.  No 
separate MulticlassClassificationSummary.
2. BinaryClassificationSummary and MulticlassClassificationSummary inherit from 
ClassificationSummary.

Both alternatives are semantically reasonable.  However, since 
ClassificationSummary = MulticlassClassificationSummary in terms of 
functionality, and since the Prediction abstractions combine binary and 
multiclass, I prefer option 1.

If we go with option 1, then we need 4 concrete classes:
* LogisticRegressionSummary
* LogisticRegressionTrainingSummary
* BinaryLogisticRegressionSummary
* BinaryLogisticRegressionTrainingSummary

We would definitely want binary summaries to inherit from their multiclass 
counterparts, and for training summaries to inherit from their general 
counterparts:
* LogisticRegressionSummary
* LogisticRegressionTrainingSummary: LogisticRegressionSummary
* BinaryLogisticRegressionSummary: LogisticRegressionSummary
* BinaryLogisticRegressionTrainingSummary: LogisticRegressionTrainingSummary, 
BinaryLogisticRegressionSummary

Of course, this is a problem.  But we could solve it by having all of these be 
traits, with concrete classes inheriting.  I.e., 
{{LogisticRegressionModel.summary}} could return {{trait 
LogisticRegressionTrainingSummary}}, which could be of concrete type 
{{LogisticRegressionTrainingSummaryImpl}} (multiclass) or 
{{BinaryLogisticRegressionTrainingSummaryImpl}} (binary).

I suspect MiMa will complain about this, but IIRC it's safe since all of these 
summaries have private constructors and can't be extended outside of Spark.

What do you think?

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Add model summary to multinomial logistic regression using same interface as 
> in other ML models.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10817) ML abstraction umbrella

2017-02-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10817:
--
Priority: Major  (was: Critical)

> ML abstraction umbrella
> ---
>
> Key: SPARK-10817
> URL: https://issues.apache.org/jira/browse/SPARK-10817
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> This is an umbrella for discussing and creating ML abstractions.  This was 
> originally handled under [SPARK-1856] and [SPARK-3702], under which we 
> created the Pipelines API and some Developer APIs for classification and 
> regression.
> This umbrella is for future work, including:
> * Stabilizing the classification and regression APIs
> * Discussing traits vs. abstract classes for abstraction APIs
> * Creating other abstractions not yet covered (clustering, multilabel 
> prediction, etc.)
> Note that [SPARK-3702] still has useful discussion and design docs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries

2017-02-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19498:
--
Description: 
Per the recent discussion on the dev list, this JIRA is for discussing how we 
can make MLlib DataFrame-based APIs more extensible, especially for the purpose 
of writing 3rd-party libraries with APIs extended from the MLlib APIs (for 
custom Transformers, Estimators, etc.).
* For people who have written such libraries, what issues have you run into?
* What APIs are not public or extensible enough?  Do they require changes 
before being made more public?
* Are APIs for non-Scala languages such as Java and Python friendly or 
extensive enough?

The easy answer is to make everything public, but that would be terrible of 
course in the long-term.  Let's discuss what is needed and how we can present 
stable, sufficient, and easy-to-use APIs for 3rd-party developers.
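
For concreteness, here is a rough sketch of the kind of 3rd-party extension being 
discussed: a minimal custom Transformer built only on public APIs (the class and 
column names are made up for illustration):
{code}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.upper
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical 3rd-party Transformer: upper-cases a hard-coded "text" column.
class UpperCaseTransformer(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("upperCase"))

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("text_upper", upper(dataset("text")))

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField("text_upper", StringType, nullable = true))

  override def copy(extra: ParamMap): UpperCaseTransformer = defaultCopy(extra)
}
{code}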

> Discussion: Making MLlib APIs extensible for 3rd party libraries
> 
>
> Key: SPARK-19498
> URL: https://issues.apache.org/jira/browse/SPARK-19498
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Per the recent discussion on the dev list, this JIRA is for discussing how we 
> can make MLlib DataFrame-based APIs more extensible, especially for the 
> purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs 
> (for custom Transformers, Estimators, etc.).
> * For people who have written such libraries, what issues have you run into?
> * What APIs are not public or extensible enough?  Do they require changes 
> before being made more public?
> * Are APIs for non-Scala languages such as Java and Python friendly or 
> extensive enough?
> The easy answer is to make everything public, but that would be terrible of 
> course in the long-term.  Let's discuss what is needed and how we can present 
> stable, sufficient, and easy-to-use APIs for 3rd-party developers.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries

2017-02-07 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-19498:
-

 Summary: Discussion: Making MLlib APIs extensible for 3rd party 
libraries
 Key: SPARK-19498
 URL: https://issues.apache.org/jira/browse/SPARK-19498
 Project: Spark
  Issue Type: Brainstorming
  Components: ML
Affects Versions: 2.2.0
Reporter: Joseph K. Bradley
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7258) spark.ml API taking Graph instead of DataFrame

2017-02-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-7258.

Resolution: Won't Fix

> spark.ml API taking Graph instead of DataFrame
> --
>
> Key: SPARK-7258
> URL: https://issues.apache.org/jira/browse/SPARK-7258
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, ML
>Reporter: Joseph K. Bradley
>
> It would be useful to have an API in ML Pipelines for working with Graphs, 
> not just DataFrames.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7258) spark.ml API taking Graph instead of DataFrame

2017-02-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856318#comment-15856318
 ] 

Joseph K. Bradley commented on SPARK-7258:
--

This made more sense in the past...will close for now.

> spark.ml API taking Graph instead of DataFrame
> --
>
> Key: SPARK-7258
> URL: https://issues.apache.org/jira/browse/SPARK-7258
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, ML
>Reporter: Joseph K. Bradley
>
> It would be useful to have an API in ML Pipelines for working with Graphs, 
> not just DataFrames.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19467) PySpark ML shouldn't use circular imports

2017-02-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-19467:
-

Assignee: Maciej Szymkiewicz

> PySpark ML shouldn't use circular imports
> -
>
> Key: SPARK-19467
> URL: https://issues.apache.org/jira/browse/SPARK-19467
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 2.2.0
>
>
> {{pyspark.ml}} and {{pyspark.ml.pipeline}} contain circular imports with the 
> [former 
> one|https://github.com/apache/spark/blob/39f328ba3519b01940a7d1cdee851ba4e75ef31f/python/pyspark/ml/__init__.py#L23]:
> {code}
> from pyspark.ml.pipeline import Pipeline, PipelineModel
> {code}
> and the [latter 
> one|https://github.com/apache/spark/blob/39f328ba3519b01940a7d1cdee851ba4e75ef31f/python/pyspark/ml/pipeline.py#L24]:
> {code}
> from pyspark.ml import Estimator, Model, Transformer
> {code}
> This is unnecessary and can cause failures when working with external tools.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19467) PySpark ML shouldn't use circular imports

2017-02-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-19467.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16814
[https://github.com/apache/spark/pull/16814]

> PySpark ML shouldn't use circular imports
> -
>
> Key: SPARK-19467
> URL: https://issues.apache.org/jira/browse/SPARK-19467
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 2.2.0
>
>
> {{pyspark.ml}} and {{pyspark.ml.pipeline}} contain circular imports with the 
> [former 
> one|https://github.com/apache/spark/blob/39f328ba3519b01940a7d1cdee851ba4e75ef31f/python/pyspark/ml/__init__.py#L23]:
> {code}
> from pyspark.ml.pipeline import Pipeline, PipelineModel
> {code}
> and the [latter 
> one|https://github.com/apache/spark/blob/39f328ba3519b01940a7d1cdee851ba4e75ef31f/python/pyspark/ml/pipeline.py#L24]:
> {code}
> from pyspark.ml import Estimator, Model, Transformer
> {code}
> This is unnecessary and can cause failures when working with external tools.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15573) Backwards-compatible persistence for spark.ml

2017-02-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15855097#comment-15855097
 ] 

Joseph K. Bradley commented on SPARK-15573:
---

It's a good point that we can't make updates to older Spark releases for 
persistence.  However, I doubt that we would backport many such fixes for 
non-bugs.  The issue you reference is arguably a scalability limit, not a bug.  
Still, adding an internal ML persistence version is a good idea; I'd be OK with 
it.

> Backwards-compatible persistence for spark.ml
> -
>
> Key: SPARK-15573
> URL: https://issues.apache.org/jira/browse/SPARK-15573
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA is for imposing backwards-compatible persistence for the 
> DataFrames-based API for MLlib.  I.e., we want to be able to load models 
> saved in previous versions of Spark.  We will not require loading models 
> saved in later versions of Spark.
> This requires:
> * Putting unit tests in place to check loading models from previous versions
> * Notifying all committers active on MLlib to be aware of this requirement in 
> the future
> The unit tests could be written as in spark.mllib, where we essentially 
> copied and pasted the save() code every time it changed.  This happens 
> rarely, so it should be acceptable, though other designs are fine.
> Subtasks of this JIRA should cover checking and adding tests for existing 
> cases, such as KMeansModel (whose format changed between 1.6 and 2.0).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15855090#comment-15855090
 ] 

Joseph K. Bradley commented on SPARK-19208:
---

You're right that sharing intermediate results will be necessary.

I'm happy with [~mlnick]'s VectorSummarizer API.  I also think that, if we 
wanted to use the API I suggested above, the version returning a single struct 
col would work: {{df.select(VectorSummary.summary("features", "weights"))}}.  
The new column could be constructed from intermediate columns which would not 
show up in the final output.  (Is this essentially the "private UDAF" 
[~podongfeng] is mentioning above?)

I'm OK either way.
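
As a rough illustration of that single-struct-column output shape, using only existing 
SQL functions on a plain Double column (VectorSummary itself is still hypothetical):
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, mean, min, struct}

val spark = SparkSession.builder().master("local[*]").appName("summary-sketch").getOrCreate()
import spark.implicits._

val df = Seq(1.0, 2.0, 3.0).toDF("x")

// All statistics are packed into one struct column; the individual aggregates
// never appear as separate columns in the output.
val summary = df.agg(
  struct(min("x").as("min"), max("x").as("max"), mean("x").as("mean")).as("summary"))

summary.printSchema()  // one column "summary" of struct<min:double, max:double, mean:double>
{code}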

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} will also compute extra unused
> statistics. This slows down the task and makes it more likely to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs

2017-02-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15855060#comment-15855060
 ] 

Joseph K. Bradley commented on SPARK-12157:
---

I don't know of any Python UDF perf tests.  Ad hoc tests could suffice for 
now...

> Support numpy types as return values of Python UDFs
> ---
>
> Key: SPARK-12157
> URL: https://issues.apache.org/jira/browse/SPARK-12157
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
>Reporter: Justin Uang
>
> Currently, if I have a python UDF
> {code}
> import pyspark.sql.types as T
> import pyspark.sql.functions as F
> from pyspark.sql import Row
> import numpy as np
> argmax = F.udf(lambda x: np.argmax(x), T.IntegerType())
> df = sqlContext.createDataFrame([Row(array=[1,2,3])])
> df.select(argmax("array")).count()
> {code}
> I get an exception that is fairly opaque:
> {code}
> Caused by: net.razorvine.pickle.PickleException: expected zero arguments for 
> construction of ClassDict (for numpy.dtype)
> at 
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403)
> {code}
> Numpy types like np.int and np.float64 should automatically be cast to the 
> proper dtypes.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16824) Add API docs for VectorUDT

2017-02-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15855057#comment-15855057
 ] 

Joseph K. Bradley commented on SPARK-16824:
---

I think we didn't document it since the future of UDTs becoming public APIs was 
uncertain.  VectorUDT is private in spark.ml.  Still, adding docs for the 
public spark.mllib VectorUDT sounds good to me.

> Add API docs for VectorUDT
> --
>
> Key: SPARK-16824
> URL: https://issues.apache.org/jira/browse/SPARK-16824
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following on the [discussion 
> here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153],
>  it appears that {{VectorUDT}} is missing documentation, at least in PySpark. 
> I'm not sure if this is intentional or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16824) Add API docs for VectorUDT

2017-02-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-16824:
--
Issue Type: Documentation  (was: Improvement)

> Add API docs for VectorUDT
> --
>
> Key: SPARK-16824
> URL: https://issues.apache.org/jira/browse/SPARK-16824
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following on the [discussion 
> here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153],
>  it appears that {{VectorUDT}} is missing documentation, at least in PySpark. 
> I'm not sure if this is intentional or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16824) Add API docs for VectorUDT

2017-02-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-16824:
--
Component/s: MLlib

> Add API docs for VectorUDT
> --
>
> Key: SPARK-16824
> URL: https://issues.apache.org/jira/browse/SPARK-16824
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following on the [discussion 
> here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153],
>  it appears that {{VectorUDT}} is missing documentation, at least in PySpark. 
> I'm not sure if this is intentional or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19247) Improve ml word2vec save/load scalability

2017-02-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-19247.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16607
[https://github.com/apache/spark/pull/16607]

> Improve ml word2vec save/load scalability
> -
>
> Key: SPARK-19247
> URL: https://issues.apache.org/jira/browse/SPARK-19247
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Asher Krim
>Assignee: Asher Krim
> Fix For: 2.2.0
>
>
> ml word2vec models can be somewhat large (~4gb is not uncommon). The current 
> save implementation saves the model as a single large datum, which can cause 
> rpc issues and fail to save the model.
> On the loading side, there are issues with loading this large datum as well. 
> This was already solved for mllib word2vec in 
> https://issues.apache.org/jira/browse/SPARK-11994, but the change was never 
> ported to the ml word2vec implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19247) Improve ml word2vec save/load scalability

2017-02-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-19247:
-

Assignee: Asher Krim

> Improve ml word2vec save/load scalability
> -
>
> Key: SPARK-19247
> URL: https://issues.apache.org/jira/browse/SPARK-19247
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Asher Krim
>Assignee: Asher Krim
>
> ml word2vec models can be somewhat large (~4gb is not uncommon). The current 
> save implementation saves the model as a single large datum, which can cause 
> rpc issues and fail to save the model.
> On the loading side, there are issues with loading this large datum as well. 
> This was already solved for mllib word2vec in 
> https://issues.apache.org/jira/browse/SPARK-11994, but the change was never 
> ported to the ml word2vec implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19247) Improve ml word2vec save/load scalability

2017-02-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19247:
--
 Shepherd: Joseph K. Bradley
Affects Version/s: 2.2.0
 Target Version/s: 2.2.0

> Improve ml word2vec save/load scalability
> -
>
> Key: SPARK-19247
> URL: https://issues.apache.org/jira/browse/SPARK-19247
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Asher Krim
>
> ml word2vec models can be somewhat large (~4gb is not uncommon). The current 
> save implementation saves the model as a single large datum, which can cause 
> rpc issues and fail to save the model.
> On the loading side, there are issues with loading this large datum as well. 
> This was already solved for mllib word2vec in 
> https://issues.apache.org/jira/browse/SPARK-11994, but the change was never 
> ported to the ml word2vec implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14503) spark.ml Scala API for FPGrowth

2017-02-02 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850707#comment-15850707
 ] 

Joseph K. Bradley commented on SPARK-14503:
---

Sounds good, thank you!

> spark.ml Scala API for FPGrowth
> ---
>
> Key: SPARK-14503
> URL: https://issues.apache.org/jira/browse/SPARK-14503
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19389) Minor doc fixes, including Since tags in Python Params

2017-02-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-19389.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16723
[https://github.com/apache/spark/pull/16723]

> Minor doc fixes, including Since tags in Python Params
> --
>
> Key: SPARK-19389
> URL: https://issues.apache.org/jira/browse/SPARK-19389
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
> Fix For: 2.2.0
>
>
> I spotted some doc issues, mainly in Python, when reviewing [SPARK-19336] 
> which were not related to that PR.  This PR fixes them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14503) spark.ml Scala API for FPGrowth

2017-02-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848747#comment-15848747
 ] 

Joseph K. Bradley commented on SPARK-14503:
---

There are a couple of design issues which have been mentioned in either the 
design doc or the PR, but which should probably be discussed in more detail in 
JIRA:
* Item type: It looks like this currently assumes every item is represented as 
a String.  I'd like us to support any Catalyst type.  If that's hard to do 
(until we port the implementation over to DataFrames), then just supporting 
String is OK as long as it's clearly documented.
* FPGrowth vs AssociationRules: The APIs are a bit fuzzy right now.  I’ve 
listed them out below.  The problem is that AssociationRules is tightly tied to 
FPGrowth.  While I like the idea of being able to use AssociationRules to 
analyze the output of multiple FPM algorithms, I don’t think it’s applicable to 
PrefixSpan since it does not take the ordering of the itemsets into account.  
I’d propose we provide a single API under the name "FPGrowth."
** Q: Have you heard of anyone needing the AssociationRules API without going 
through FPGrowth first?  If so, then we could expose the AssociationRules 
algorithm as a @DeveloperApi static method.

What does everyone think?

Current APIs
* FPGrowth
** Input to fit() and transform(): Seq(items)
** Output
*** transform(): —> same as AssociationRules
*** getFreqItems: {code}DataFrame["items", "freq"]{code}

* AssociationRules
** Input
*** fit(): (output of FPGrowth)
*** transform(): Seq(items) —> Not good that fit/transform take 
different inputs
** Output
*** transform(): predicted items for each Seq(items)
*** associationRules: {code}DataFrame["antecedent", "consequent", 
"confidence"]{code}

Proposal: Combine under FPGrowth
* FPGrowth
** Input to fit() and transform(): Seq(items)
** Output
*** transform(): predicted items for each Seq(items)
*** getFreqItems: {code}DataFrame["items", "freq"]{code}
*** associationRules: {code}DataFrame["antecedent", "consequent", 
"confidence"]{code}


> spark.ml Scala API for FPGrowth
> ---
>
> Key: SPARK-14503
> URL: https://issues.apache.org/jira/browse/SPARK-14503
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()

2017-01-31 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847786#comment-15847786
 ] 

Joseph K. Bradley edited comment on SPARK-12965 at 2/1/17 12:43 AM:


I'd say this is both a SQL and MLlib issue, where the MLlib issue is blocked by 
the SQL one.
* SQL: {{schema}} handles periods/quotes inconsistently relative to the rest of 
the Dataset API
* ML: StringIndexer could avoid using schema.fieldNames and instead use an API 
provided by StructType for checking for the existence of a field.  That said, 
that API needs to be added to StructType...

I'm going to update this issue to be for ML only to handle fixing StringIndexer 
and link to a separate JIRA for the SQL issue.
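
To sketch what such a check could look like in the meantime (the backtick-unquoting 
rule here is an assumption, not existing StructType behavior):
{code}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper: treat a backtick-quoted name as a literal field name,
// mirroring how Dataset.col resolves `a.b`. Not an existing StructType API.
def hasField(schema: StructType, colName: String): Boolean = {
  val name =
    if (colName.startsWith("`") && colName.endsWith("`") && colName.length > 1) {
      colName.substring(1, colName.length - 1)
    } else {
      colName
    }
  schema.fieldNames.contains(name)
}

val schema = StructType(Seq(StructField("a.b", StringType)))
hasField(schema, "`a.b`")  // true
hasField(schema, "a.b")    // true (no nested-field resolution in this sketch)
{code}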



was (Author: josephkb):
I'd say this is both a SQL and MLlib issue.

I'm going to update this issue to be for ML only to handle fixing StringIndexer 
and link to a separate JIRA for the SQL issue.

> Indexer setInputCol() doesn't resolve column names like DataFrame.col()
> ---
>
> Key: SPARK-12965
> URL: https://issues.apache.org/jira/browse/SPARK-12965
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Joshua Taylor
> Attachments: SparkMLDotColumn.java
>
>
> The setInputCol() method doesn't seem to resolve column names in the same way 
> that other methods do.  E.g., Given a DataFrame df, {{df.col("`a.b`")}} will 
> return a column.  On a StringIndexer indexer, 
> {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting 
> and transforming seem to have no effect.  Running the following code produces:
> {noformat}
> +---+---+--------+
> |a.b|a_b|a_bIndex|
> +---+---+--------+
> |foo|foo|     0.0|
> |bar|bar|     1.0|
> +---+---+--------+
> {noformat}
> but I think it should have another column, {{abIndex}} with the same contents 
> as a_bIndex.
> {code}
> public class SparkMLDotColumn {
>   public static void main(String[] args) {
>   // Get the contexts
>   SparkConf conf = new SparkConf()
>   .setMaster("local[*]")
>   .setAppName("test")
>   .set("spark.ui.enabled", "false");
>   JavaSparkContext sparkContext = new JavaSparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sparkContext);
>   
>   // Create a schema with a single string column named "a.b"
>   StructType schema = new StructType(new StructField[] {
>   DataTypes.createStructField("a.b", 
> DataTypes.StringType, false)
>   });
>   // Create an empty RDD and DataFrame
>   List<Row> rows = Arrays.asList(RowFactory.create("foo"), 
> RowFactory.create("bar")); 
>   JavaRDD<Row> rdd = sparkContext.parallelize(rows);
>   DataFrame df = sqlContext.createDataFrame(rdd, schema);
>   
>   df = df.withColumn("a_b", df.col("`a.b`"));
>   
>   StringIndexer indexer0 = new StringIndexer();
>   indexer0.setInputCol("a_b");
>   indexer0.setOutputCol("a_bIndex");
>   df = indexer0.fit(df).transform(df);
>   
>   StringIndexer indexer1 = new StringIndexer();
>   indexer1.setInputCol("`a.b`");
>   indexer1.setOutputCol("abIndex");
>   df = indexer1.fit(df).transform(df);
>   
>   df.show();
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()

2017-01-31 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847786#comment-15847786
 ] 

Joseph K. Bradley commented on SPARK-12965:
---

I'd say this is both a SQL and MLlib issue.

I'm going to update this issue to be for ML only to handle fixing StringIndexer 
and link to a separate JIRA for the SQL issue.

> Indexer setInputCol() doesn't resolve column names like DataFrame.col()
> ---
>
> Key: SPARK-12965
> URL: https://issues.apache.org/jira/browse/SPARK-12965
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Joshua Taylor
> Attachments: SparkMLDotColumn.java
>
>
> The setInputCol() method doesn't seem to resolve column names in the same way 
> that other methods do.  E.g., Given a DataFrame df, {{df.col("`a.b`")}} will 
> return a column.  On a StringIndexer indexer, 
> {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting 
> and transforming seem to have no effect.  Running the following code produces:
> {noformat}
> +---+---+--------+
> |a.b|a_b|a_bIndex|
> +---+---+--------+
> |foo|foo|     0.0|
> |bar|bar|     1.0|
> +---+---+--------+
> {noformat}
> but I think it should have another column, {{abIndex}} with the same contents 
> as a_bIndex.
> {code}
> public class SparkMLDotColumn {
>   public static void main(String[] args) {
>   // Get the contexts
>   SparkConf conf = new SparkConf()
>   .setMaster("local[*]")
>   .setAppName("test")
>   .set("spark.ui.enabled", "false");
>   JavaSparkContext sparkContext = new JavaSparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sparkContext);
>   
>   // Create a schema with a single string column named "a.b"
>   StructType schema = new StructType(new StructField[] {
>   DataTypes.createStructField("a.b", 
> DataTypes.StringType, false)
>   });
>   // Create an empty RDD and DataFrame
>   List<Row> rows = Arrays.asList(RowFactory.create("foo"), 
> RowFactory.create("bar")); 
>   JavaRDD<Row> rdd = sparkContext.parallelize(rows);
>   DataFrame df = sqlContext.createDataFrame(rdd, schema);
>   
>   df = df.withColumn("a_b", df.col("`a.b`"));
>   
>   StringIndexer indexer0 = new StringIndexer();
>   indexer0.setInputCol("a_b");
>   indexer0.setOutputCol("a_bIndex");
>   df = indexer0.fit(df).transform(df);
>   
>   StringIndexer indexer1 = new StringIndexer();
>   indexer1.setInputCol("`a.b`");
>   indexer1.setOutputCol("abIndex");
>   df = indexer1.fit(df).transform(df);
>   
>   df.show();
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()

2017-01-31 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12965:
--
Affects Version/s: (was: 1.6.0)
   2.2.0
   1.6.3
   2.0.2
   2.1.0

> Indexer setInputCol() doesn't resolve column names like DataFrame.col()
> ---
>
> Key: SPARK-12965
> URL: https://issues.apache.org/jira/browse/SPARK-12965
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Joshua Taylor
> Attachments: SparkMLDotColumn.java
>
>
> The setInputCol() method doesn't seem to resolve column names in the same way 
> that other methods do.  E.g., Given a DataFrame df, {{df.col("`a.b`")}} will 
> return a column.  On a StringIndexer indexer, 
> {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting 
> and transforming seem to have no effect.  Running the following code produces:
> {noformat}
> +---+---+--------+
> |a.b|a_b|a_bIndex|
> +---+---+--------+
> |foo|foo|     0.0|
> |bar|bar|     1.0|
> +---+---+--------+
> {noformat}
> but I think it should have another column, {{abIndex}} with the same contents 
> as a_bIndex.
> {code}
> public class SparkMLDotColumn {
>   public static void main(String[] args) {
>   // Get the contexts
>   SparkConf conf = new SparkConf()
>   .setMaster("local[*]")
>   .setAppName("test")
>   .set("spark.ui.enabled", "false");
>   JavaSparkContext sparkContext = new JavaSparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sparkContext);
>   
>   // Create a schema with a single string column named "a.b"
>   StructType schema = new StructType(new StructField[] {
>   DataTypes.createStructField("a.b", 
> DataTypes.StringType, false)
>   });
>   // Create an empty RDD and DataFrame
>   List<Row> rows = Arrays.asList(RowFactory.create("foo"), 
> RowFactory.create("bar")); 
>   JavaRDD<Row> rdd = sparkContext.parallelize(rows);
>   DataFrame df = sqlContext.createDataFrame(rdd, schema);
>   
>   df = df.withColumn("a_b", df.col("`a.b`"));
>   
>   StringIndexer indexer0 = new StringIndexer();
>   indexer0.setInputCol("a_b");
>   indexer0.setOutputCol("a_bIndex");
>   df = indexer0.fit(df).transform(df);
>   
>   StringIndexer indexer1 = new StringIndexer();
>   indexer1.setInputCol("`a.b`");
>   indexer1.setOutputCol("abIndex");
>   df = indexer1.fit(df).transform(df);
>   
>   df.show();
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()

2017-01-31 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12965:
--
Component/s: (was: Spark Core)

> Indexer setInputCol() doesn't resolve column names like DataFrame.col()
> ---
>
> Key: SPARK-12965
> URL: https://issues.apache.org/jira/browse/SPARK-12965
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Joshua Taylor
> Attachments: SparkMLDotColumn.java
>
>
> The setInputCol() method doesn't seem to resolve column names in the same way 
> that other methods do.  E.g., Given a DataFrame df, {{df.col("`a.b`")}} will 
> return a column.  On a StringIndexer indexer, 
> {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting 
> and transforming seem to have no effect.  Running the following code produces:
> {noformat}
> +---+---+--------+
> |a.b|a_b|a_bIndex|
> +---+---+--------+
> |foo|foo|     0.0|
> |bar|bar|     1.0|
> +---+---+--------+
> {noformat}
> but I think it should have another column, {{abIndex}} with the same contents 
> as a_bIndex.
> {code}
> public class SparkMLDotColumn {
>   public static void main(String[] args) {
>   // Get the contexts
>   SparkConf conf = new SparkConf()
>   .setMaster("local[*]")
>   .setAppName("test")
>   .set("spark.ui.enabled", "false");
>   JavaSparkContext sparkContext = new JavaSparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sparkContext);
>   
>   // Create a schema with a single string column named "a.b"
>   StructType schema = new StructType(new StructField[] {
>   DataTypes.createStructField("a.b", 
> DataTypes.StringType, false)
>   });
>   // Create an empty RDD and DataFrame
>   List<Row> rows = Arrays.asList(RowFactory.create("foo"), 
> RowFactory.create("bar")); 
>   JavaRDD<Row> rdd = sparkContext.parallelize(rows);
>   DataFrame df = sqlContext.createDataFrame(rdd, schema);
>   
>   df = df.withColumn("a_b", df.col("`a.b`"));
>   
>   StringIndexer indexer0 = new StringIndexer();
>   indexer0.setInputCol("a_b");
>   indexer0.setOutputCol("a_bIndex");
>   df = indexer0.fit(df).transform(df);
>   
>   StringIndexer indexer1 = new StringIndexer();
>   indexer1.setInputCol("`a.b`");
>   indexer1.setOutputCol("abIndex");
>   df = indexer1.fit(df).transform(df);
>   
>   df.show();
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19416) Dataset.schema is inconsistent with Dataset in handling columns with periods

2017-01-31 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-19416:
-

 Summary: Dataset.schema is inconsistent with Dataset in handling 
columns with periods
 Key: SPARK-19416
 URL: https://issues.apache.org/jira/browse/SPARK-19416
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2, 1.6.3, 2.2.0
Reporter: Joseph K. Bradley
Priority: Minor


When you have a DataFrame with a column with a period in its name, the API is 
inconsistent about how to quote the column name.

Here's a reproduction:
{code}
import org.apache.spark.sql.functions.col
val rows = Seq(
  ("foo", 1),
  ("bar", 2)
)
val df = spark.createDataFrame(rows).toDF("a.b", "id")
{code}

These methods are all consistent:
{code}
df.select("a.b") // fails
df.select("`a.b`") // succeeds
df.select(col("a.b")) // fails
df.select(col("`a.b`")) // succeeds
df("a.b") // fails
df("`a.b`") // succeeds
{code}

But {{schema}} is inconsistent:
{code}
df.schema("a.b") // succeeds
df.schema("`a.b`") // fails
{code}


"fails" produces error messages like:
{code}
org.apache.spark.sql.AnalysisException: cannot resolve '`a.b`' given input 
columns: [a.b, id];;
'Project ['a.b]
+- Project [_1#1511 AS a.b#1516, _2#1512 AS id#1517]
   +- LocalRelation [_1#1511, _2#1512]

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1121)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1139)
at 
line9667c6d14e79417280e5882aa52e0de727.$read$$iw$$iw$$iw$$iw.<init>(<console>:34)
at 
line9667c6d14e79417280e5882aa52e0de727.$read$$iw$$iw$$iw.<init>(<console>:41)
at 
line9667c6d14e79417280e5882aa52e0de727.$read$$iw$$iw.<init>(<console>:43)
at line9667c6d14e79417280e5882aa52e0de727.$read$$iw.<init>(<console>:45)
at 
line9667c6d14e79417280e5882aa52e0de727.$eval$.$print$lzycompute(<console>:7)
at line9667c6d14e79417280e5882aa52e0de727.$eval$.$print(<console>:6)
{code}

"succeeds" produces:
{code}
org.apache.spark.sql.DataFrame = [a.b: string

[jira] [Updated] (SPARK-19247) Improve ml word2vec save/load scalability

2017-01-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19247:
--
Component/s: ML

> Improve ml word2vec save/load scalability
> -
>
> Key: SPARK-19247
> URL: https://issues.apache.org/jira/browse/SPARK-19247
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Asher Krim
>
> ml word2vec models can be somewhat large (~4gb is not uncommon). The current 
> save implementation saves the model as a single large datum, which can cause 
> rpc issues and fail to save the model.
> On the loading side, there are issues with loading this large datum as well. 
> This was already solved for mllib word2vec in 
> https://issues.apache.org/jira/browse/SPARK-11994, but the change was never 
> ported to the ml word2vec implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19294) improve LocalLDAModel save/load scaling for large models

2017-01-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19294:
--
Component/s: ML

> improve LocalLDAModel save/load scaling for large models
> 
>
> Key: SPARK-19294
> URL: https://issues.apache.org/jira/browse/SPARK-19294
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Asher Krim
>
> The LDA model in ml has some of the same problems addressed by 
> https://issues.apache.org/jira/browse/SPARK-19247 for word2vec.
> An LDA model is on the order of `vocabSize` * `k`, which can easily reach 3 GB for 
> k=1000 and vocabSize=3m. It's currently saved as a single datum in 1 
> partition. 
> Instead, we should represent the matrix as a list, and use the logic from 
> https://issues.apache.org/jira/browse/SPARK-11994 to pick a reasonable number 
> of partitions.
> cc [~josephkb]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19247) Improve ml word2vec save/load scalability

2017-01-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19247:
--
Summary: Improve ml word2vec save/load scalability  (was: improve ml 
word2vec save/load)

> Improve ml word2vec save/load scalability
> -
>
> Key: SPARK-19247
> URL: https://issues.apache.org/jira/browse/SPARK-19247
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Asher Krim
>
> ml word2vec models can be somewhat large (~4gb is not uncommon). The current 
> save implementation saves the model as a single large datum, which can cause 
> rpc issues and fail to save the model.
> On the loading side, there are issues with loading this large datum as well. 
> This was already solved for mllib word2vec in 
> https://issues.apache.org/jira/browse/SPARK-11994, but the change was never 
> ported to the ml word2vec implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19247) Improve ml word2vec save/load scalability

2017-01-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19247:
--
Issue Type: Improvement  (was: Bug)

> Improve ml word2vec save/load scalability
> -
>
> Key: SPARK-19247
> URL: https://issues.apache.org/jira/browse/SPARK-19247
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Asher Krim
>
> ml word2vec models can be somewhat large (~4gb is not uncommon). The current 
> save implementation saves the model as a single large datum, which can cause 
> rpc issues and fail to save the model.
> On the loading side, there are issues with loading this large datum as well. 
> This was already solved for mllib word2vec in 
> https://issues.apache.org/jira/browse/SPARK-11994, but the change was never 
> ported to the ml word2vec implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19294) improve LocalLDAModel save/load scaling for large models

2017-01-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19294:
--
Summary: improve LocalLDAModel save/load scaling for large models  (was: 
improve ml LDA save/load)

> improve LocalLDAModel save/load scaling for large models
> 
>
> Key: SPARK-19294
> URL: https://issues.apache.org/jira/browse/SPARK-19294
> Project: Spark
>  Issue Type: Bug
>Reporter: Asher Krim
>
> The LDA model in ml has some of the same problems addressed by 
> https://issues.apache.org/jira/browse/SPARK-19247 for word2vec.
> An LDA model is on the order of `vocabSize` * `k`, which can easily reach 3 GB for 
> k=1000 and vocabSize=3m. It's currently saved as a single datum in 1 
> partition. 
> Instead, we should represent the matrix as a list, and use the logic from 
> https://issues.apache.org/jira/browse/SPARK-11994 to pick a reasonable number 
> of partitions.
> cc [~josephkb]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19294) improve LocalLDAModel save/load scaling for large models

2017-01-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19294:
--
Issue Type: Improvement  (was: Bug)

> improve LocalLDAModel save/load scaling for large models
> 
>
> Key: SPARK-19294
> URL: https://issues.apache.org/jira/browse/SPARK-19294
> Project: Spark
>  Issue Type: Improvement
>Reporter: Asher Krim
>
> The LDA model in ml has some of the same problems addressed by 
> https://issues.apache.org/jira/browse/SPARK-19247 for word2vec.
> An LDA model is on the order of `vocabSize` * `k`, which can easily reach 3 GB for 
> k=1000 and vocabSize=3m. It's currently saved as a single datum in 1 
> partition. 
> Instead, we should represent the matrix as a list, and use the logic from 
> https://issues.apache.org/jira/browse/SPARK-11994 to pick a reasonable number 
> of partitions.
> cc [~josephkb]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19389) Minor doc fixes, including Since tags in Python Params

2017-01-27 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-19389:
-

 Summary: Minor doc fixes, including Since tags in Python Params
 Key: SPARK-19389
 URL: https://issues.apache.org/jira/browse/SPARK-19389
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML, PySpark
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor


I spotted some doc issues, mainly in Python, when reviewing [SPARK-19336] which 
were not related to that PR.  This PR fixes them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19336) LinearSVC Python API

2017-01-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-19336.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16694
[https://github.com/apache/spark/pull/16694]

> LinearSVC Python API
> 
>
> Key: SPARK-19336
> URL: https://issues.apache.org/jira/browse/SPARK-19336
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Miao Wang
> Fix For: 2.2.0
>
>
> Create a Python wrapper for spark.ml.classification.LinearSVC



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-01-27 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843670#comment-15843670
 ] 

Joseph K. Bradley commented on SPARK-19208:
---

Thanks for writing out your ideas.  Here are my thoughts about the API:

*Reference API: Double column stats*
When working with Double columns (not Vectors), one would expect to write things 
like: {{myDataFrame.select(min("x"), max("x"))}} to select 2 stats, min and 
max.  Here, min and max are functions provided by Spark SQL which return 
columns.

*Analogy*
We should probably provide an analogous API.  Here's what I imagine:
{code}
import org.apache.spark.ml.stat.VectorSummary
val df: DataFrame = ...

val results: DataFrame = df.select(VectorSummary.min("features"), 
VectorSummary.mean("features"))
val weightedResults: DataFrame = df.select(VectorSummary.min("features"), 
VectorSummary.mean("features", "weight"))
// Both of these result DataFrames contain 2 Vector columns.
{code}

I.e., we provide vectorized versions of stats functions.

If you want to put everything into a single function, then we could also have 
VectorSummary have a function "summary" which returns a struct type with every 
stat available:
{code}
val results = df.select(VectorSummary.summary("features", "weights"))
// results DataFrame contains 1 struct column, which has a Vector field for 
every statistic we provide.
{code}

Note: I removed "online" from the name since the user does not need to know 
that it does online aggregation.

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. This slows down the task; moreover, it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748,401 instances and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-01-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19208:
--
Summary: MultivariateOnlineSummarizer performance optimization  (was: 
MultivariateOnlineSummarizer perfermence optimization)

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. This slows down the task; moreover, it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748,401 instances and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19382) Test sparse vectors in LinearSVCSuite

2017-01-26 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-19382:
-

 Summary: Test sparse vectors in LinearSVCSuite
 Key: SPARK-19382
 URL: https://issues.apache.org/jira/browse/SPARK-19382
 Project: Spark
  Issue Type: Test
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor


Currently, LinearSVCSuite does not test sparse vectors.  We should.  I 
recommend that generateSVMInput be modified to create a mix of dense and sparse 
vectors, rather than adding an additional test.
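
For example, a minimal sketch of what the mixing could look like (the exact shape of the existing generateSVMInput helper is an assumption; the hand-written points below only stand in for its output):
{code}
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors

// Hypothetical post-processing of generated points: convert every other
// instance's features to a sparse vector so both representations are covered.
def mixDenseAndSparse(points: Seq[LabeledPoint]): Seq[LabeledPoint] = {
  points.zipWithIndex.map { case (lp, i) =>
    if (i % 2 == 0) lp.copy(features = lp.features.toSparse) else lp
  }
}

val mixed = mixDenseAndSparse(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.5, 0.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 0.3))
))
{code}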



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18218) Optimize BlockMatrix multiplication, which may cause OOM and low parallelism usage problem in several cases

2017-01-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18218:
--
Shepherd: Burak Yavuz  (was: Yanbo Liang)

> Optimize BlockMatrix multiplication, which may cause OOM and low parallelism 
> usage problem in several cases
> ---
>
> Key: SPARK-18218
> URL: https://issues.apache.org/jira/browse/SPARK-18218
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> After taking a deep look into the `BlockMatrix.multiply` implementation, I found 
> that the current implementation may cause problems in special cases.
> Let me use an extreme case to illustrate it:
> Suppose we have two block matrices A and B.
> A has 1 blocks, numRowBlocks = 1,  numColBlocks = 1
> B also has 1 blocks, numRowBlocks = 1,  numColBlocks = 1
> Now if we call A.multiply(B), no matter how A and B are partitioned,
> the resultPartitioner will always contain only one partition.
> This multiplication implementation will shuffle 1 * 1 blocks into one 
> reducer, which drops the parallelism to 1. 
> What's worse, because `RDD.cogroup` loads the whole group into 
> memory, at the reducer side 1 * 1 blocks will be loaded into memory, 
> because they are all shuffled into the same group. This can easily cause 
> executor OOM.
> The above case is a little extreme, but in other cases, such as an M*N 
> matrix A multiplied by an N*P matrix B where N is much larger than M and 
> P, we hit a similar problem.
> The multiplication implementation does not handle task partitioning properly, 
> which causes:
> 1. when the middle dimension N is too large, reducer OOM.
> 2. even if OOM does not occur, the parallelism is still too low.
> 3. when N is much larger than M and P, and matrices A and B have many 
> partitions, too many partitions along the M and P dimensions, which leads to 
> a much larger shuffled data size.
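
To make the shape of the extreme case concrete, a minimal sketch that builds two block matrices whose product collapses into a single result block (the block count here is a small placeholder for the large values described above; `sc` is an assumed SparkContext):
{code}
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// A is 1 block tall and `numBlocks` blocks wide; B is `numBlocks` blocks tall and
// 1 block wide, so C = A * B is a single block with a single-partition partitioner.
val numBlocks = 4  // placeholder; the problem shows up when this is very large
val identity2x2 = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
val blocksA = sc.parallelize((0 until numBlocks).map(j => ((0, j), identity2x2)))
val blocksB = sc.parallelize((0 until numBlocks).map(i => ((i, 0), identity2x2)))
val A = new BlockMatrix(blocksA, 2, 2)
val B = new BlockMatrix(blocksB, 2, 2)

// All block pairs are cogrouped into the one reducer that builds C.
val C = A.multiply(B)
println(C.blocks.getNumPartitions)
{code}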



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries

2017-01-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840711#comment-15840711
 ] 

Joseph K. Bradley commented on SPARK-4638:
--

Commenting here b/c of the recent dev list thread: Non-linear kernels for SVMs 
in Spark would be great to have.  The main barriers are:
* Kernelized SVM training is hard to distribute.  Naive methods require a lot 
of communication.  To get this feature into Spark, we'd need to do proper 
background research and write up a good design.
* Other ML algorithms are arguably more in demand and still need improvements 
(as of the date of this comment).  Tree ensembles are first-and-foremost in my 
mind.

> Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to 
> find non linear boundaries
> ---
>
> Key: SPARK-4638
> URL: https://issues.apache.org/jira/browse/SPARK-4638
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: madankumar s
>  Labels: Gaussian, Kernels, SVM
> Attachments: kernels-1.3.patch
>
>
> SPARK MLlib Classification Module:
> Add Kernel functionalities to SVM Classifier to find non linear patterns



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18080) Locality Sensitive Hashing (LSH) Python API

2017-01-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18080:
--
Assignee: Yun Ni  (was: Yanbo Liang)

> Locality Sensitive Hashing (LSH) Python API
> ---
>
> Key: SPARK-18080
> URL: https://issues.apache.org/jira/browse/SPARK-18080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14804) Graph vertexRDD/EdgeRDD checkpoint results ClassCastException:

2017-01-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14804:
--
Assignee: Tathagata Das

> Graph vertexRDD/EdgeRDD checkpoint results ClassCastException: 
> ---
>
> Key: SPARK-14804
> URL: https://issues.apache.org/jira/browse/SPARK-14804
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.1
>Reporter: SuYan
>Assignee: Tathagata Das
>Priority: Minor
> Fix For: 2.0.3, 2.1.1, 3.0.0
>
>
> {code}
> graph3.vertices.checkpoint()
> graph3.vertices.count()
> graph3.vertices.map(_._2).count()
> {code}
> 16/04/21 21:04:43 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 
> (TID 13, localhost): java.lang.ClassCastException: 
> org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast to 
> scala.Tuple2
>   at 
> com.xiaomi.infra.codelab.spark.Graph2$$anonfun$main$1.apply(Graph2.scala:80)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:91)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:219)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> look at the code:
> {code}
>   private[spark] def computeOrReadCheckpoint(split: Partition, context: 
> TaskContext): Iterator[T] =
>   {
> if (isCheckpointedAndMaterialized) {
>   firstParent[T].iterator(split, context)
> } else {
>   compute(split, context)
> }
>   }
>  private[spark] def isCheckpointedAndMaterialized: Boolean = isCheckpointed
>  override def isCheckpointed: Boolean = {
>firstParent[(PartitionID, EdgePartition[ED, VD])].isCheckpointed
>  }
> {code}
> for VertexRDD or EdgeRDD, first parent is its partitionRDD  
> RDD[ShippableVertexPartition[VD]]/RDD[(PartitionID, EdgePartition[ED, VD])]
> 1. we call vertexRDD.checkpoint; its partitionRDD will checkpoint, so 
> VertexRDD.isCheckpointedAndMaterialized=true.
> 2. then we call vertexRDD.iterator; because checkpoint=true, it calls 
> firstParent.iterator (which is not a CheckpointRDD, but actually the partitionRDD). 
>  
> So the returned iterator is Iterator[ShippableVertexPartition], not the expected 
> Iterator[(VertexId, VD)].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2017-01-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838995#comment-15838995
 ] 

Joseph K. Bradley commented on SPARK-17975:
---

[SPARK-14804] was just fixed.  [~jvstein], do you have time to test master with 
your code to see if the bug you hit is fixed?  Thanks!

> EMLDAOptimizer fails with ClassCastException on YARN
> 
>
> Key: SPARK-17975
> URL: https://issues.apache.org/jira/browse/SPARK-17975
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Centos 6, CDH 5.7, Java 1.7u80
>Reporter: Jeff Stein
> Attachments: docs.txt
>
>
> I'm able to reproduce the error consistently with a 2000 record text file 
> with each record having 1-5 terms and checkpointing enabled. It looks like 
> the problem was introduced with the resolution for SPARK-13355.
> The EdgeRDD class seems to be lying about its type in a way that causes the 
> RDD.mapPartitionsWithIndex method to be unusable when it is referenced as an 
> RDD of Edge elements.
> {code}
> val spark = SparkSession.builder.appName("lda").getOrCreate()
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
> val data: RDD[(Long, Vector)] = // snip
> data.setName("data").cache()
> val lda = new LDA
> val optimizer = new EMLDAOptimizer
> lda.setOptimizer(optimizer)
>   .setK(10)
>   .setMaxIterations(400)
>   .setAlpha(-1)
>   .setBeta(-1)
>   .setCheckpointInterval(7)
> val ldaModel = lda.run(data)
> {code}
> {noformat}
> 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID 
> 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be 
> cast to org.apache.spark.graphx.Edge
>   at 
> org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-17265) EdgeRDD Difference throws an exception

2017-01-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838990#comment-15838990
 ] 

Joseph K. Bradley commented on SPARK-17265:
---

[SPARK-14804] was just fixed.  [~shishir167], are you able to test with the 
master branch to see if this issue is fixed?

> EdgeRDD Difference throws an exception
> --
>
> Key: SPARK-17265
> URL: https://issues.apache.org/jira/browse/SPARK-17265
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: windows, ubuntu
>Reporter: Shishir Kharel
>
> Subtracting two edge RDDs throws an exception.
> val difference = graph1.edges.subtract(graph2.edges)
> gives
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 1.0 in stage 0.0 (TID 1, localhost): java.lang.ClassCastException: 
> org.apache.spark.graphx.Edge cannot be cast to scala.Tuple2
> at 
> org.apache.spark.rdd.RDD$$anonfun$subtract$3$$anon$3.getPartition(RDD.scala:968)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:152)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17265) EdgeRDD Difference throws an exception

2017-01-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838990#comment-15838990
 ] 

Joseph K. Bradley edited comment on SPARK-17265 at 1/26/17 1:41 AM:


[SPARK-14804] was just fixed.  [~shishir167], are you able to test with the 
master branch to see if the bug you hit is fixed?


was (Author: josephkb):
[SPARK-14804] was just fixed.  [~shishir167], are you able to test with the 
master branch to see if this issue is fixed?

> EdgeRDD Difference throws an exception
> --
>
> Key: SPARK-17265
> URL: https://issues.apache.org/jira/browse/SPARK-17265
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: windows, ubuntu
>Reporter: Shishir Kharel
>
> Subtracting two edge RDDs throws an exception.
> val difference = graph1.edges.subtract(graph2.edges)
> gives
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 1.0 in stage 0.0 (TID 1, localhost): java.lang.ClassCastException: 
> org.apache.spark.graphx.Edge cannot be cast to scala.Tuple2
> at 
> org.apache.spark.rdd.RDD$$anonfun$subtract$3$$anon$3.getPartition(RDD.scala:968)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:152)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2017-01-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838993#comment-15838993
 ] 

Joseph K. Bradley commented on SPARK-17877:
---

[SPARK-14804] was just fixed.  [~apivovarov], are you able to test with the 
master branch to see if the bug you hit is fixed?

> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint   // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed  // res5: Boolean = false,   /tmp/check still contains only 
> 1 folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18036) Decision Trees do not handle edge cases

2017-01-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18036.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16377
[https://github.com/apache/spark/pull/16377]

> Decision Trees do not handle edge cases
> ---
>
> Key: SPARK-18036
> URL: https://issues.apache.org/jira/browse/SPARK-18036
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Assignee: Ilya Matiach
>Priority: Minor
> Fix For: 2.2.0
>
>
> Decision trees/GBT/RF do not handle edge cases such as constant features or 
> empty features. For example:
> {code}
> val dt = new DecisionTreeRegressor()
> val data = Seq(LabeledPoint(1.0, Vectors.dense(Array.empty[Double]))).toDF()
> dt.fit(data)
> java.lang.UnsupportedOperationException: empty.max
>   at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
>   at scala.collection.mutable.ArrayOps$ofInt.max(ArrayOps.scala:234)
>   at 
> org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:207)
>   at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105)
>   at 
> org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:93)
>   at 
> org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:46)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   ... 52 elided
> {code}
> as well as 
> {code}
> val dt = new DecisionTreeRegressor()
> val data = Seq(LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0))).toDF()
> dt.fit(data)
> java.lang.UnsupportedOperationException: empty.maxBy
> at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236)
> at 
> scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37)
> at 
> org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:846)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18036) Decision Trees do not handle edge cases

2017-01-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18036:
--
Assignee: Ilya Matiach

> Decision Trees do not handle edge cases
> ---
>
> Key: SPARK-18036
> URL: https://issues.apache.org/jira/browse/SPARK-18036
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Assignee: Ilya Matiach
>Priority: Minor
>
> Decision trees/GBT/RF do not handle edge cases such as constant features or 
> empty features. For example:
> {code}
> val dt = new DecisionTreeRegressor()
> val data = Seq(LabeledPoint(1.0, Vectors.dense(Array.empty[Double]))).toDF()
> dt.fit(data)
> java.lang.UnsupportedOperationException: empty.max
>   at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
>   at scala.collection.mutable.ArrayOps$ofInt.max(ArrayOps.scala:234)
>   at 
> org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:207)
>   at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105)
>   at 
> org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:93)
>   at 
> org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:46)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   ... 52 elided
> {code}
> as well as 
> {code}
> val dt = new DecisionTreeRegressor()
> val data = Seq(LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0))).toDF()
> dt.fit(data)
> java.lang.UnsupportedOperationException: empty.maxBy
> at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236)
> at 
> scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37)
> at 
> org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:846)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-18 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829014#comment-15829014
 ] 

Joseph K. Bradley commented on SPARK-19208:
---

+1 for [~mlnick]'s suggestion.  If we're optimizing performance for 
summarizers, let's use DataFrame ops.  With the right design, we could have all 
current statistics represented as columns in the result, but only materialize 
the ones we need.

> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. This slows down the task; moreover, it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748,401 instances and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19208) MaxAbsScaler and MinMaxScaler are very inefficient

2017-01-18 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19208:
--
Assignee: (was: Apache Spark)

> MaxAbsScaler and MinMaxScaler are very inefficient
> --
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. This slows down the task; moreover, it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748,401 instances and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14975) Predicted Probability per training instance for Gradient Boosted Trees

2017-01-18 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14975:
--
Summary: Predicted Probability per training instance for Gradient Boosted 
Trees  (was: Predicted Probability per training instance for Gradient Boosted 
Trees in mllib. )

> Predicted Probability per training instance for Gradient Boosted Trees
> --
>
> Key: SPARK-14975
> URL: https://issues.apache.org/jira/browse/SPARK-14975
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Partha Talukder
>Assignee: Ilya Matiach
>Priority: Minor
>  Labels: mllib
> Fix For: 2.2.0
>
>
> This function is available for Logistic Regression, SVM, etc. 
> (model.setThreshold()) but not for GBT.  In comparison, in the "gbm" package in 
> R we can specify the distribution and get predicted probabilities or classes. 
> I understand that this algorithm works with "Classification" and "Regression" 
> algorithms. Is there any way in GBT to get predicted probabilities or provide 
> thresholds to the model?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14975) Predicted Probability per training instance for Gradient Boosted Trees

2017-01-18 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14975.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16441
[https://github.com/apache/spark/pull/16441]

> Predicted Probability per training instance for Gradient Boosted Trees
> --
>
> Key: SPARK-14975
> URL: https://issues.apache.org/jira/browse/SPARK-14975
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Partha Talukder
>Assignee: Ilya Matiach
>Priority: Minor
>  Labels: mllib
> Fix For: 2.2.0
>
>
> This function is available for Logistic Regression, SVM, etc. 
> (model.setThreshold()) but not for GBT.  In comparison, in the "gbm" package in 
> R we can specify the distribution and get predicted probabilities or classes. 
> I understand that this algorithm works with "Classification" and "Regression" 
> algorithms. Is there any way in GBT to get predicted probabilities or provide 
> thresholds to the model?
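
For illustration, a minimal usage sketch of how per-instance probabilities could be read off once GBT is probabilistic (trainingDF and testDF are assumed DataFrames with "label" and "features" columns; this is not an excerpt from the PR):
{code}
import org.apache.spark.ml.classification.GBTClassifier

val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(10)

val model = gbt.fit(trainingDF)

// With probabilistic output, the transformed data carries a "probability"
// column alongside "prediction".
model.transform(testDF)
  .select("prediction", "probability")
  .show(5)
{code}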



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8855) Python API for Association Rules

2017-01-18 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-8855.

Resolution: Won't Fix

> Python API for Association Rules
> 
>
> Key: SPARK-8855
> URL: https://issues.apache.org/jira/browse/SPARK-8855
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Feynman Liang
>Priority: Minor
>
> A simple Python wrapper and doctests need to be written for Association 
> Rules. The relevant method is {{FPGrowthModel.generateAssociationRules}}. The 
> code will likely live in {{fpm.py}}
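
For context, a minimal Scala sketch of the existing spark.mllib method that a wrapper would have exposed (`sc` is an assumed SparkContext; the transactions are made-up sample data):
{code}
import org.apache.spark.mllib.fpm.FPGrowth

val transactions = sc.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "b"),
  Array("a", "c")
))

val model = new FPGrowth()
  .setMinSupport(0.5)
  .setNumPartitions(2)
  .run(transactions)

// This is the method the Python wrapper would need to expose.
model.generateAssociationRules(0.8).collect().foreach { rule =>
  println(rule.antecedent.mkString("[", ",", "]") + " => " +
    rule.consequent.mkString("[", ",", "]") + ", " + rule.confidence)
}
{code}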



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19281) spark.ml Python API for FPGrowth

2017-01-18 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-19281:
-

 Summary: spark.ml Python API for FPGrowth
 Key: SPARK-19281
 URL: https://issues.apache.org/jira/browse/SPARK-19281
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Joseph K. Bradley


See parent issue.  This is for a Python API *after* the Scala API has been 
designed and implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8855) Python API for Association Rules

2017-01-18 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828808#comment-15828808
 ] 

Joseph K. Bradley commented on SPARK-8855:
--

I'm going to close this issue in favor of the DataFrame-based API in 
[SPARK-14501]

> Python API for Association Rules
> 
>
> Key: SPARK-8855
> URL: https://issues.apache.org/jira/browse/SPARK-8855
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Feynman Liang
>Priority: Minor
>
> A simple Python wrapper and doctests need to be written for Association 
> Rules. The relevant method is {{FPGrowthModel.generateAssociationRules}}. The 
> code will likely live in {{fpm.py}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14503) spark.ml Scala API for FPGrowth

2017-01-18 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14503:
--
Summary: spark.ml Scala API for FPGrowth  (was: spark.ml API for FPGrowth)

> spark.ml Scala API for FPGrowth
> ---
>
> Key: SPARK-14503
> URL: https://issues.apache.org/jira/browse/SPARK-14503
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms

2017-01-18 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828803#comment-15828803
 ] 

Joseph K. Bradley commented on SPARK-17136:
---

CC [~avulanov], who has thought a lot about these issues too

> Design optimizer interface for ML algorithms
> 
>
> Key: SPARK-17136
> URL: https://issues.apache.org/jira/browse/SPARK-17136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> We should consider designing an interface that allows users to use their own 
> optimizers in some of the ML algorithms, similar to MLlib. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5256) Improving MLlib optimization APIs

2017-01-18 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5256:
-
Description: 
*Goal*: Improve APIs for optimization

*Motivation*: There have been several disjoint mentions of improving the 
optimization APIs to make them more pluggable, extensible, etc.  This JIRA is a 
place to discuss what API changes are necessary for the long term, and to 
provide links to other relevant JIRAs.

Eventually, I hope this leads to a design doc outlining:
* current issues
* requirements such as supporting many types of objective functions, 
optimization algorithms, and parameters to those algorithms
* ideal API
* breakdown of smaller JIRAs needed to achieve that API


  was:
*Goal*: Improve APIs for optimization

*Motivation*: There have been several disjoint mentions of improving the 
optimization APIs to make them more pluggable, extensible, etc.  This JIRA is a 
place to discuss what API changes are necessary for the long term, and to 
provide links to other relevant JIRAs.

Eventually, I hope this leads to a design doc outlining:
* current issues
* requirements such as supporting many types of objective functions, 
optimization algorithms, and parameters to those algorithms
* ideal API
* breakdown of smaller JIRAs needed to achieve that API

I will soon create an initial design doc, and I will try to watch this JIRA and 
include ideas from JIRA comments.



> Improving MLlib optimization APIs
> -
>
> Key: SPARK-5256
> URL: https://issues.apache.org/jira/browse/SPARK-5256
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> *Goal*: Improve APIs for optimization
> *Motivation*: There have been several disjoint mentions of improving the 
> optimization APIs to make them more pluggable, extensible, etc.  This JIRA is 
> a place to discuss what API changes are necessary for the long term, and to 
> provide links to other relevant JIRAs.
> Eventually, I hope this leads to a design doc outlining:
> * current issues
> * requirements such as supporting many types of objective functions, 
> optimization algorithms, and parameters to those algorithms
> * ideal API
> * breakdown of smaller JIRAs needed to achieve that API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13610) Create a Transformer to disassemble vectors in DataFrames

2017-01-18 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828756#comment-15828756
 ] 

Joseph K. Bradley commented on SPARK-13610:
---

One more: Would these selected subsets of elements be chosen by hand, or would 
they need to be chosen automatically via code?

> Create a Transformer to disassemble vectors in DataFrames
> -
>
> Key: SPARK-13610
> URL: https://issues.apache.org/jira/browse/SPARK-13610
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 1.6.0
>Reporter: Andrew MacKinlay
>Priority: Minor
>
> It is possible to convert a standalone numeric field into a single-item 
> Vector, using VectorAssembler. However, the inverse operation of retrieving a 
> single item from a vector and translating it into a field doesn't appear to 
> be possible. The workaround I've found is to leave the raw field value in the 
> DF, but I have found no other way to get a field out of a vector (e.g. to 
> perform arithmetic on it). Happy to be proved wrong though. Creating a 
> user-defined function doesn't work (in Python at least; it gets a 
> pickle exception). This seems like a simple operation which should be supported 
> for various use cases. 
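
For what it's worth, a minimal Scala sketch of the UDF workaround being discussed (assuming Spark 2.x and a DataFrame `df` with an ML Vector column named "features"):
{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, lit, udf}

// Pull a single Vector element out into a plain Double column.
val vectorElement = udf { (v: Vector, i: Int) => v(i) }

val withFirstElement = df.withColumn("features_0", vectorElement(col("features"), lit(0)))
{code}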



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19053) Supporting multiple evaluation metrics in DataFrame-based API: discussion

2017-01-18 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828755#comment-15828755
 ] 

Joseph K. Bradley commented on SPARK-19053:
---

After thinking about this more and hearing your thoughts, my top pick would be 
to define Metrics classes similar to the spark.mllib ones, with the current 
model summaries inheriting from these metrics classes.  (We should sketch out 
APIs before implementation, though, to make sure this setup will not break any 
APIs or force awkward APIs.)

Metrics classes could offer both aggregate and per-row metrics.

> Supporting multiple evaluation metrics in DataFrame-based API: discussion
> -
>
> Key: SPARK-19053
> URL: https://issues.apache.org/jira/browse/SPARK-19053
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA is to discuss supporting the computation of multiple evaluation 
> metrics efficiently in the DataFrame-based API for MLlib.
> In the RDD-based API, RegressionMetrics and other *Metrics classes support 
> efficient computation of multiple metrics.
> In the DataFrame-based API, there are a few options:
> * model/result summaries (e.g., LogisticRegressionSummary): These currently 
> provide the desired functionality, but they require a model and do not let 
> users compute metrics manually from DataFrames of predictions and true labels.
> * Evaluator classes (e.g., RegressionEvaluator): These only support computing 
> a single metric in one pass over the data, but they do not require a model.
> * new class analogous to Metrics: We could introduce a class analogous to 
> Metrics.  Model/result summaries could use this internally as a replacement 
> for spark.mllib Metrics classes, or they could (maybe) inherit from these 
> classes.
> Thoughts?
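
To illustrate the contrast being discussed, a minimal sketch of the two existing options (`predictionAndLabels` is an assumed RDD[(Double, Double)] and `predDF` an assumed DataFrame of predictions and labels):
{code}
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.ml.evaluation.RegressionEvaluator

// RDD-based API: one pass yields several metrics, no model required.
val metrics = new RegressionMetrics(predictionAndLabels)
println((metrics.rootMeanSquaredError, metrics.meanAbsoluteError, metrics.r2))

// DataFrame-based API: no model required either, but one metric per evaluate() call.
val evaluator = new RegressionEvaluator()
  .setPredictionCol("prediction")
  .setLabelCol("label")
  .setMetricName("rmse")
println(evaluator.evaluate(predDF))
{code}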



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14975) Predicted Probability per training instance for Gradient Boosted Trees in mllib.

2017-01-18 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14975:
--
Shepherd: Joseph K. Bradley

> Predicted Probability per training instance for Gradient Boosted Trees in 
> mllib. 
> -
>
> Key: SPARK-14975
> URL: https://issues.apache.org/jira/browse/SPARK-14975
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Partha Talukder
>Assignee: Ilya Matiach
>Priority: Minor
>  Labels: mllib
>
> This function is available for Logistic Regression, SVM, etc. 
> (model.setThreshold()) but not for GBT.  In comparison, in the "gbm" package in 
> R we can specify the distribution and get predicted probabilities or classes. 
> I understand that this algorithm works with "Classification" and "Regression" 
> algorithms. Is there any way in GBT to get predicted probabilities or provide 
> thresholds to the model?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14975) Predicted Probability per training instance for Gradient Boosted Trees in mllib.

2017-01-18 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14975:
--
Assignee: Ilya Matiach

> Predicted Probability per training instance for Gradient Boosted Trees in 
> mllib. 
> -
>
> Key: SPARK-14975
> URL: https://issues.apache.org/jira/browse/SPARK-14975
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Partha Talukder
>Assignee: Ilya Matiach
>Priority: Minor
>  Labels: mllib
>
> This function is available for Logistic Regression, SVM, etc. 
> (model.setThreshold()) but not for GBT.  In comparison, in the "gbm" package in 
> R we can specify the distribution and get predicted probabilities or classes. 
> I understand that this algorithm works with "Classification" and "Regression" 
> algorithms. Is there any way in GBT to get predicted probabilities or provide 
> thresholds to the model?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17747) WeightCol support non-double datatypes

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17747:
--
Shepherd: Joseph K. Bradley
Assignee: zhengruifeng
Target Version/s: 2.2.0

> WeightCol support non-double datatypes
> --
>
> Key: SPARK-17747
> URL: https://issues.apache.org/jira/browse/SPARK-17747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> WeightCol only supports the double type now; it should also accept other numeric 
> types, such as Int.
> {code}
> scala> df3.show(5)
> +-++--+
> |label|features|weight|
> +-++--+
> |  0.0|(692,[127,128,129...| 1|
> |  1.0|(692,[158,159,160...| 1|
> |  1.0|(692,[124,125,126...| 1|
> |  1.0|(692,[152,153,154...| 1|
> |  1.0|(692,[151,152,153...| 1|
> +-++--+
> only showing top 5 rows
> scala> val lr = new 
> LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_ee0308a72919
> scala> val lrm = lr.fit(df3)
> 16/09/20 15:46:12 WARN LogisticRegression: LogisticRegression training 
> finished but the result is not converged because: max iterations reached
> lrm: org.apache.spark.ml.classification.LogisticRegressionModel = 
> logreg_ee0308a72919
> scala> val lr = new 
> LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setWeightCol("weight")
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_ced7579d5680
> scala> val lrm = lr.fit(df3)
> 16/09/20 15:46:27 WARN BlockManager: Putting block rdd_211_0 failed
> 16/09/20 15:46:27 ERROR Executor: Exception in task 0.0 in stage 89.0 (TID 92)
> scala.MatchError: 
> [0.0,1,(692,[127,128,129,130,131,154,155,156,157,158,159,181,182,183,184,185,186,187,188,189,207,208,209,210,211,212,213,214,215,216,217,235,236,237,238,239,240,241,242,243,244,245,262,263,264,265,266,267,268,269,270,271,272,273,289,290,291,292,293,294,295,296,297,300,301,302,316,317,318,319,320,321,328,329,330,343,344,345,346,347,348,349,356,357,358,371,372,373,374,384,385,386,399,400,401,412,413,414,426,427,428,429,440,441,442,454,455,456,457,466,467,468,469,470,482,483,484,493,494,495,496,497,510,511,512,520,521,522,523,538,539,540,547,548,549,550,566,567,568,569,570,571,572,573,574,575,576,577,578,594,595,596,597,598,599,600,601,602,603,604,622,623,624,625,626,627,628,629,630,651,652,653,654,655,656,657],[51.0,159.0,253.0,159.0,50.0,48.0,238.0,252.0,252.0,252.0,237.0,54.0,227.0,253.0,252.0,239.0,233.0,252.0,57.0,6.0,10.0,60.0,224.0,252.0,253.0,252.0,202.0,84.0,252.0,253.0,122.0,163.0,252.0,252.0,252.0,253.0,252.0,252.0,96.0,189.0,253.0,167.0,51.0,238.0,253.0,253.0,190.0,114.0,253.0,228.0,47.0,79.0,255.0,168.0,48.0,238.0,252.0,252.0,179.0,12.0,75.0,121.0,21.0,253.0,243.0,50.0,38.0,165.0,253.0,233.0,208.0,84.0,253.0,252.0,165.0,7.0,178.0,252.0,240.0,71.0,19.0,28.0,253.0,252.0,195.0,57.0,252.0,252.0,63.0,253.0,252.0,195.0,198.0,253.0,190.0,255.0,253.0,196.0,76.0,246.0,252.0,112.0,253.0,252.0,148.0,85.0,252.0,230.0,25.0,7.0,135.0,253.0,186.0,12.0,85.0,252.0,223.0,7.0,131.0,252.0,225.0,71.0,85.0,252.0,145.0,48.0,165.0,252.0,173.0,86.0,253.0,225.0,114.0,238.0,253.0,162.0,85.0,252.0,249.0,146.0,48.0,29.0,85.0,178.0,225.0,253.0,223.0,167.0,56.0,85.0,252.0,252.0,252.0,229.0,215.0,252.0,252.0,252.0,196.0,130.0,28.0,199.0,252.0,252.0,253.0,252.0,252.0,233.0,145.0,25.0,128.0,252.0,253.0,252.0,141.0,37.0])]
>  (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>   at 
> org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
>   at 
> org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.co

[jira] [Resolved] (SPARK-14567) Add instrumentation logs to MLlib training algorithms

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14567.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

> Add instrumentation logs to MLlib training algorithms
> -
>
> Key: SPARK-14567
> URL: https://issues.apache.org/jira/browse/SPARK-14567
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
> Fix For: 2.2.0
>
>
> In order to debug performance issues when training mllib algorithms,
> it is useful to log some metrics about the training dataset, the training 
> parameters, etc.
> This ticket is an umbrella to add some simple logging messages to the most 
> common MLlib estimators. There should be no performance impact on the current 
> implementation, and the output is simply printed in the logs.
> Here are some values that are of interest when debugging training tasks:
> * number of features
> * number of instances
> * number of partitions
> * number of classes
> * input RDD/DF cache level
> * hyper-parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18206) Log instrumentation in MPC, NB, LDA, AFT, GLR, Isotonic, LinReg

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18206.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 15671
[https://github.com/apache/spark/pull/15671]

> Log instrumentation in MPC, NB, LDA, AFT, GLR, Isotonic, LinReg
> ---
>
> Key: SPARK-18206
> URL: https://issues.apache.org/jira/browse/SPARK-18206
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.2.0
>
>
> See parent JIRA



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7146) Should ML sharedParams be a public API?

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7146:
-
Target Version/s:   (was: 2.2.0)

> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Proposal: Make most of the Param traits in sharedParams.scala public.  Mark 
> them as DeveloperApi.
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> h3. UPDATED proposal
> * Some Params are clearly safe to make public.  We will do so.
> * Some Params could be made public but may require caveats in the trait doc.
> * Some Params have turned out not to be shared in practice.  We can move 
> those Params to the classes which use them.
> *Public shared params*:
> * I/O column params
> ** HasFeaturesCol
> ** HasInputCol
> ** HasInputCols
> ** HasLabelCol
> ** HasOutputCol
> ** HasPredictionCol
> ** HasProbabilityCol
> ** HasRawPredictionCol
> ** HasVarianceCol
> ** HasWeightCol
> * Algorithm settings
> ** HasCheckpointInterval
> ** HasElasticNetParam
> ** HasFitIntercept
> ** HasMaxIter
> ** HasRegParam
> ** HasSeed
> ** HasStandardization (less common)
> ** HasStepSize
> ** HasTol
> *Questionable params*:
> * HasHandleInvalid (only used in StringIndexer, but might be more widely used 
> later on)
> * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but 
> same meaning as Optimizer in LDA)
> *Params to be removed from sharedParams*:
> * HasThreshold (only used in LogisticRegression)
> * HasThresholds (only used in ProbabilisticClassifier)
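
For context, each trait in sharedParams.scala is shaped roughly like the sketch below; this is a simplified illustration rather than the exact generated source, but it shows the kind of trait that would become public under the proposal above:

{code}
import org.apache.spark.ml.param.{IntParam, ParamValidators, Params}

// Simplified sketch of a shared Param trait. Mixing it into an Estimator or Model
// gives that class a standard `maxIter` Param with a consistent name, doc string,
// and validation, which is the standardization benefit discussed above.
trait HasMaxIter extends Params {

  final val maxIter: IntParam =
    new IntParam(this, "maxIter", "maximum number of iterations (>= 0)", ParamValidators.gtEq(0))

  final def getMaxIter: Int = $(maxIter)
}
{code}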



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10931) PySpark ML Models should contain Param values

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10931:
--
Shepherd:   (was: Joseph K. Bradley)

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7424) spark.ml classification, regression abstractions should add metadata to output column

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7424:
-
Target Version/s:   (was: 2.2.0)

> spark.ml classification, regression abstractions should add metadata to 
> output column
> -
>
> Key: SPARK-7424
> URL: https://issues.apache.org/jira/browse/SPARK-7424
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Update ClassificationModel, ProbabilisticClassificationModel prediction to 
> include numClasses in output column metadata.
> Update RegressionModel to specify output column metadata as well.
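
A sketch of how numClasses could be attached through the existing ml.attribute API when the prediction column is produced; the helper and column names here are illustrative, not the actual patch:

{code}
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Attach numClasses to the prediction column's metadata so downstream stages
// (evaluators, IndexToString, etc.) can discover it without rescanning the data.
def withPredictionMetadata(
    predictions: DataFrame,
    predictionCol: String,
    numClasses: Int): DataFrame = {
  val meta = NominalAttribute.defaultAttr
    .withName(predictionCol)
    .withNumValues(numClasses)
    .toMetadata()
  predictions.withColumn(predictionCol, col(predictionCol).as(predictionCol, meta))
}
{code}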



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14501) spark.ml parity for fpm - frequent items

2017-01-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827052#comment-15827052
 ] 

Joseph K. Bradley commented on SPARK-14501:
---

I'm doing a general pass to encourage the Shepherd requirement from the roadmap 
process in [SPARK-18813].

Is someone interested in shepherding this issue, or shall I remove the target 
version?

I'd really like to get this into 2.2 but don't have time to review it right 
now.  Could another committer take it?

> spark.ml parity for fpm - frequent items
> 
>
> Key: SPARK-14501
> URL: https://issues.apache.org/jira/browse/SPARK-14501
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is an umbrella for porting the spark.mllib.fpm subpackage to spark.ml.
> I am initially creating a single subtask, which will require a brief design 
> doc for the DataFrame-based API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10931) PySpark ML Models should contain Param values

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10931:
--
Target Version/s:   (was: 2.2.0)

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15571) Pipeline unit test improvements for 2.2

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15571:
--
Shepherd: Joseph K. Bradley

> Pipeline unit test improvements for 2.2
> ---
>
> Key: SPARK-15571
> URL: https://issues.apache.org/jira/browse/SPARK-15571
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> Issue:
> * There are several pieces of standard functionality shared by all 
> algorithms: Params, UIDs, fit/transform/save/load, etc.  Currently, these 
> pieces are generally tested in ad hoc tests for each algorithm.
> * This has led to inconsistent coverage, especially within the Python API.
> Goal:
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality.
> * Eliminate duplicate code.  Improve test coverage.  Simplify adding these 
> standard unit tests for future algorithms and APIs.
> This will require several subtasks.  If you identify an issue, please create 
> a subtask, or comment below if the issue needs to be discussed first.
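
One possible shape for a shared check on the Scala side is sketched below. It is a generic illustration (not Spark's existing test helpers such as DefaultReadWriteTest), and the method name and assertions are assumptions for this example:

{code}
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.util.{MLReadable, MLWritable}
import org.apache.spark.sql.Dataset

// Generic round-trip check: fit, save, reload, then verify that the uid and the
// Param values survive persistence. A real suite would also cover copy(), defaults, etc.
def checkSaveLoad[M <: Model[M] with MLWritable](
    estimator: Estimator[M],
    data: Dataset[_],
    loader: MLReadable[M],
    path: String): Unit = {
  val model = estimator.fit(data)
  model.write.overwrite().save(path)
  val loaded = loader.load(path)
  assert(loaded.uid == model.uid, "uid changed across save/load")
  val loadedMap = loaded.extractParamMap()
  model.extractParamMap().toSeq.foreach { pair =>
    val loadedValue = loadedMap.get(loaded.getParam(pair.param.name))
    assert(loadedValue.contains(pair.value), s"param ${pair.param.name} changed across save/load")
  }
}
{code}

A single helper like this can be reused by every algorithm's suite, which is exactly the duplication this issue wants to eliminate.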



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14659) OneHotEncoder support drop first category alphabetically in the encoded vector

2017-01-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827041#comment-15827041
 ] 

Joseph K. Bradley commented on SPARK-14659:
---

I'm doing a general pass to encourage the Shepherd requirement from the roadmap 
process in [SPARK-18813].

Is someone interested in shepherding this issue, or shall I remove the target 
version?

> OneHotEncoder support drop first category alphabetically in the encoded 
> vector 
> ---
>
> Key: SPARK-14659
> URL: https://issues.apache.org/jira/browse/SPARK-14659
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> R formula drops the first category alphabetically when encoding a 
> string/category feature. Spark RFormula uses OneHotEncoder to encode a 
> string/category feature into a vector, but it only supports "dropLast" by 
> string/category frequencies. This causes SparkR to produce different models 
> than native R.
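
To make the difference concrete, here is a small sketch against the current Scala API; the data and column names are made up for illustration:

{code}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ohe-droplast-demo").getOrCreate()
import spark.implicits._

// Toy data: "b" is the most frequent label, "a" is the alphabetically first one.
val df = Seq("a", "b", "b", "c").toDF("category")

// StringIndexer assigns index 0 to the most frequent label ("b"), not to "a".
val indexed = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
  .transform(df)

// dropLast=true (the default) drops the label with the last index, i.e. a least
// frequent one, rather than the alphabetically first level that R's formula drops.
val encoded = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .setDropLast(true)
  .transform(indexed)

encoded.show(truncate = false)
{code}

Supporting a "drop first alphabetically" mode would let RFormula reproduce the reference levels R users expect.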



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14706) Python ML persistence integration test

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14706:
--
Target Version/s:   (was: 2.2.0)

> Python ML persistence integration test
> --
>
> Key: SPARK-14706
> URL: https://issues.apache.org/jira/browse/SPARK-14706
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> Goal: extend integration test in {{ml/tests.py}}.
> In the {{PersistenceTest}} suite, there is a method {{_compare_pipelines}}.  
> This issue includes:
> * Extending {{_compare_pipelines}} to handle CrossValidator, 
> TrainValidationSplit, and OneVsRest
> * Adding an integration test in PersistenceTest which includes nested 
> meta-algorithms.  E.g.: {{Pipeline[ CrossValidator[ TrainValidationSplit[ 
> OneVsRest[ LogisticRegression ] ] ] ]}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2017-01-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827038#comment-15827038
 ] 

Joseph K. Bradley commented on SPARK-15799:
---

I'm doing a general pass to encourage the Shepherd requirement from the roadmap 
process in [SPARK-18813].

Is someone interested in shepherding this issue, or shall I remove the target 
version?

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16578) Configurable hostname for RBackend

2017-01-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827036#comment-15827036
 ] 

Joseph K. Bradley commented on SPARK-16578:
---

I'm doing a general pass to encourage the Shepherd requirement from the roadmap 
process in [SPARK-18813].

Is someone interested in shepherding this issue, or shall I remove the target 
version?

> Configurable hostname for RBackend
> --
>
> Key: SPARK-16578
> URL: https://issues.apache.org/jira/browse/SPARK-16578
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Junyang Qian
>
> One of the requirements that comes up with SparkR being a standalone package 
> is that users can now install just the R package on the client side and 
> connect to a remote machine which runs the RBackend class.
> We should check whether we can support this mode of execution, and what its 
> pros and cons are.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs

2017-01-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17455:
--
Shepherd: Joseph K. Bradley

> IsotonicRegression takes non-polynomial time for some inputs
> 
>
> Key: SPARK-17455
> URL: https://issues.apache.org/jira/browse/SPARK-17455
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0
>Reporter: Nic Eggert
>
> The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently 
> in MLlib can take O(N!) time for certain inputs, when it should have 
> worst-case complexity of O(N^2).
> To reproduce this, I pulled the private method poolAdjacentViolators out of 
> mllib.regression.IsotonicRegression and into a benchmarking harness.
> Given this input
> {code}
> val x = (1 to length).toArray.map(_.toDouble)
> val y = x.reverse.zipWithIndex.map { case (yi, i) =>
>   if (i % 2 == 1) yi - 1.5 else yi
> }
> val w = Array.fill(length)(1d)
> val input: Array[(Double, Double, Double)] = (y zip x zip w).map { case ((y, x), w) =>
>   (y, x, w)
> }
> {code}
> I vary the length of the input to get these timings:
> || Input Length || Time (us) ||
> | 100 | 1.35 |
> | 200 | 3.14 | 
> | 400 | 116.10 |
> | 800 | 2134225.90 |
> (tests were performed using 
> https://github.com/sirthias/scala-benchmarking-template)
> I can also confirm that I run into this issue on a real dataset I'm working 
> on when trying to calibrate random forest probability output. Some partitions 
> take > 12 hours to run. This isn't a skew issue, since the largest partitions 
> finish in minutes. I can only assume that some partitions cause something 
> approaching this worst-case complexity.
> I'm working on a patch that borrows the implementation that is used in 
> scikit-learn and the R "iso" package, both of which handle this particular 
> input in linear time and are quadratic in the worst case.
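
For reference, the linear-in-practice variant being proposed looks roughly like the standalone sketch below (pool-adjacent-violators with backward merging of blocks, in the style of scikit-learn and the "iso" package, not Spark's then-current code):

{code}
// Input: target values y and weights w, already ordered by increasing feature value.
// Returns the isotonic (non-decreasing) fit; merging is amortized O(n).
def isotonicFit(y: Array[Double], w: Array[Double]): Array[Double] = {
  require(y.length == w.length)
  val n = y.length
  val mean = new Array[Double](n)    // weighted mean of each active block
  val weight = new Array[Double](n)  // total weight of each active block
  val size = new Array[Int](n)       // number of points pooled into each active block
  var blocks = 0
  var i = 0
  while (i < n) {
    // Start a new block for point i ...
    mean(blocks) = y(i)
    weight(blocks) = w(i)
    size(blocks) = 1
    blocks += 1
    // ... then merge backwards while the last two blocks violate monotonicity.
    while (blocks > 1 && mean(blocks - 2) > mean(blocks - 1)) {
      val wSum = weight(blocks - 2) + weight(blocks - 1)
      mean(blocks - 2) =
        (mean(blocks - 2) * weight(blocks - 2) + mean(blocks - 1) * weight(blocks - 1)) / wSum
      weight(blocks - 2) = wSum
      size(blocks - 2) += size(blocks - 1)
      blocks -= 1
    }
    i += 1
  }
  // Expand the block means back to one fitted value per input point.
  val fitted = new Array[Double](n)
  var out = 0
  var b = 0
  while (b < blocks) {
    var k = 0
    while (k < size(b)) {
      fitted(out) = mean(b)
      out += 1
      k += 1
    }
    b += 1
  }
  fitted
}
{code}

Each point is pushed once and merged away at most once, so the alternating input above no longer triggers repeated re-pooling.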



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18348) Improve tree ensemble model summary

2017-01-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827023#comment-15827023
 ] 

Joseph K. Bradley commented on SPARK-18348:
---

I'm doing a general pass to enforce the Shepherd requirement from the roadmap 
process in [SPARK-18813].  Note that the process is a new proposal and can be 
updated as needed.

Is someone interested in shepherding this issue, or shall I remove the target 
version?


> Improve tree ensemble model summary
> ---
>
> Key: SPARK-18348
> URL: https://issues.apache.org/jira/browse/SPARK-18348
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Felix Cheung
>
> During work on R APIs for tree ensemble models (e.g. Random Forest, GBT) it 
> was discovered and discussed that
> - we don't have a good summary of nodes or trees covering their observations, 
> loss, probability, and so on
> - we don't have a shared API with nicely formatted output
> We believe a shared API here could benefit multiple language bindings, 
> including R, when available.
> For example, here is what R {code}rpart{code} shows for model summary:
> {code}
> Call:
> rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
> method = "class")
>   n= 81
>           CP nsplit rel error    xerror      xstd
> 1 0.17647059      0     1.000     1.000 0.2155872
> 2 0.01960784      1 0.8235294 0.9411765 0.2107780
> 3 0.0100          4 0.7647059 1.0588235 0.2200975
> Variable importance
>  Start    Age Number
>     64     24     12
> Node number 1: 81 observations,    complexity param=0.1764706
>   predicted class=absent   expected loss=0.2098765  P(node) =1
>     class counts:    64    17
>    probabilities: 0.790 0.210
>   left son=2 (62 obs) right son=3 (19 obs)
>   Primary splits:
>       Start  < 8.5  to the right, improve=6.762330, (0 missing)
>       Number < 5.5  to the left,  improve=2.866795, (0 missing)
>       Age    < 39.5 to the left,  improve=2.250212, (0 missing)
>   Surrogate splits:
>       Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)
> Node number 2: 62 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
>     class counts:    56     6
>    probabilities: 0.903 0.097
>   left son=4 (29 obs) right son=5 (33 obs)
>   Primary splits:
>       Start  < 14.5 to the right, improve=1.0205280, (0 missing)
>       Age    < 55   to the left,  improve=0.6848635, (0 missing)
>       Number < 4.5  to the left,  improve=0.2975332, (0 missing)
>   Surrogate splits:
>       Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
>       Age    < 16   to the left,  agree=0.597, adj=0.138, (0 split)
> Node number 3: 19 observations
>   predicted class=present  expected loss=0.4210526  P(node) =0.2345679
>     class counts:     8    11
>    probabilities: 0.421 0.579
> Node number 4: 29 observations
>   predicted class=absent   expected loss=0  P(node) =0.3580247
>     class counts:    29     0
>    probabilities: 1.000 0.000
> Node number 5: 33 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
>     class counts:    27     6
>    probabilities: 0.818 0.182
>   left son=10 (12 obs) right son=11 (21 obs)
>   Primary splits:
>       Age    < 55   to the left,  improve=1.2467530, (0 missing)
>       Start  < 12.5 to the right, improve=0.2887701, (0 missing)
>       Number < 3.5  to the right, improve=0.1753247, (0 missing)
>   Surrogate splits:
>       Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
>       Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)
> Node number 10: 12 observations
>   predicted class=absent   expected loss=0  P(node) =0.1481481
>     class counts:    12     0
>    probabilities: 1.000 0.000
> Node number 11: 21 observations,    complexity param=0.01960784
>   predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
>     class counts:    15     6
>    probabilities: 0.714 0.286
>   left son=22 (14 obs) right son=23 (7 obs)
>   Primary splits:
>       Age    < 111  to the right, improve=1.71428600, (0 missing)
>       Start  < 12.5 to the right, improve=0.79365080, (0 missing)
>       Number < 3.5  to the right, improve=0.07142857, (0 missing)
> Node number 22: 14 observations
>   predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
>     class counts:    12     2
>    probabilities: 0.857 0.143
> Node number 23: 7 observations
>   predicted class=present  expected loss=0.4285714  P(node) =0.08641975
>     class counts:     3     4
>    probabilities: 0.429 0.571
> {code}
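
For comparison, the information spark.ml exposes today is much thinner. Below is a sketch of what can already be pulled from a fitted forest, assuming a RandomForestClassificationModel obtained from an earlier fit; the helper name is made up for this example:

{code}
import org.apache.spark.ml.classification.RandomForestClassificationModel

// `model` is assumed to be an already-fitted RandomForestClassificationModel.
def printForestSummary(model: RandomForestClassificationModel): Unit = {
  println(s"numTrees=${model.trees.length}, totalNumNodes=${model.totalNumNodes}")
  println(s"featureImportances=${model.featureImportances}")
  model.trees.zipWithIndex.foreach { case (tree, i) =>
    println(s"tree $i: depth=${tree.depth}, numNodes=${tree.numNodes}")
  }
  // Full, unformatted dump of every split; this is the output the proposed
  // shared summary API would improve on.
  println(model.toDebugString)
}
{code}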



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

