[jira] [Commented] (SPARK-2574) Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner
[ https://issues.apache.org/jira/browse/SPARK-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067815#comment-14067815 ] Sandeep Singh commented on SPARK-2574: -- [~sandyr] we can rewrite mergeCombiners as (c1: ArrayBuffer[V], c2: ArrayBuffer[V]) => c1 ++= c2, instead of (c1: ArrayBuffer[V], c2: ArrayBuffer[V]) => c1 ++ c2 > Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner > -- > > Key: SPARK-2574 > URL: https://issues.apache.org/jira/browse/SPARK-2574 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Sandy Ryza > -- This message was sent by Atlassian JIRA (v6.2#6252)
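The difference can be illustrated with plain Scala collections, outside Spark: `++` builds and returns a brand-new ArrayBuffer (the allocation this ticket wants to avoid), while `++=` appends into the left-hand buffer in place and returns it.

```scala
import scala.collection.mutable.ArrayBuffer

val c1 = ArrayBuffer(1, 2)
val c2 = ArrayBuffer(3, 4)

// `++` allocates a fresh buffer; c1 is left untouched.
val fresh = c1 ++ c2
assert(fresh ne c1)                  // a different object was allocated
assert(c1 == ArrayBuffer(1, 2))

// `++=` mutates c1 in place and returns c1 itself: no new allocation.
val merged = c1 ++= c2
assert(merged eq c1)                 // same object, no copy
assert(c1 == ArrayBuffer(1, 2, 3, 4))
```

Since mergeCombiners may only consume its arguments, the in-place variant is safe there and avoids one buffer allocation per merge.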
[jira] [Commented] (SPARK-2597) Improve the code related to Table Scan
[ https://issues.apache.org/jira/browse/SPARK-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067814#comment-14067814 ] Yin Huai commented on SPARK-2597: - Hive uses HiveInputFormat as the wrapper of different InputFormats. We may want to have a similar approach (HiveInputFormat cannot be used directly). > Improve the code related to Table Scan > -- > > Key: SPARK-2597 > URL: https://issues.apache.org/jira/browse/SPARK-2597 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > There are several issues with the current code related to Table Scan. > 1. HadoopTableReader and HiveTableScan are used together to deal with Hive > tables. It is not clear why we do the Hive-specific work in two different > places. > 2. HadoopTableReader creates an RDD for every Hive partition and then unions > these RDDs. Is it the right way to handle partitioned tables? > 3. Right now, we ship initializeLocalJobConfFunc to every task to set some > local properties. Can we avoid it? > I think it will be good to improve the code related to Table Scan. Also, it > is important to make sure we do not introduce performance issues with the > proposed changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2597) Improve the code related to Table Scan
Yin Huai created SPARK-2597: --- Summary: Improve the code related to Table Scan Key: SPARK-2597 URL: https://issues.apache.org/jira/browse/SPARK-2597 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai There are several issues with the current code related to Table Scan. 1. HadoopTableReader and HiveTableScan are used together to deal with Hive tables. It is not clear why we do the Hive-specific work in two different places. 2. HadoopTableReader creates an RDD for every Hive partition and then unions these RDDs. Is it the right way to handle partitioned tables? 3. Right now, we ship initializeLocalJobConfFunc to every task to set some local properties. Can we avoid it? I think it will be good to improve the code related to Table Scan. Also, it is important to make sure we do not introduce performance issues with the proposed changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2524) missing document about spark.deploy.retainedDrivers
[ https://issues.apache.org/jira/browse/SPARK-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2524. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1443 [https://github.com/apache/spark/pull/1443] > missing document about spark.deploy.retainedDrivers > --- > > Key: SPARK-2524 > URL: https://issues.apache.org/jira/browse/SPARK-2524 > Project: Spark > Issue Type: Bug > Components: Deploy >Reporter: Lianhui Wang > Fix For: 1.1.0 > > > The configuration on spark.deploy.retainedDrivers is undocumented but > actually used > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L60 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2524) missing document about spark.deploy.retainedDrivers
[ https://issues.apache.org/jira/browse/SPARK-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2524: --- Assignee: Lianhui Wang > missing document about spark.deploy.retainedDrivers > --- > > Key: SPARK-2524 > URL: https://issues.apache.org/jira/browse/SPARK-2524 > Project: Spark > Issue Type: Bug > Components: Deploy >Reporter: Lianhui Wang >Assignee: Lianhui Wang > Fix For: 1.1.0 > > > The configuration on spark.deploy.retainedDrivers is undocumented but > actually used > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L60 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2587) Error message is incorrect in make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2587. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1489 [https://github.com/apache/spark/pull/1489] > Error message is incorrect in make-distribution.sh > -- > > Key: SPARK-2587 > URL: https://issues.apache.org/jira/browse/SPARK-2587 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Mark Wagner >Assignee: Mark Wagner >Priority: Minor > Fix For: 1.1.0 > > > SPARK-2526 removed some options in favor of using Maven profiles, but it now > gives incorrect guidance for those that try to use the old --with-hive flag: > "--with-hive' is no longer supported, use Maven option -Pyarn" -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2587) Error message is incorrect in make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2587: --- Assignee: Mark Wagner > Error message is incorrect in make-distribution.sh > -- > > Key: SPARK-2587 > URL: https://issues.apache.org/jira/browse/SPARK-2587 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Mark Wagner >Assignee: Mark Wagner >Priority: Minor > Fix For: 1.1.0 > > > SPARK-2526 removed some options in favor of using Maven profiles, but it now > gives incorrect guidance for those that try to use the old --with-hive flag: > "--with-hive' is no longer supported, use Maven option -Pyarn" -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.
[ https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067765#comment-14067765 ] Apache Spark commented on SPARK-2226: - User 'willb' has created a pull request for this issue: https://github.com/apache/spark/pull/1497 > HAVING should be able to contain aggregate expressions that don't appear in > the aggregation list. > -- > > Key: SPARK-2226 > URL: https://issues.apache.org/jira/browse/SPARK-2226 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: William Benton > > https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q > This test file contains the following query: > {code} > SELECT key FROM src GROUP BY key HAVING max(value) > "val_255"; > {code} > Once we fixed this issue, we should whitelist having.q. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2596) Populate pull requests on JIRA automatically
[ https://issues.apache.org/jira/browse/SPARK-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2596. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1496 [https://github.com/apache/spark/pull/1496] > Populate pull requests on JIRA automatically > > > Key: SPARK-2596 > URL: https://issues.apache.org/jira/browse/SPARK-2596 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Patrick Wendell >Assignee: Patrick Wendell > Fix For: 1.1.0 > > > For a bunch of reasons we should automatically populate a JIRA with > information about new pull requests when they arrive. I've written a small > python script to do this that we can run from Jenkins every 5 or 10 minutes > to keep things in Sync. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1682) Add gradient descent w/o sampling and RDA L1 updater
[ https://issues.apache.org/jira/browse/SPARK-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067739#comment-14067739 ] Apache Spark commented on SPARK-1682: - User 'dongwang218' has created a pull request for this issue: https://github.com/apache/spark/pull/643 > Add gradient descent w/o sampling and RDA L1 updater > > > Key: SPARK-1682 > URL: https://issues.apache.org/jira/browse/SPARK-1682 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Dong Wang > > The GradientDescent optimizer does sampling before a gradient step. When > input data is already shuffled beforehand, it is possible to scan data and > make gradient descent for each data instance. This could be potentially more > efficient. > Add enhanced RDA L1 updater, which could produce even sparse solutions with > comparable quality compared with L1. Reference: > Lin Xiao, "Dual Averaging Methods for Regularized Stochastic Learning and > Online Optimization", Journal of Machine Learning Research 11 (2010) > 2543-2596. > Small fix: add options to BinaryClassification example to read and write > model file -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-2596) Populate pull requests on JIRA automatically
[ https://issues.apache.org/jira/browse/SPARK-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2596: --- Comment: was deleted (was: This is a test: http://google.com) > Populate pull requests on JIRA automatically > > > Key: SPARK-2596 > URL: https://issues.apache.org/jira/browse/SPARK-2596 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Patrick Wendell >Assignee: Patrick Wendell > > For a bunch of reasons we should automatically populate a JIRA with > information about new pull requests when they arrive. I've written a small > python script to do this that we can run from Jenkins every 5 or 10 minutes > to keep things in Sync. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2596) Populate pull requests on JIRA automatically
[ https://issues.apache.org/jira/browse/SPARK-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067731#comment-14067731 ] Patrick Wendell commented on SPARK-2596: This is a test: http://google.com > Populate pull requests on JIRA automatically > > > Key: SPARK-2596 > URL: https://issues.apache.org/jira/browse/SPARK-2596 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Patrick Wendell >Assignee: Patrick Wendell > > For a bunch of reasons we should automatically populate a JIRA with > information about new pull requests when they arrive. I've written a small > python script to do this that we can run from Jenkins every 5 or 10 minutes > to keep things in Sync. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1022) Add unit tests for kafka streaming
[ https://issues.apache.org/jira/browse/SPARK-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067730#comment-14067730 ] Apache Spark commented on SPARK-1022: - User 'tdas' has created a pull request for this issue: [https://github.com/apache/spark/pull/557|https://github.com/apache/spark/pull/557] > Add unit tests for kafka streaming > -- > > Key: SPARK-1022 > URL: https://issues.apache.org/jira/browse/SPARK-1022 > Project: Spark > Issue Type: Bug >Reporter: Patrick Wendell >Assignee: Saisai Shao > > It would be nice if we could add unit tests to verify elements of kafka's > stream. Right now we do integration tests only which makes it hard to upgrade > versions of kafka. The place to start here would be to look at how kafka > tests itself and see if the functionality can be exposed to third party users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1630) PythonRDDs don't handle nulls gracefully
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067729#comment-14067729 ] Apache Spark commented on SPARK-1630: - User 'kalpit' has created a pull request for this issue: [https://github.com/apache/spark/pull/554|https://github.com/apache/spark/pull/554] > PythonRDDs don't handle nulls gracefully > > > Key: SPARK-1630 > URL: https://issues.apache.org/jira/browse/SPARK-1630 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.0, 0.9.1 >Reporter: Kalpit Shah > Fix For: 1.1.0 > > Original Estimate: 2h > Remaining Estimate: 2h > > If PythonRDDs receive a null element in iterators, they currently NPE. It > would be better to log a DEBUG message and skip the write of NULL elements. > Here are the 2 stack traces : > 14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread > Thread[stdin writer for python,5,main] > java.lang.NullPointerException > at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267) > at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88) > - > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.writeToFile. 
> : java.lang.NullPointerException > at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273) > at > org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247) > at > org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246) > at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285) > at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280) > at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.2#6252)
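The behaviour the reporter asks for, skipping nulls with a DEBUG log instead of NPE-ing, can be sketched in plain Scala. This is a hypothetical helper, not the actual PythonRDD.writeIteratorToStream code; `writeElements` and its parameters are made-up names for illustration.

```scala
// Hypothetical sketch: skip null elements instead of throwing NPE on write.
def writeElements[T](iter: Iterator[T], write: T => Unit, logDebug: String => Unit): Int = {
  var written = 0
  iter.foreach {
    case null => logDebug("Skipping null element in iterator")  // instead of NPE
    case elem => write(elem); written += 1
  }
  written
}

// Usage against an in-memory buffer standing in for the Python worker's stream:
val out = scala.collection.mutable.ArrayBuffer.empty[String]
val n = writeElements(Iterator("a", null, "b"), (s: String) => out += s, _ => ())
assert(n == 2)
assert(out == Seq("a", "b"))
```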
[jira] [Commented] (SPARK-1597) Add a version of reduceByKey that takes the Partitioner as a second argument
[ https://issues.apache.org/jira/browse/SPARK-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067727#comment-14067727 ] Apache Spark commented on SPARK-1597: - User 'techaddict' has created a pull request for this issue: [https://github.com/apache/spark/pull/550|https://github.com/apache/spark/pull/550] > Add a version of reduceByKey that takes the Partitioner as a second argument > > > Key: SPARK-1597 > URL: https://issues.apache.org/jira/browse/SPARK-1597 > Project: Spark > Issue Type: Bug >Reporter: Matei Zaharia >Assignee: Sandeep Singh >Priority: Blocker > > Most of our shuffle methods can take a Partitioner or a number of partitions > as a second argument, but for some reason reduceByKey takes the Partitioner > as a *first* argument: > http://spark.apache.org/docs/0.9.1/api/core/#org.apache.spark.rdd.PairRDDFunctions. > We should deprecate that version and add one where the Partitioner is the > second argument. -- This message was sent by Atlassian JIRA (v6.2#6252)
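The two overload shapes under discussion can be mocked up in pure Scala. This is an analogy, not Spark's actual PairRDDFunctions code: the `Partitioner` stand-in and the `data` parameter are simplifications, and the toy implementation ignores partitioning entirely since only the argument order matters here.

```scala
object ReduceByKeyShapes {
  // Stand-in for org.apache.spark.Partitioner, just to show the signatures.
  case class Partitioner(numPartitions: Int)

  // Existing shape: the Partitioner comes *first*, unlike other shuffle methods.
  def reduceByKey[K, V](p: Partitioner, f: (V, V) => V, data: Seq[(K, V)]): Map[K, V] =
    data.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // Proposed shape: function first, Partitioner second, matching groupByKey etc.
  def reduceByKey[K, V](f: (V, V) => V, p: Partitioner, data: Seq[(K, V)]): Map[K, V] =
    reduceByKey(p, f, data)
}

import ReduceByKeyShapes._
val counts = reduceByKey((a: Int, b: Int) => a + b, Partitioner(2),
                         Seq("x" -> 1, "x" -> 2, "y" -> 3))
assert(counts == Map("x" -> 3, "y" -> 3))
```

Because the parameter types differ, both overloads can coexist, which is what lets the old version be deprecated rather than removed outright.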
[jira] [Commented] (SPARK-1623) SPARK-1623. Broadcast cleaner should use getCanonicalPath when deleting files by name
[ https://issues.apache.org/jira/browse/SPARK-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067725#comment-14067725 ] Apache Spark commented on SPARK-1623: - User 'nsuthar' has created a pull request for this issue: [https://github.com/apache/spark/pull/546|https://github.com/apache/spark/pull/546] > SPARK-1623. Broadcast cleaner should use getCanonicalPath when deleting files > by name > - > > Key: SPARK-1623 > URL: https://issues.apache.org/jira/browse/SPARK-1623 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Niraj Suthar > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2596) Populate pull requests on JIRA automatically
Patrick Wendell created SPARK-2596: -- Summary: Populate pull requests on JIRA automatically Key: SPARK-2596 URL: https://issues.apache.org/jira/browse/SPARK-2596 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell For a bunch of reasons we should automatically populate a JIRA with information about new pull requests when they arrive. I've written a small python script to do this that we can run from Jenkins every 5 or 10 minutes to keep things in Sync. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1795) Add recursive directory file search to fileInputStream
[ https://issues.apache.org/jira/browse/SPARK-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067722#comment-14067722 ] Apache Spark commented on SPARK-1795: - User 'patrickotoole' has created a pull request for this issue: [https://github.com/apache/spark/pull/537|https://github.com/apache/spark/pull/537] > Add recursive directory file search to fileInputStream > -- > > Key: SPARK-1795 > URL: https://issues.apache.org/jira/browse/SPARK-1795 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Rick OToole > > When writing logs, they are often partitioned into a hierarchical directory > structure. This change will allow spark streaming to monitor all > sub-directories of a parent directory to find new files as they are added. > See https://github.com/apache/spark/pull/537 -- This message was sent by Atlassian JIRA (v6.2#6252)
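The proposed behaviour, finding files in all sub-directories of a monitored parent, can be sketched with `java.nio.file.Files.walk` in plain Scala. The actual patch lives in Spark Streaming's fileInputStream, which is not shown in this thread; `listFilesRecursively` is an illustrative name.

```scala
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

// Recursively collect the regular files under a root directory, which is
// roughly the set a recursive fileInputStream would monitor for new arrivals.
def listFilesRecursively(root: Path): Seq[Path] = {
  val stream = Files.walk(root)
  try stream.iterator().asScala.filter(Files.isRegularFile(_)).toList
  finally stream.close()
}

// Usage with a temporary hierarchical log layout:
val root = Files.createTempDirectory("logs")
val sub  = Files.createDirectories(root.resolve("2014/07/19"))
Files.createFile(sub.resolve("part-0000.log"))
assert(listFilesRecursively(root).size == 1)
```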
[jira] [Commented] (SPARK-1612) Potential resource leaks in Utils.copyStream and Utils.offsetBytes
[ https://issues.apache.org/jira/browse/SPARK-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067720#comment-14067720 ] Patrick Wendell commented on SPARK-1612: A pull request has been posted for this issue: Author: zsxwing URL: [https://github.com/apache/spark/pull/535|https://github.com/apache/spark/pull/535] > Potential resource leaks in Utils.copyStream and Utils.offsetBytes > -- > > Key: SPARK-1612 > URL: https://issues.apache.org/jira/browse/SPARK-1612 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Labels: easyfix > > Should move the "close" statements into a "finally" block. -- This message was sent by Atlassian JIRA (v6.2#6252)
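The fix described, moving the `close` calls into a `finally` block, looks roughly like this for a `copyStream`-style helper. This is a sketch of the pattern, not the actual Utils code.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, OutputStream}

// Copy `in` to `out`, guaranteeing both streams are closed even if the copy throws.
def copyStream(in: InputStream, out: OutputStream, closeStreams: Boolean = true): Long = {
  var count = 0L
  try {
    val buf = new Array[Byte](8192)
    var n = in.read(buf)
    while (n != -1) {
      out.write(buf, 0, n)
      count += n
      n = in.read(buf)
    }
    count
  } finally {
    // If the closes instead ran after the loop, an IOException mid-copy
    // would leak both streams; the nested try closes `out` even if `in.close()` throws.
    if (closeStreams) {
      try in.close() finally out.close()
    }
  }
}

val sink = new ByteArrayOutputStream()
assert(copyStream(new ByteArrayInputStream("spark".getBytes), sink) == 5L)
assert(sink.toString == "spark")
```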
[jira] [Issue Comment Deleted] (SPARK-1580) ALS: Estimate communication and computation costs given a partitioner
[ https://issues.apache.org/jira/browse/SPARK-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1580: --- Comment: was deleted (was: A pull request has been posted for this issue:Author: tmyklebuURL: https://github.com/apache/spark/pull/493) > ALS: Estimate communication and computation costs given a partitioner > - > > Key: SPARK-1580 > URL: https://issues.apache.org/jira/browse/SPARK-1580 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tor Myklebust >Priority: Minor > > It would be nice to be able to estimate the amount of work needed to solve an > ALS problem. The chief components of this "work" are computation time---time > spent forming and solving the least squares problems---and communication > cost---the number of bytes sent across the network. Communication cost > depends heavily on how the users and products are partitioned. > We currently do not try to cluster users or products so that fewer feature > vectors need to be communicated. This is intended as a first step toward > that end---we ought to be able to tell whether one partitioning is better > than another. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1581) Allow One Flume Avro RPC Server for Each Worker rather than Just One Worker
[ https://issues.apache.org/jira/browse/SPARK-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067719#comment-14067719 ] Patrick Wendell commented on SPARK-1581: A pull request has been posted for this issue:Author: christopheclcURL: https://github.com/apache/spark/pull/495 > Allow One Flume Avro RPC Server for Each Worker rather than Just One Worker > --- > > Key: SPARK-1581 > URL: https://issues.apache.org/jira/browse/SPARK-1581 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Christophe Clapp >Priority: Minor > Labels: Flume > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1580) ALS: Estimate communication and computation costs given a partitioner
[ https://issues.apache.org/jira/browse/SPARK-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067718#comment-14067718 ] Patrick Wendell commented on SPARK-1580: A pull request has been posted for this issue:Author: tmyklebuURL: https://github.com/apache/spark/pull/493 > ALS: Estimate communication and computation costs given a partitioner > - > > Key: SPARK-1580 > URL: https://issues.apache.org/jira/browse/SPARK-1580 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tor Myklebust >Priority: Minor > > It would be nice to be able to estimate the amount of work needed to solve an > ALS problem. The chief components of this "work" are computation time---time > spent forming and solving the least squares problems---and communication > cost---the number of bytes sent across the network. Communication cost > depends heavily on how the users and products are partitioned. > We currently do not try to cluster users or products so that fewer feature > vectors need to be communicated. This is intended as a first step toward > that end---we ought to be able to tell whether one partitioning is better > than another. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067696#comment-14067696 ] Chris Fregly commented on SPARK-1981: - [~pwendell] is there anything i need to do within the spark_ec2 scripts to make sure kinesis is built and/or enabled when EC2 instances are created? i want to make sure i'm covering all the bases. > Add AWS Kinesis streaming support > - > > Key: SPARK-1981 > URL: https://issues.apache.org/jira/browse/SPARK-1981 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Chris Fregly >Assignee: Chris Fregly > > Add AWS Kinesis support to Spark Streaming. > Initial discussion occurred here: https://github.com/apache/spark/pull/223 > I discussed this with Parviz from AWS recently and we agreed that I would > take this over. > Look for a new PR that takes into account all the feedback from the earlier > PR including spark-1.0-compliant implementation, AWS-license-aware build > support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2595) The driver run garbage collection, when the executor throws OutOfMemoryError exception
[ https://issues.apache.org/jira/browse/SPARK-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067653#comment-14067653 ] Guoqiang Li commented on SPARK-2595: Sorry I removed it. > The driver run garbage collection, when the executor throws OutOfMemoryError > exception > -- > > Key: SPARK-2595 > URL: https://issues.apache.org/jira/browse/SPARK-2595 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Guoqiang Li > > [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implementation > GC-based cleaning only consider the memory usage of the drive. We should > consider more factors to trigger gc. eg: executor exit code, task exception, > task gc time . -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2595) The driver run garbage collection, when the executor throws OutOfMemoryError exception
[ https://issues.apache.org/jira/browse/SPARK-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-2595: --- Description: [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implementation GC-based cleaning only consider the memory usage of the drive. We should consider more factors to trigger gc. eg: executor exit code, task exception, task gc time . was: [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implementation GC-based cleaning only consider the memory usage of the drive. We should consider more factors to trigger gc. eg: executor exit code, task exception, task gc time . [~pwendell]'s proposal: if we detect memory pressure on the executors we should try to trigger a GC on the driver so that if there happen to be RDD's that have gone out of scope on the driver side, their associated cache blocks will be cleaned up on executors and free up memory. > The driver run garbage collection, when the executor throws OutOfMemoryError > exception > -- > > Key: SPARK-2595 > URL: https://issues.apache.org/jira/browse/SPARK-2595 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Guoqiang Li > > [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implementation > GC-based cleaning only consider the memory usage of the drive. We should > consider more factors to trigger gc. eg: executor exit code, task exception, > task gc time . -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2595) The driver run garbage collection, when the executor throws OutOfMemoryError exception
[ https://issues.apache.org/jira/browse/SPARK-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067653#comment-14067653 ] Guoqiang Li edited comment on SPARK-2595 at 7/19/14 7:45 PM: - Sorry, I removed it. was (Author: gq): Sorry I removed it. > The driver run garbage collection, when the executor throws OutOfMemoryError > exception > -- > > Key: SPARK-2595 > URL: https://issues.apache.org/jira/browse/SPARK-2595 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Guoqiang Li > > [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implementation > GC-based cleaning only consider the memory usage of the drive. We should > consider more factors to trigger gc. eg: executor exit code, task exception, > task gc time . -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2595) The driver run garbage collection, when the executor throws OutOfMemoryError exception
[ https://issues.apache.org/jira/browse/SPARK-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067651#comment-14067651 ] Patrick Wendell commented on SPARK-2595: I was not proposing that we should do this. I was just attempting to summarize what the existing patch does. > The driver run garbage collection, when the executor throws OutOfMemoryError > exception > -- > > Key: SPARK-2595 > URL: https://issues.apache.org/jira/browse/SPARK-2595 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Guoqiang Li > > [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implementation > GC-based cleaning only consider the memory usage of the drive. We should > consider more factors to trigger gc. eg: executor exit code, task exception, > task gc time . > [~pwendell]'s proposal: > if we detect memory pressure on the executors we should try to trigger a GC > on the driver so that if there happen to be RDD's that have gone out of scope > on the driver side, their associated cache blocks will be cleaned up on > executors and free up memory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2595) The driver run garbage collection, when the executor throws OutOfMemoryError exception
[ https://issues.apache.org/jira/browse/SPARK-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-2595: --- Description: [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implementation GC-based cleaning only consider the memory usage of the drive. We should consider more factors to trigger gc. eg: executor exit code, task exception, task gc time . [~pwendell]'s proposal: if we detect memory pressure on the executors we should try to trigger a GC on the driver so that if there happen to be RDD's that have gone out of scope on the driver side, their associated cache blocks will be cleaned up on executors and free up memory. was: [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implementation GC-based cleaning only consider the memory usage of the drive. We should consider more factors to trigger gc.includes executor exit code, task exception, task gc time . [~pwendell]'s proposal: if we detect memory pressure on the executors we should try to trigger a GC on the driver so that if there happen to be RDD's that have gone out of scope on the driver side, their associated cache blocks will be cleaned up on executors and free up memory. > The driver run garbage collection, when the executor throws OutOfMemoryError > exception > -- > > Key: SPARK-2595 > URL: https://issues.apache.org/jira/browse/SPARK-2595 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Guoqiang Li > > [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implementation > GC-based cleaning only consider the memory usage of the drive. We should > consider more factors to trigger gc. eg: executor exit code, task exception, > task gc time . 
> [~pwendell]'s proposal: > if we detect memory pressure on the executors we should try to trigger a GC > on the driver so that if there happen to be RDD's that have gone out of scope > on the driver side, their associated cache blocks will be cleaned up on > executors and free up memory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2595) The driver run garbage collection, when the executor throws OutOfMemoryError exception
Guoqiang Li created SPARK-2595: -- Summary: The driver run garbage collection, when the executor throws OutOfMemoryError exception Key: SPARK-2595 URL: https://issues.apache.org/jira/browse/SPARK-2595 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Guoqiang Li The [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implementation of GC-based cleaning only considers the memory usage of the driver. We should consider more factors to trigger GC, including executor exit code, task exceptions, and task GC time. [~pwendell]'s proposal: if we detect memory pressure on the executors we should try to trigger a GC on the driver so that if there happen to be RDD's that have gone out of scope on the driver side, their associated cache blocks will be cleaned up on executors and free up memory. -- This message was sent by Atlassian JIRA (v6.2#6252)
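The idea being debated, reacting to executor-side memory pressure by nudging a GC on the driver, could be sketched as below. Every name here is hypothetical (this is not the SPARK-1103 code, and Spark has no such `maybeTriggerDriverGc` hook); it only shows the shape of "more factors" like exit codes and task exceptions triggering `System.gc()`.

```scala
// Hypothetical sketch: trigger a driver-side GC when an executor-side signal
// suggests memory pressure, so out-of-scope RDDs' cache blocks get cleaned.
sealed trait ExecutorSignal
case class ExitCode(code: Int) extends ExecutorSignal
case class TaskException(e: Throwable) extends ExecutorSignal

def maybeTriggerDriverGc(signal: ExecutorSignal): Boolean = signal match {
  case ExitCode(137)                      => System.gc(); true  // process was OOM-killed
  case TaskException(_: OutOfMemoryError) => System.gc(); true
  case _                                  => false              // no pressure detected
}

assert(maybeTriggerDriverGc(TaskException(new OutOfMemoryError())))
assert(!maybeTriggerDriverGc(ExitCode(0)))
```

Note `System.gc()` is only a hint to the JVM; the cleanup effect depends on the GC-based cleaner noticing the collected references, which is part of what makes this proposal contentious in the thread above.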
[jira] [Commented] (SPARK-2591) Add config property to disable incremental collection used in Thrift server
[ https://issues.apache.org/jira/browse/SPARK-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067563#comment-14067563 ] Michael Armbrust commented on SPARK-2591: - We should benchmark this and make sure that there is measurable benefit to collecting all of the results at once. I'd like to avoid additional configuration options where possible. > Add config property to disable incremental collection used in Thrift server > --- > > Key: SPARK-2591 > URL: https://issues.apache.org/jira/browse/SPARK-2591 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Lian >Priority: Minor > > {{SparkSQLOperationManager}} uses {{RDD.toLocalIterator}} to collect the > result set one partition at a time. This is useful to avoid OOM when the > result is large, but introduces extra job scheduling costs as each partition > is collected with a separate job. Users may want to disable this when the > result set is expected to be small. -- This message was sent by Atlassian JIRA (v6.2#6252)
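The tradeoff described in the issue can be sketched in plain Python (not Spark API): collecting one partition at a time bounds peak memory but issues one "job" per partition, while a single collect fetches everything at once. The `job_counter` list is a stand-in for Spark's per-job scheduling cost.

```python
def collect_all(partitions):
    """One 'job': materialize every partition's rows at once (OOM risk for large results)."""
    out = []
    for p in partitions:
        out.extend(p)
    return out

def to_local_iterator(partitions, job_counter):
    """One 'job' per partition: lower peak memory, more scheduling overhead."""
    for p in partitions:
        job_counter.append(1)  # stand-in for launching a separate job
        for row in p:
            yield row

jobs = []
parts = [[1, 2], [3], [4, 5, 6]]
rows = list(to_local_iterator(parts, jobs))
# rows == [1, 2, 3, 4, 5, 6]; len(jobs) == 3, i.e. three separate "jobs"
```

For a small expected result set, the single-job path avoids the extra scheduling round-trips, which is exactly why benchmarking the two before adding a config knob makes sense.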
[jira] [Created] (SPARK-2594) Add CACHE TABLE AS SELECT ...
Michael Armbrust created SPARK-2594: --- Summary: Add CACHE TABLE AS SELECT ... Key: SPARK-2594 URL: https://issues.apache.org/jira/browse/SPARK-2594 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file
[ https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2576: Target Version/s: 1.1.0 > slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark > QL query on HDFS CSV file > -- > > Key: SPARK-2576 > URL: https://issues.apache.org/jira/browse/SPARK-2576 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.0.1 > Environment: One Mesos 0.19 master without zookeeper and 4 mesos > slaves. > JDK 1.7.51 and Scala 2.10.4 on all nodes. > HDFS from CDH5.0.3 > Spark version: I tried both with the pre-built CDH5 spark package available > from http://spark.apache.org/downloads.html and by packaging spark with sbt > 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here > http://mesosphere.io/learn/run-spark-on-mesos/ > All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have >Reporter: Svend Vanderveken > > Execution of a SQL query against HDFS systematically throws a class-not-found > exception on slave nodes. 
> (this was originally reported on the user list: > http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html) > Sample code (ran from spark-shell): > {code} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Car(timestamp: Long, objectid: String, isGreen: Boolean) > // I get the same error when pointing to the folder > "hdfs://vm28:8020/test/cardata" > val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0") > val cars = data.map(_.split(",")).map ( ar => Car(ar(0).toLong, ar(1), > ar(2).toBoolean)) > cars.registerAsTable("mcars") > val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = > true") > allgreens.collect.take(10).foreach(println) > {code} > Stack trace on the slave nodes: > {code} > I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0 > I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave > 20140714-142853-485682442-5050-25487-2 > 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as > executor ID 20140714-142853-485682442-5050-25487-2 > 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: > mesos,mnubohadoop > 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(mesos, > mnubohadoop) > 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started > 14/07/16 13:01:17 INFO Remoting: Starting remoting > 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://spark@vm23:38230] > 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: > [akka.tcp://spark@vm23:38230] > 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: > akka.tcp://spark@vm28:41632/user/MapOutputTracker > 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: > akka.tcp://spark@vm28:41632/user/BlockManagerMaster > 
14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at > /tmp/spark-local-20140716130117-8ea0 > 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 > MB. > 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id > = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501) > 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager > 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager > 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is > /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633 > 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server > 14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28:33973 > 14/07/16 13:01:18 INFO Executor: Running task ID 2 > 14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0 > 14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with > curMem=0, maxMem=309225062 > 14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to > memory (estimated size 122.6 KB, free 294.8 MB) > 14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took > 0.294602722 s > 14/07/16 13:01:19 INFO HadoopRDD: Input split: > hdfs://vm28:8020/test/cardata/part-0:23960450+23960451 > I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown > 14/07/16 13:01:20 ERROR Executor: Exception in task ID 2 > java.lang.NoClassDefFoundError: $line11/$read$ > at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19) > at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19) > at scala.collection.Iterator$$anon$11.next(
[jira] [Resolved] (SPARK-2591) Add config property to disable incremental collection used in Thrift server
[ https://issues.apache.org/jira/browse/SPARK-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2591. - Resolution: Duplicate > Add config property to disable incremental collection used in Thrift server > --- > > Key: SPARK-2591 > URL: https://issues.apache.org/jira/browse/SPARK-2591 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Lian >Priority: Minor > > {{SparkSQLOperationManager}} uses {{RDD.toLocalIterator}} to collect the > result set one partition at a time. This is useful to avoid OOM when the > result is large, but introduces extra job scheduling costs as each partition > is collected with a separate job. Users may want to disable this when the > result set is expected to be small. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark
[ https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067540#comment-14067540 ] Helena Edelson commented on SPARK-2593: --- I should note that I'd be happy to do the changes. I am a committer to Akka Cluster. > Add ability to pass an existing Akka ActorSystem into Spark > --- > > Key: SPARK-2593 > URL: https://issues.apache.org/jira/browse/SPARK-2593 > Project: Spark > Issue Type: Brainstorming > Components: Spark Core >Reporter: Helena Edelson > > As a developer I want to pass an existing ActorSystem into StreamingContext > at load time so that I do not have 2 actor systems running on a node. > This would mean having Spark's actor system on its own named dispatchers as > well as exposing the currently private creation of its own actor system. > If it makes sense... > I would like to create an Akka Extension that wraps around Spark/Spark > Streaming and Cassandra. So, for a user, creation would simply be > val extension = SparkCassandra(system) > and using it is as easy as: > import extension._ > spark. // do work or, > streaming. // do work > > and all config comes from reference.conf and user overrides of that. > The conf file would pick up settings from the deployed environment first, > then fall back to -D, with a final fallback to configured settings. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark
Helena Edelson created SPARK-2593: - Summary: Add ability to pass an existing Akka ActorSystem into Spark Key: SPARK-2593 URL: https://issues.apache.org/jira/browse/SPARK-2593 Project: Spark Issue Type: Brainstorming Components: Spark Core Reporter: Helena Edelson As a developer I want to pass an existing ActorSystem into StreamingContext at load time so that I do not have 2 actor systems running on a node. This would mean having Spark's actor system on its own named dispatchers as well as exposing the currently private creation of its own actor system. If it makes sense... I would like to create an Akka Extension that wraps around Spark/Spark Streaming and Cassandra. So, for a user, creation would simply be val extension = SparkCassandra(system) and using it is as easy as: import extension._ spark. // do work or, streaming. // do work and all config comes from reference.conf and user overrides of that. The conf file would pick up settings from the deployed environment first, then fall back to -D, with a final fallback to configured settings. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067512#comment-14067512 ] Sean Owen commented on SPARK-2420: -- https://github.com/srowen/spark/commit/f111393131008b72f641233ee9f5cb6f6cb4ff10 In terms of rectifying compile errors, downgrading to Guava 11 is straightforward. There is one non-trivial change. Previously, takeOrdered had used Guava's Ordering.leastOf(Iterator,int) to take the k smallest elements from an Iterator, then add those k to a BoundedPriorityQueue. This method is not available in Guava 11. However, it does not seem necessary to select the smallest k before putting them into a priority queue bounded to size k; the result is the same, if I understand correctly. Staring at the code, I think Guava's optimization makes the whole process O(n + k log k) where n is the number of elements in the iterator, whereas the straightforward approach is O(n log k). I'd imagine the straightforward approach wins for small k, even. Not sure if there is some history on this particular choice. > Change Spark build to minimize library conflicts > > > Key: SPARK-2420 > URL: https://issues.apache.org/jira/browse/SPARK-2420 > Project: Spark > Issue Type: Wish > Components: Build >Affects Versions: 1.0.0 >Reporter: Xuefu Zhang > Attachments: spark_1.0.0.patch > > > During the prototyping of HIVE-7292, many library conflicts showed up because > the Spark build contains versions of libraries that are vastly different from the > current major Hadoop version. It would be nice if we can choose versions > that are in line with Hadoop, or shade them in the assembly. Here is the wish > list: > 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 > 2. Shade Spark's jetty and servlet dependency in the assembly. > 3. guava version difference. Spark is using a higher version. I'm not sure > what's the best solution for this. > The list may grow as HIVE-7292 proceeds. 
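The "straightforward approach" Sean describes, pushing each of the n elements through a priority queue bounded to size k for O(n log k) total, can be sketched in Python with `heapq` as a stand-in for Spark's BoundedPriorityQueue (this is an illustration of the technique, not the actual Spark code):

```python
import heapq

def take_ordered(it, k):
    """Return the k smallest elements of `it` in ascending order.

    Keeps a max-heap of at most k elements (simulated by negating values,
    since heapq is a min-heap), so each of the n inputs costs O(log k):
    O(n log k) overall, versus Guava leastOf's O(n + k log k) pre-selection.
    """
    heap = []  # negated values: -heap[0] is the largest of the k smallest so far
    for x in it:
        if len(heap) < k:
            heapq.heappush(heap, -x)
        elif -heap[0] > x:  # x beats the current k-th smallest
            heapq.heapreplace(heap, -x)
    return sorted(-v for v in heap)

# take_ordered([5, 1, 4, 2, 8, 3], 3) -> [1, 2, 3]
```

Skipping the leastOf pre-pass gives the same result set either way; the difference is only the constant-factor/asymptotic tradeoff Sean notes.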
> For information only, the attached is a patch that we applied on Spark in > order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1997) Update breeze to version 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067504#comment-14067504 ] Guoqiang Li commented on SPARK-1997: Sorry for the late reply. The breeze 0.8.1 jar has {{6916}} files. Dependency changes in breeze 0.8.1: ||changes ||packages||license|| |additional|org.scalamacros:quasiquotes_2.10:2.0.0-M8|BSD-like| |additional|com.typesafe.scala-logging:scala-logging-slf4j_2.10:2.1.2|Apache 2.0| |remove|com.typesafe:scalalogging-slf4j_2.10:1.0.1|Apache 2.0| |upgrade|org.scalanlp:breeze-macros_2.10:0.7.4|Apache 2.0| > Update breeze to version 0.8.1 > -- > > Key: SPARK-1997 > URL: https://issues.apache.org/jira/browse/SPARK-1997 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Guoqiang Li >Assignee: Guoqiang Li > > {{breeze 0.7}} does not support {{scala 2.11}} . -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.
[ https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067503#comment-14067503 ] William Benton commented on SPARK-2226: --- [~rxin], yes, and I'm mostly done. I'll post a PR soon! > HAVING should be able to contain aggregate expressions that don't appear in > the aggregation list. > -- > > Key: SPARK-2226 > URL: https://issues.apache.org/jira/browse/SPARK-2226 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: William Benton > > https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q > This test file contains the following query: > {code} > SELECT key FROM src GROUP BY key HAVING max(value) > "val_255"; > {code} > Once we fixed this issue, we should whitelist having.q. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2552) Stabilize the computation of logistic function in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067449#comment-14067449 ] Xiangrui Meng commented on SPARK-2552: -- PR: https://github.com/apache/spark/pull/1493 > Stabilize the computation of logistic function in pyspark > - > > Key: SPARK-2552 > URL: https://issues.apache.org/jira/browse/SPARK-2552 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Reporter: Xiangrui Meng > Labels: Starter > > exp(1000) throws an error in python. For the logistic function, we can use either > 1 / ( 1 + exp( -x ) ) or 1 - 1 / (1 + exp( x ) ) to compute its value, ensuring > that exp always takes a non-positive argument. -- This message was sent by Atlassian JIRA (v6.2#6252)
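The trick in the issue description can be checked directly: `math.exp(1000)` raises OverflowError in Python, while branching on the sign of x keeps exp's argument non-positive so neither branch can overflow. A minimal sketch (not the PR's code):

```python
import math

def logistic(x):
    """Numerically stable logistic: exp only ever sees a non-positive argument."""
    if x > 0:
        return 1.0 / (1.0 + math.exp(-x))
    else:
        return 1.0 - 1.0 / (1.0 + math.exp(x))
```

For large |x| the exp underflows harmlessly to 0.0, so `logistic(1000)` is 1.0 and `logistic(-1000)` is 0.0 to double precision, where the naive `1 / (1 + math.exp(-x))` would raise for x = -1000.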
[jira] [Commented] (SPARK-1997) Update breeze to version 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067450#comment-14067450 ] Xiangrui Meng commented on SPARK-1997: -- PR: https://github.com/apache/spark/pull/940 > Update breeze to version 0.8.1 > -- > > Key: SPARK-1997 > URL: https://issues.apache.org/jira/browse/SPARK-1997 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Guoqiang Li >Assignee: Guoqiang Li > > {{breeze 0.7}} does not support {{scala 2.11}} . -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2495) Ability to re-create ML models
[ https://issues.apache.org/jira/browse/SPARK-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067445#comment-14067445 ] Xiangrui Meng commented on SPARK-2495: -- I sent out a PR for linear models: https://github.com/apache/spark/pull/1492 . For MatrixFactorizationModel, one thing we are not sure about is the type of the IDs. But we definitely should make those constructors available in v1.1. > Ability to re-create ML models > -- > > Key: SPARK-2495 > URL: https://issues.apache.org/jira/browse/SPARK-2495 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.0.1 >Reporter: Alexander Albul >Assignee: Alexander Albul > > Hi everyone. > Previously (prior to Spark 1.0) we were working with MLlib like this: > 1) Calculate model (costly operation) > 2) Take model and collect its fields, like weights, intercept, etc. > 3) Store model somewhere in our format > 4) Do predictions by loading model attributes, creating new model and > predicting using it. > Now I see that the models' constructors have a *private* modifier and cannot be > called from outside. > If you want to hide implementation details and keep this constructor as > "developer api", why not at least create a method that takes weights and an > intercept (for example) and materializes the model? > A good example of the model that I am talking about is: *LinearRegressionModel* > I know that the *LinearRegressionWithSGD* class has a *createModel* method, but the > problem is that it has a *protected* modifier as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
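The store-then-rebuild workflow described in the issue only needs a public `(weights, intercept)` constructor. A plain-Python sketch of the idea (this is not MLlib's API; the class here is a hypothetical stand-in for Scala's LinearRegressionModel):

```python
class LinearRegressionModel:
    """Minimal stand-in showing why a public (weights, intercept) constructor
    is enough to persist and later re-create a trained linear model."""

    def __init__(self, weights, intercept):
        self.weights = weights
        self.intercept = intercept

    def predict(self, features):
        # dot(weights, features) + intercept
        return sum(w * f for w, f in zip(self.weights, features)) + self.intercept

# 1) train (costly) ... 2) persist the fields in your own format ...
# 3) later, rebuild the model without re-training:
stored = {"weights": [2.0, -1.0], "intercept": 0.5}
model = LinearRegressionModel(stored["weights"], stored["intercept"])
# model.predict([3.0, 4.0]) == 2*3.0 - 1*4.0 + 0.5 == 2.5
```

With the constructor private and `createModel` protected, step 3 onward is impossible from user code, which is the gap the linked PR addresses.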
[jira] [Commented] (SPARK-2552) Stabilize the computation of logistic function in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067419#comment-14067419 ] Xiangrui Meng commented on SPARK-2552: -- It is not necessary to check the ranges because exp never underflows on a negative number. So the function is just {code}
import math

def logistic(x):
    if x > 0:
        return 1 / (1 + math.exp(-x))
    else:
        return 1 - 1 / (1 + math.exp(x))
{code} > Stabilize the computation of logistic function in pyspark > - > > Key: SPARK-2552 > URL: https://issues.apache.org/jira/browse/SPARK-2552 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Reporter: Xiangrui Meng > Labels: Starter > > exp(1000) throws an error in python. For the logistic function, we can use either > 1 / ( 1 + exp( -x ) ) or 1 - 1 / (1 + exp( x ) ) to compute its value, ensuring > that exp always takes a non-positive argument. -- This message was sent by Atlassian JIRA (v6.2#6252)