[jira] [Assigned] (SPARK-12455) Add ExpressionDescription to window functions
[ https://issues.apache.org/jira/browse/SPARK-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12455: Assignee: Apache Spark (was: Herman van Hovell) > Add ExpressionDescription to window functions > - > > Key: SPARK-12455 > URL: https://issues.apache.org/jira/browse/SPARK-12455 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12465: Assignee: Apache Spark > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode.
[jira] [Assigned] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12465: Assignee: (was: Apache Spark) > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode.
[jira] [Assigned] (SPARK-12457) Add ExpressionDescription to collection functions
[ https://issues.apache.org/jira/browse/SPARK-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12457: Assignee: (was: Apache Spark) > Add ExpressionDescription to collection functions > - > > Key: SPARK-12457 > URL: https://issues.apache.org/jira/browse/SPARK-12457 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai
[jira] [Assigned] (SPARK-12457) Add ExpressionDescription to collection functions
[ https://issues.apache.org/jira/browse/SPARK-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12457: Assignee: Apache Spark > Add ExpressionDescription to collection functions > - > > Key: SPARK-12457 > URL: https://issues.apache.org/jira/browse/SPARK-12457 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark
[jira] [Commented] (SPARK-12464) Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url
[ https://issues.apache.org/jira/browse/SPARK-12464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067045#comment-15067045 ] Apache Spark commented on SPARK-12464: -- User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/10057 > Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url > -- > > Key: SPARK-12464 > URL: https://issues.apache.org/jira/browse/SPARK-12464 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.zookeeper.url and use existing configuration > spark.deploy.zookeeper.url for Mesos cluster mode.
[jira] [Assigned] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12463: Assignee: (was: Apache Spark) > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task >Reporter: Timothy Chen > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode.
[jira] [Resolved] (SPARK-12321) JSON format for logical/physical execution plans
[ https://issues.apache.org/jira/browse/SPARK-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12321. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10311 [https://github.com/apache/spark/pull/10311] > JSON format for logical/physical execution plans > > > Key: SPARK-12321 > URL: https://issues.apache.org/jira/browse/SPARK-12321 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan > Fix For: 2.0.0
[jira] [Resolved] (SPARK-12398) Smart truncation of DataFrame / Dataset toString
[ https://issues.apache.org/jira/browse/SPARK-12398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12398. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10373 [https://github.com/apache/spark/pull/10373] > Smart truncation of DataFrame / Dataset toString > > > Key: SPARK-12398 > URL: https://issues.apache.org/jira/browse/SPARK-12398 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > Labels: starter > Fix For: 2.0.0 > > > When a DataFrame or Dataset has a long schema, we should intelligently > truncate to avoid flooding the screen with unreadable information.
> {code}
> // Standard output
> [a: int, b: int]
> // Truncate many top level fields
> [a: int, b: string ... 10 more fields]
> // Truncate long inner structs
> [a: struct]
> {code}
[jira] [Assigned] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12463: Assignee: Apache Spark > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode.
[jira] [Commented] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067059#comment-15067059 ] Apache Spark commented on SPARK-12465: -- User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/10057 > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode.
[jira] [Assigned] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
[ https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12468: Assignee: (was: Apache Spark) > getParamMap in Pyspark ML API returns empty dictionary in example for > Documentation > --- > > Key: SPARK-12468 > URL: https://issues.apache.org/jira/browse/SPARK-12468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Zachary Brown >Priority: Minor > > The `extractParamMap()` method for a model that has been fit returns an empty > dictionary, e.g. (from the [Pyspark ML API > Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)):
> ```python
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.param import Param, Params
>
> # Prepare training data from a list of (label, features) tuples.
> training = sqlContext.createDataFrame([
>     (1.0, Vectors.dense([0.0, 1.1, 0.1])),
>     (0.0, Vectors.dense([2.0, 1.0, -1.0])),
>     (0.0, Vectors.dense([2.0, 1.3, 1.0])),
>     (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])
>
> # Create a LogisticRegression instance. This instance is an Estimator.
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> # Print out the parameters, documentation, and any default values.
> print "LogisticRegression parameters:\n" + lr.explainParams() + "\n"
>
> # Learn a LogisticRegression model. This uses the parameters stored in lr.
> model1 = lr.fit(training)
>
> # Since model1 is a Model (i.e., a transformer produced by an Estimator),
> # we can view the parameters it used during fit(). This prints the
> # parameter (name: value) pairs, where names are unique IDs for this
> # LogisticRegression instance.
> print "Model 1 was fit using parameters: "
> print model1.extractParamMap()
> ```
[jira] [Created] (SPARK-12454) Add ExpressionDescription to expressions registered in FunctionRegistry
Yin Huai created SPARK-12454: Summary: Add ExpressionDescription to expressions registered in FunctionRegistry Key: SPARK-12454 URL: https://issues.apache.org/jira/browse/SPARK-12454 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai ExpressionDescription is an annotation that contains the doc of a function; when users run {{describe function}}, they see the doc defined in this annotation. You can take a look at {{Upper}} as an example. However, we still have lots of expressions that do not have an ExpressionDescription. It would be great to go through the expressions registered in FunctionRegistry and add ExpressionDescription to those that do not have it.
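[Editor's sketch of what the ticket asks for. The annotation and its `usage`/`extended` fields follow the {{Upper}} example in `org.apache.spark.sql.catalyst.expressions`; the doc strings below are illustrative, not the actual Spark docs, and the snippet elides the expression's implementation.]

```scala
// Sketch only: ExpressionDescription is a Spark-internal annotation whose
// text is surfaced by `DESCRIBE FUNCTION` for expressions registered in
// FunctionRegistry. The strings here are made up for illustration.
@ExpressionDescription(
  usage = "_FUNC_(str) - Returns str with all characters changed to uppercase.",
  extended = "> SELECT _FUNC_('SparkSql');\n 'SPARKSQL'")
case class Upper(child: Expression)
  extends UnaryExpression { /* eval/codegen elided */ }
```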
[jira] [Created] (SPARK-12455) Add ExpressionDescription to window functions
Yin Huai created SPARK-12455: Summary: Add ExpressionDescription to window functions Key: SPARK-12455 URL: https://issues.apache.org/jira/browse/SPARK-12455 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Herman van Hovell
[jira] [Commented] (SPARK-12362) Create a full-fledged built-in SQL parser
[ https://issues.apache.org/jira/browse/SPARK-12362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066864#comment-15066864 ] Nong Li commented on SPARK-12362: - I think it makes sense to inline the hive ql parser into spark sql. This satisfies the requirements in a pretty good way. It is maximally HiveQL compatible and what the existing spark sql integration is built on. The parser uses antlr and looks to be easy to extend going forward. Inlining it would involve taking some of the existing code in the hive.ql.parse package, restricting it to the code that deals with parsing and not semantic analysis. > Create a full-fledged built-in SQL parser > - > > Key: SPARK-12362 > URL: https://issues.apache.org/jira/browse/SPARK-12362 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > Spark currently has two SQL parsers it is using: a simple one based on Scala > parser combinator, and another one based on Hive. > Neither is a good long term solution. The parser combinator one has bad error > messages for users and does not warn when there are conflicts in the defined > grammar. The Hive one depends directly on Hive itself, and as a result, it is > very difficult to introduce new grammar or fix bugs. > The goal of the ticket is to create a single SQL query parser that is > powerful enough to replace the existing ones. The requirements for the new > parser are: > 1. Can support almost all of HiveQL > 2. Can support all existing SQL parser built using Scala parser combinators > 3. Can be used for expression parsing in addition to SQL query parsing > 4. Can provide good error messages for incorrect syntax > Rather than building one from scratch, we should investigate whether we can > leverage existing open source projects such as Hive (by inlining the parser > part) or Calcite. 
[jira] [Commented] (SPARK-12457) Add ExpressionDescription to collection functions
[ https://issues.apache.org/jira/browse/SPARK-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066892#comment-15066892 ] Xiao Li commented on SPARK-12457: - Let me pick this one? : ) > Add ExpressionDescription to collection functions > - > > Key: SPARK-12457 > URL: https://issues.apache.org/jira/browse/SPARK-12457 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai
[jira] [Commented] (SPARK-12396) Once the driver client has registered successfully, it still retries connecting to the master
[ https://issues.apache.org/jira/browse/SPARK-12396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067023#comment-15067023 ] Apache Spark commented on SPARK-12396: -- User 'echoTomei' has created a pull request for this issue: https://github.com/apache/spark/pull/10407 > Once the driver client has registered successfully, it still retries connecting > to the master > - > > Key: SPARK-12396 > URL: https://issues.apache.org/jira/browse/SPARK-12396 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2 >Reporter: echo >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > As described in AppClient.scala, once the driver connects to a master > successfully, all scheduling work and Futures will be cancelled. But > currently it still tries to connect to the master, which should not happen.
[jira] [Updated] (SPARK-12430) Temporary folders do not get deleted after Task completes causing problems with disk space.
[ https://issues.apache.org/jira/browse/SPARK-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fede Bar updated SPARK-12430: - Component/s: Spark Core > Temporary folders do not get deleted after Task completes causing problems > with disk space. > --- > > Key: SPARK-12430 > URL: https://issues.apache.org/jira/browse/SPARK-12430 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2 > Environment: Ubuntu server >Reporter: Fede Bar > > We are experiencing an issue with automatic /tmp folder deletion after > framework completes. Completing a M/R job using Spark 1.5.2 (same behavior as > Spark 1.5.1) over Mesos will not delete some temporary folders causing free > disk space on server to exhaust. > Behavior of M/R job using Spark 1.4.1 over Mesos cluster: > - Launched using spark-submit on one cluster node. > - Following folders are created: */tmp/mesos/slaves/id#* , */tmp/spark-#/* , > */tmp/spark-#/blockmgr-#* > - When task is completed */tmp/spark-#/* gets deleted along with > */tmp/spark-#/blockmgr-#* sub-folder. > Behavior of M/R job using Spark 1.5.2 over Mesos cluster (same identical job): > - Launched using spark-submit on one cluster node. > - Following folders are created: */tmp/mesos/mesos/slaves/id** * , > */tmp/spark-***/ * ,{color:red} /tmp/blockmgr-***{color} > - When task is completed */tmp/spark-***/ * gets deleted but NOT shuffle > container folder {color:red} /tmp/blockmgr-***{color} > Unfortunately, {color:red} /tmp/blockmgr-***{color} can account for several > GB depending on the job that ran. Over time this causes disk space to become > full with consequences that we all know. > Running a shell script would probably work but it is difficult to identify > folders in use by a running M/R or stale folders. I did notice similar issues > opened by other users marked as "resolved", but none seems to exactly match > the above behavior. > I really hope someone has insights on how to fix it. 
> Thank you very much!
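[Editor's note: the report above says a cleanup script is hard to write because in-use folders are indistinguishable from stale ones. A minimal hedged sketch follows; it assumes `spark.local.dir` points at `/tmp` (the default) and uses directory age alone as a heuristic, so anything belonging to a still-running job must be excluded by hand before deleting.]

```shell
# stale_blockmgr DIR DAYS: print blockmgr-* directories directly under DIR
# whose mtime is older than DAYS. It only lists candidates; review the
# output before removing anything, since age is a heuristic.
stale_blockmgr() {
  find "$1" -maxdepth 1 -type d -name 'blockmgr-*' -mtime +"$2" -print
}

# Example: list candidates under the default spark.local.dir, older than 2 days
stale_blockmgr /tmp 2
```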
[jira] [Updated] (SPARK-12430) Temporary folders do not get deleted after Task completes causing problems with disk space.
[ https://issues.apache.org/jira/browse/SPARK-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fede Bar updated SPARK-12430: - Fix Version/s: (was: 1.4.1) > Temporary folders do not get deleted after Task completes causing problems > with disk space. > --- > > Key: SPARK-12430 > URL: https://issues.apache.org/jira/browse/SPARK-12430 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2 > Environment: Ubuntu server >Reporter: Fede Bar > > We are experiencing an issue with automatic /tmp folder deletion after > framework completes. Completing a M/R job using Spark 1.5.2 (same behavior as > Spark 1.5.1) over Mesos will not delete some temporary folders causing free > disk space on server to exhaust. > Behavior of M/R job using Spark 1.4.1 over Mesos cluster: > - Launched using spark-submit on one cluster node. > - Following folders are created: */tmp/mesos/slaves/id#* , */tmp/spark-#/* , > */tmp/spark-#/blockmgr-#* > - When task is completed */tmp/spark-#/* gets deleted along with > */tmp/spark-#/blockmgr-#* sub-folder. > Behavior of M/R job using Spark 1.5.2 over Mesos cluster (same identical job): > - Launched using spark-submit on one cluster node. > - Following folders are created: */tmp/mesos/mesos/slaves/id** * , > */tmp/spark-***/ * ,{color:red} /tmp/blockmgr-***{color} > - When task is completed */tmp/spark-***/ * gets deleted but NOT shuffle > container folder {color:red} /tmp/blockmgr-***{color} > Unfortunately, {color:red} /tmp/blockmgr-***{color} can account for several > GB depending on the job that ran. Over time this causes disk space to become > full with consequences that we all know. > Running a shell script would probably work but it is difficult to identify > folders in use by a running M/R or stale folders. I did notice similar issues > opened by other users marked as "resolved", but none seems to exactly match > the above behavior. 
> I really hope someone has insights on how to fix it. > Thank you very much!
[jira] [Resolved] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12453. --- Resolution: Duplicate > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to the wrong > AWS Java SDK version (1.9.16) referenced with the AWS KCL version (1.3.0). > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 fails to get data out of the stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a Spark > related implementation.
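[Editor's sketch of the workaround the report implies for a user's own build: pin the AWS SDK to the version the report says KCL 1.3.0 expects. The Maven coordinates below are assumptions, not verified against the kinesis-asl pom; adjust to your build.]

```xml
<!-- Hypothetical override for an application pom: force the aws-java-sdk
     version (1.9.37, per the report) that matches amazon-kinesis-client 1.3.0,
     instead of the 1.9.16 pulled in transitively. -->
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk</artifactId>
  <version>1.9.37</version>
</dependency>
```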
[jira] [Created] (SPARK-12462) Add ExpressionDescription to misc non-aggregate functions
Yin Huai created SPARK-12462: Summary: Add ExpressionDescription to misc non-aggregate functions Key: SPARK-12462 URL: https://issues.apache.org/jira/browse/SPARK-12462 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai
[jira] [Updated] (SPARK-12430) Temporary folders do not get deleted after Task completes causing problems with disk space.
[ https://issues.apache.org/jira/browse/SPARK-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fede Bar updated SPARK-12430: - Component/s: (was: Spark Submit) (was: Shuffle) (was: Block Manager) > Temporary folders do not get deleted after Task completes causing problems > with disk space. > --- > > Key: SPARK-12430 > URL: https://issues.apache.org/jira/browse/SPARK-12430 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2 > Environment: Ubuntu server >Reporter: Fede Bar > > We are experiencing an issue with automatic /tmp folder deletion after > framework completes. Completing a M/R job using Spark 1.5.2 (same behavior as > Spark 1.5.1) over Mesos will not delete some temporary folders causing free > disk space on server to exhaust. > Behavior of M/R job using Spark 1.4.1 over Mesos cluster: > - Launched using spark-submit on one cluster node. > - Following folders are created: */tmp/mesos/slaves/id#* , */tmp/spark-#/* , > */tmp/spark-#/blockmgr-#* > - When task is completed */tmp/spark-#/* gets deleted along with > */tmp/spark-#/blockmgr-#* sub-folder. > Behavior of M/R job using Spark 1.5.2 over Mesos cluster (same identical job): > - Launched using spark-submit on one cluster node. > - Following folders are created: */tmp/mesos/mesos/slaves/id** * , > */tmp/spark-***/ * ,{color:red} /tmp/blockmgr-***{color} > - When task is completed */tmp/spark-***/ * gets deleted but NOT shuffle > container folder {color:red} /tmp/blockmgr-***{color} > Unfortunately, {color:red} /tmp/blockmgr-***{color} can account for several > GB depending on the job that ran. Over time this causes disk space to become > full with consequences that we all know. > Running a shell script would probably work but it is difficult to identify > folders in use by a running M/R or stale folders. I did notice similar issues > opened by other users marked as "resolved", but none seems to exactly match > the above behavior. 
> I really hope someone has insights on how to fix it. > Thank you very much!
[jira] [Commented] (SPARK-12464) Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url
[ https://issues.apache.org/jira/browse/SPARK-12464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067062#comment-15067062 ] Andrew Or commented on SPARK-12464: --- By the way for future reference you probably don't need a separate issue for each config. Just have an issue that says `Remove spark.deploy.mesos.* and use spark.deploy.* instead`. Since you already opened these we can just keep them. > Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url > -- > > Key: SPARK-12464 > URL: https://issues.apache.org/jira/browse/SPARK-12464 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.zookeeper.url and use existing configuration > spark.deploy.zookeeper.url for Mesos cluster mode.
[jira] [Created] (SPARK-12457) Add ExpressionDescription to collection functions
Yin Huai created SPARK-12457: Summary: Add ExpressionDescription to collection functions Key: SPARK-12457 URL: https://issues.apache.org/jira/browse/SPARK-12457 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai
[jira] [Assigned] (SPARK-12464) Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url
[ https://issues.apache.org/jira/browse/SPARK-12464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12464: Assignee: Apache Spark > Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url > -- > > Key: SPARK-12464 > URL: https://issues.apache.org/jira/browse/SPARK-12464 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.zookeeper.url and use existing configuration > spark.deploy.zookeeper.url for Mesos cluster mode.
[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066857#comment-15066857 ] Timothy Hunter commented on SPARK-12247: Thanks for working on it, [~BenFradet]! > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS
[jira] [Commented] (SPARK-12362) Create a full-fledged built-in SQL parser
[ https://issues.apache.org/jira/browse/SPARK-12362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066883#comment-15066883 ] Reynold Xin commented on SPARK-12362: - +1 > Create a full-fledged built-in SQL parser > - > > Key: SPARK-12362 > URL: https://issues.apache.org/jira/browse/SPARK-12362 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > Spark currently has two SQL parsers it is using: a simple one based on Scala > parser combinator, and another one based on Hive. > Neither is a good long term solution. The parser combinator one has bad error > messages for users and does not warn when there are conflicts in the defined > grammar. The Hive one depends directly on Hive itself, and as a result, it is > very difficult to introduce new grammar or fix bugs. > The goal of the ticket is to create a single SQL query parser that is > powerful enough to replace the existing ones. The requirements for the new > parser are: > 1. Can support almost all of HiveQL > 2. Can support all existing SQL parser built using Scala parser combinators > 3. Can be used for expression parsing in addition to SQL query parsing > 4. Can provide good error messages for incorrect syntax > Rather than building one from scratch, we should investigate whether we can > leverage existing open source projects such as Hive (by inlining the parser > part) or Calcite.
[jira] [Created] (SPARK-12458) Add ExpressionDescription to datetime functions
Yin Huai created SPARK-12458: Summary: Add ExpressionDescription to datetime functions Key: SPARK-12458 URL: https://issues.apache.org/jira/browse/SPARK-12458 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12459) Add ExpressionDescription to string functions
Yin Huai created SPARK-12459: Summary: Add ExpressionDescription to string functions Key: SPARK-12459 URL: https://issues.apache.org/jira/browse/SPARK-12459 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12429) Update documentation to show how to use accumulators and broadcasts with Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12429: Assignee: Apache Spark (was: Shixiong Zhu) > Update documentation to show how to use accumulators and broadcasts with > Spark Streaming > > > Key: SPARK-12429 > URL: https://issues.apache.org/jira/browse/SPARK-12429 > Project: Spark > Issue Type: Documentation > Components: Documentation, Streaming >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Accumulators and broadcasts used with Spark Streaming do not work reliably when > restarting after driver failures. We need to add examples to guide users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
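One common way to make such variables survive a driver restart from a checkpoint is a lazily-instantiated singleton that re-creates the accumulator on first use instead of capturing it in a checkpointed closure. A minimal plain-Python sketch of the pattern (the `FakeContext` class and `get_counter` helper are illustrative stand-ins, not Spark API):

```python
# Sketch of the lazily-instantiated singleton pattern for accumulators/
# broadcasts with checkpointed streaming. FakeContext stands in for a
# SparkContext so the example runs without Spark.

class FakeContext:
    """Stand-in for SparkContext; only here to show the pattern."""
    def accumulator(self, initial):
        return {"value": initial}

_counter = None

def get_counter(ctx):
    # Re-create the accumulator lazily after a (re)start; every caller
    # then shares the same instance instead of a checkpointed copy.
    global _counter
    if _counter is None:
        _counter = ctx.accumulator(0)
    return _counter

ctx = FakeContext()
assert get_counter(ctx) is get_counter(ctx)  # one shared instance
```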
[jira] [Created] (SPARK-12461) Add ExpressionDescription to math functions
Yin Huai created SPARK-12461: Summary: Add ExpressionDescription to math functions Key: SPARK-12461 URL: https://issues.apache.org/jira/browse/SPARK-12461 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12460) Add ExpressionDescription to aggregate functions
Yin Huai created SPARK-12460: Summary: Add ExpressionDescription to aggregate functions Key: SPARK-12460 URL: https://issues.apache.org/jira/browse/SPARK-12460 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark
Irakli Machabeli created SPARK-12467: Summary: Get rid of sorting in Row's constructor in pyspark Key: SPARK-12467 URL: https://issues.apache.org/jira/browse/SPARK-12467 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.2 Reporter: Irakli Machabeli Priority: Minor The current implementation of Row's __new__ sorts columns by name. First of all, there is no obvious reason to sort; second, if one converts a dataframe to an rdd and then back to a dataframe, the order of columns changes. While this is not a bug, it nevertheless makes looking at the data really inconvenient.
def __new__(self, *args, **kwargs):
    if args and kwargs:
        raise ValueError("Can not use both args "
                         "and kwargs to create Row")
    if args:
        # create row class or objects
        return tuple.__new__(self, args)
    elif kwargs:
        # create row objects
        names = sorted(kwargs.keys())  # just get rid of sorting here!!!
        row = tuple.__new__(self, [kwargs[n] for n in names])
        row.__fields__ = names
        return row
    else:
        raise ValueError("No args or kwargs")
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
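A standalone sketch of the Row.__new__ behavior described above (no pyspark needed; `MiniRow` is a toy stand-in for pyspark.sql.Row): because kwargs are sorted by name, the declared order (b, a) comes back as (a, b).

```python
# Toy reproduction of the sorting behavior the reporter objects to.
class MiniRow(tuple):
    def __new__(cls, **kwargs):
        names = sorted(kwargs.keys())  # the sort in question
        row = tuple.__new__(cls, [kwargs[n] for n in names])
        row.__fields__ = names
        return row

r = MiniRow(b=1, a=2)
print(r.__fields__)  # ['a', 'b'] -- declared order (b, a) was lost
print(tuple(r))      # (2, 1)
```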
[jira] [Commented] (SPARK-12231) Failed to generate predicate Error when using dropna
[ https://issues.apache.org/jira/browse/SPARK-12231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067033#comment-15067033 ] Apache Spark commented on SPARK-12231: -- User 'kevinyu98' has created a pull request for this issue: https://github.com/apache/spark/pull/10388 > Failed to generate predicate Error when using dropna > > > Key: SPARK-12231 > URL: https://issues.apache.org/jira/browse/SPARK-12231 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.2, 1.6.0 > Environment: python version: 2.7.9 > os: ubuntu 14.04 >Reporter: yahsuan, chang > > Code to reproduce the error: > # write.py > {code} > import pyspark > sc = pyspark.SparkContext() > sqlc = pyspark.SQLContext(sc) > df = sqlc.range(10) > df1 = df.withColumn('a', df['id'] * 2) > df1.write.partitionBy('id').parquet('./data') > {code} > # read.py > {code} > import pyspark > sc = pyspark.SparkContext() > sqlc = pyspark.SQLContext(sc) > df2 = sqlc.read.parquet('./data') > df2.dropna().count() > {code} > $ spark-submit write.py > $ spark-submit read.py > # error message > {code} > 15/12/08 17:20:34 ERROR Filter: Failed to generate predicate, fallback to > interpreted org.apache.spark.sql.catalyst.errors.package$TreeNodeException: > Binding attribute, tree: a#0L > ... > {code} > If the data is written without partitionBy, the error won't happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
Zachary Brown created SPARK-12468: - Summary: getParamMap in Pyspark ML API returns empty dictionary in example for Documentation Key: SPARK-12468 URL: https://issues.apache.org/jira/browse/SPARK-12468 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.5.2 Reporter: Zachary Brown Priority: Minor The `extractParamMap()` method for a model that has been fit returns an empty dictionary, e.g. (from the [Pyspark ML API Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)): ```python from pyspark.mllib.linalg import Vectors from pyspark.ml.classification import LogisticRegression from pyspark.ml.param import Param, Params # Prepare training data from a list of (label, features) tuples. training = sqlContext.createDataFrame([ (1.0, Vectors.dense([0.0, 1.1, 0.1])), (0.0, Vectors.dense([2.0, 1.0, -1.0])), (0.0, Vectors.dense([2.0, 1.3, 1.0])), (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) # Create a LogisticRegression instance. This instance is an Estimator. lr = LogisticRegression(maxIter=10, regParam=0.01) # Print out the parameters, documentation, and any default values. print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" # Learn a LogisticRegression model. This uses the parameters stored in lr. model1 = lr.fit(training) # Since model1 is a Model (i.e., a transformer produced by an Estimator), # we can view the parameters it used during fit(). # This prints the parameter (name: value) pairs, where names are unique IDs for this # LogisticRegression instance. print "Model 1 was fit using parameters: " print model1.extractParamMap() ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12463: Assignee: (was: Apache Spark) > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task >Reporter: Timothy Chen > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12463: Assignee: Apache Spark > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12455) Add ExpressionDescription to window functions
[ https://issues.apache.org/jira/browse/SPARK-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12455: Assignee: Herman van Hovell (was: Apache Spark) > Add ExpressionDescription to window functions > - > > Key: SPARK-12455 > URL: https://issues.apache.org/jira/browse/SPARK-12455 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Herman van Hovell > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
Timothy Chen created SPARK-12465: Summary: Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir Key: SPARK-12465 URL: https://issues.apache.org/jira/browse/SPARK-12465 Project: Spark Issue Type: Task Components: Mesos Reporter: Timothy Chen Remove spark.deploy.mesos.zookeeper.dir and use existing configuration spark.deploy.zookeeper.dir for Mesos cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12464) Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url
Timothy Chen created SPARK-12464: Summary: Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url Key: SPARK-12464 URL: https://issues.apache.org/jira/browse/SPARK-12464 Project: Spark Issue Type: Task Components: Mesos Reporter: Timothy Chen Remove spark.deploy.mesos.zookeeper.url and use existing configuration spark.deploy.zookeeper.url for Mesos cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12331) R^2 for regression through the origin
[ https://issues.apache.org/jira/browse/SPARK-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12331: Assignee: Apache Spark > R^2 for regression through the origin > - > > Key: SPARK-12331 > URL: https://issues.apache.org/jira/browse/SPARK-12331 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Imran Younus >Assignee: Apache Spark >Priority: Minor > > The value of R^2 (coefficient of determination) obtained from > LinearRegressionModel is not consistent with R and statsmodels when > fitIntercept is false, i.e., regression through the origin. In this case, both > R and statsmodels use the definition of R^2 given by eq(4') in the following > review paper: > https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf > Here is the definition from this paper: > R^2 = \sum \hat{y}_i^2 / \sum y_i^2 > The paper also describes why this should be the case. I've double checked > that the values of R^2 from statsmodels and R are consistent with this > definition. On the other hand, scikit-learn doesn't use the above definition. > I would recommend using the above definition in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
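A quick numeric check of the through-the-origin definition quoted above, R^2 = sum(yhat_i^2) / sum(y_i^2), for a no-intercept least-squares fit (pure Python; the data is made up for illustration):

```python
# Through-the-origin R^2 per the quoted definition. For a no-intercept
# least-squares line, the slope is b = sum(x*y) / sum(x*x).
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.9]

b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
yhat = [b * xi for xi in x]

# R^2 = sum(yhat^2) / sum(y^2), no mean-centering when there is no intercept.
r2 = sum(v * v for v in yhat) / sum(v * v for v in y)
print(round(r2, 4))  # -> 0.9977
```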
[jira] [Assigned] (SPARK-12331) R^2 for regression through the origin
[ https://issues.apache.org/jira/browse/SPARK-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12331: Assignee: (was: Apache Spark) > R^2 for regression through the origin > - > > Key: SPARK-12331 > URL: https://issues.apache.org/jira/browse/SPARK-12331 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Imran Younus >Priority: Minor > > The value of R^2 (coefficient of determination) obtained from > LinearRegressionModel is not consistent with R and statsmodels when > fitIntercept is false, i.e., regression through the origin. In this case, both > R and statsmodels use the definition of R^2 given by eq(4') in the following > review paper: > https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf > Here is the definition from this paper: > R^2 = \sum \hat{y}_i^2 / \sum y_i^2 > The paper also describes why this should be the case. I've double checked > that the values of R^2 from statsmodels and R are consistent with this > definition. On the other hand, scikit-learn doesn't use the above definition. > I would recommend using the above definition in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)
[ https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nong Li updated SPARK-12394: Attachment: BucketedTables.pdf Here is a design for how we can support bucketed tables. > Support writing out pre-hash-partitioned data and exploit that in join > optimizations to avoid shuffle (i.e. bucketing in Hive) > -- > > Key: SPARK-12394 > URL: https://issues.apache.org/jira/browse/SPARK-12394 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: BucketedTables.pdf > > > In many cases users know ahead of time the columns that they will be joining > or aggregating on. Ideally they should be able to leverage this information > and pre-shuffle the data so that subsequent queries do not require a shuffle. > Hive supports this functionality by allowing the user to define buckets, > which are hash partitions of the data based on some key. > - Allow the user to specify a set of columns when caching or writing out data > - Allow the user to specify some parallelism > - Shuffle the data when writing / caching such that it is distributed by these > columns > - When planning/executing a query, use this distribution to avoid another > shuffle when reading, assuming the join or aggregation is compatible with the > columns specified > - Should work with existing save modes: append, overwrite, etc. > - Should work at least with all Hadoop FS data sources > - Should work with any data source when caching -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
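The core idea behind bucketing can be sketched in plain Python (this is an illustration of the technique, not Spark or Hive API): if both sides of a join are pre-partitioned by hash(key) % n into the same number of buckets, the join only needs to pair bucket i with bucket i, so neither side is re-shuffled at read time.

```python
# Illustrative bucket-wise join: equal keys land in equal bucket indices
# on both sides, so no cross-bucket data movement is needed.
N_BUCKETS = 4

def bucketize(rows, n=N_BUCKETS):
    """Hash-partition (key, value) rows into n buckets by key."""
    buckets = [[] for _ in range(n)]
    for key, value in rows:
        buckets[hash(key) % n].append((key, value))
    return buckets

left = bucketize([("a", 1), ("b", 2), ("c", 3)])
right = bucketize([("a", 10), ("c", 30)])

joined = []
for lb, rb in zip(left, right):  # bucket i joins only bucket i
    rmap = dict(rb)
    joined.extend((k, (v, rmap[k])) for k, v in lb if k in rmap)

print(sorted(joined))  # [('a', (1, 10)), ('c', (3, 30))]
```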
[jira] [Commented] (SPARK-12279) Requesting a HBase table with kerberos is not working
[ https://issues.apache.org/jira/browse/SPARK-12279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067020#comment-15067020 ] Y Bodnar commented on SPARK-12279: -- Hi [~pbeauvois], it's odd that there are no messages related to HBase tokens. According to yarn.Client, the message "Attempting to fetch HBase security token." should appear. {code:title=Client.scala|borderStyle=solid} def obtainTokenForHBase(conf: Configuration, credentials: Credentials): Unit = { if (UserGroupInformation.isSecurityEnabled) { val mirror = universe.runtimeMirror(getClass.getClassLoader) try { val confCreate = mirror.classLoader. loadClass("org.apache.hadoop.hbase.HBaseConfiguration"). getMethod("create", classOf[Configuration]) val obtainToken = mirror.classLoader. loadClass("org.apache.hadoop.hbase.security.token.TokenUtil"). getMethod("obtainToken", classOf[Configuration]) logDebug("Attempting to fetch HBase security token.") {code} I would suggest trying two things: 1. Check UserGroupInformation.isSecurityEnabled from your code. If it's false, then no attempt to obtain a token is made. 2. Print the HBaseConfiguration and check security-related options (like hbase.security.authentication) to see whether they're properly set and the hbase-site.xml you provide is actually applied. > Requesting a HBase table with kerberos is not working > - > > Key: SPARK-12279 > URL: https://issues.apache.org/jira/browse/SPARK-12279 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.2 > Environment: Spark 1.5.2 / HBase 1.1.2 / Hadoop 2.7.1 / Zookeeper > 3.4.5 / Authentication done through Kerberos >Reporter: Pierre Beauvois > > I can't read an HBase table with Spark 1.5.2.
> I added the option "spark.driver.extraClassPath" in the spark-defaults.conf > which contains the HBASE_CONF_DIR as below: > spark.driver.extraClassPath = /opt/application/Hbase/current/conf/ > On the driver, I started spark-shell (I was running it in yarn-client mode) > {code} > [my_user@uabigspark01 ~]$ spark-shell -v --name HBaseTest --jars > /opt/application/Hbase/current/lib/hbase-common-1.1.2.jar,/opt/application/Hbase/current/lib/hbase-server-1.1.2.jar,/opt/application/Hbase/current/lib/hbase-client-1.1.2.jar,/opt/application/Hbase/current/lib/hbase-protocol-1.1.2.jar,/opt/application/Hbase/current/lib/protobuf-java-2.5.0.jar,/opt/application/Hbase/current/lib/htrace-core-3.1.0-incubating.jar,/opt/application/Hbase/current/lib/hbase-annotations-1.1.2.jar,/opt/application/Hbase/current/lib/guava-12.0.1.jar > {code} > Then I ran the following lines: > {code} > scala> import org.apache.spark._ > import org.apache.spark._ > scala> import org.apache.spark.rdd.NewHadoopRDD > import org.apache.spark.rdd.NewHadoopRDD > scala> import org.apache.hadoop.fs.Path > import org.apache.hadoop.fs.Path > scala> import org.apache.hadoop.hbase.util.Bytes > import org.apache.hadoop.hbase.util.Bytes > scala> import org.apache.hadoop.hbase.HColumnDescriptor > import org.apache.hadoop.hbase.HColumnDescriptor > scala> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor} > import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor} > scala> import org.apache.hadoop.hbase.client.{HBaseAdmin, Put, HTable, Result} > import org.apache.hadoop.hbase.client.{HBaseAdmin, Put, HTable, Result} > scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat > import org.apache.hadoop.hbase.mapreduce.TableInputFormat > scala> import org.apache.hadoop.hbase.io.ImmutableBytesWritable > import org.apache.hadoop.hbase.io.ImmutableBytesWritable > scala> val conf = HBaseConfiguration.create() > conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, > yarn-site.xml, hdfs-default.xml, hdfs-site.xml, hbase-default.xml, > hbase-site.xml > scala> conf.addResource(new > Path("/opt/application/Hbase/current/conf/hbase-site.xml")) > scala> conf.set("hbase.zookeeper.quorum", "FQDN1:2181,FQDN2:2181,FQDN3:2181") > scala> conf.set(TableInputFormat.INPUT_TABLE, "user:noheader") > scala> val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], > classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], > classOf[org.apache.hadoop.hbase.client.Result]) > 2015-12-09 15:17:58,890 INFO [main] storage.MemoryStore: > ensureFreeSpace(266248) called with curMem=0, maxMem=556038881 > 2015-12-09 15:17:58,892 INFO [main] storage.MemoryStore: Block broadcast_0 > stored as values in memory (estimated size 260.0 KB, free 530.0 MB) > 2015-12-09 15:17:59,196 INFO [main] storage.MemoryStore: > ensureFreeSpace(32808) called with curMem=266248, maxMem=556038881 > 2015-12-09 15:17:59,197 INFO [main]
[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067021#comment-15067021 ] Benjamin Fradet commented on SPARK-12247: - Ok thanks, I'll rework the examples accordingly. > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
[ https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067080#comment-15067080 ] Zachary Brown commented on SPARK-12468: --- Found a possible fix for this by modifying the `_fit()` method of the JavaEstimator class in `python/pyspark/ml/wrapper.py` to update the paramMap of the returned model. Created a pull request for it here: https://github.com/apache/spark/pull/10419 > getParamMap in Pyspark ML API returns empty dictionary in example for > Documentation > --- > > Key: SPARK-12468 > URL: https://issues.apache.org/jira/browse/SPARK-12468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Zachary Brown >Priority: Minor > > The `extractParamMap()` method for a model that has been fit returns an empty > dictionary, e.g. (from the [Pyspark ML API > Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)): > ```python > from pyspark.mllib.linalg import Vectors > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.param import Param, Params > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). 
> # This prints the parameter (name: value) pairs, where names are unique IDs > for this > # LogisticRegression instance. > print "Model 1 was fit using parameters: " > print model1.extractParamMap() > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12452) Add exception details to TaskCompletionListener/TaskContext
[ https://issues.apache.org/jira/browse/SPARK-12452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neelesh Shastry updated SPARK-12452: Component/s: (was: Streaming) Spark Core > Add exception details to TaskCompletionListener/TaskContext > --- > > Key: SPARK-12452 > URL: https://issues.apache.org/jira/browse/SPARK-12452 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Neelesh Shastry >Priority: Minor > > TaskCompletionListeners are called without success/failure details. > If we change this > {code} > trait TaskCompletionListener extends EventListener { > def onTaskCompletion(context: TaskContext) > } > class TaskContextImpl { > > private[spark] def markTaskCompleted(throwable: Option[Throwable]): Unit > > listener.onTaskCompletion(this, throwable) > } > {code} > to something like > {code} > trait TaskCompletionListener extends EventListener { > def onTaskCompletion(context: TaskContext, throwable: Option[Throwable] = None) > } > {code} > .. and in Task.scala > {code} > var throwable: Option[Throwable] = None > try { > runTask(context) > } catch { > case t: Throwable => throwable = Some(t) > } finally { > context.markTaskCompleted(throwable) > TaskContext.unset() > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
Timothy Chen created SPARK-12463: Summary: Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode Key: SPARK-12463 URL: https://issues.apache.org/jira/browse/SPARK-12463 Project: Spark Issue Type: Task Reporter: Timothy Chen Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode configuration for cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12453: Assignee: Apache Spark > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Assignee: Apache Spark >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to the wrong AWS > Java SDK version (1.9.16) being referenced with AWS KCL version 1.3.0. > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 fails to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a > Spark-related implementation issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066852#comment-15066852 ] Apache Spark commented on SPARK-12453: -- User 'Schadix' has created a pull request for this issue: https://github.com/apache/spark/pull/10416 > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to the wrong AWS > Java SDK version (1.9.16) being referenced with AWS KCL version 1.3.0. > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 fails to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a > Spark-related implementation issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066854#comment-15066854 ] Timothy Hunter commented on SPARK-12247: If we could import all the code that builds the ratings dataframe {{val ratings = sc.textFile(params.ratings).map(Rating.parseRating).cache()}}, that would be ideal. > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12453: Assignee: (was: Apache Spark) > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to the wrong AWS > Java SDK version (1.9.16) being referenced with AWS KCL version 1.3.0. > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 fails to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a > Spark-related implementation issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12466) Harmless Master NPE in tests
[ https://issues.apache.org/jira/browse/SPARK-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12466: Assignee: Apache Spark (was: Andrew Or) > Harmless Master NPE in tests > > > Key: SPARK-12466 > URL: https://issues.apache.org/jira/browse/SPARK-12466 > Project: Spark > Issue Type: Bug > Components: Deploy, Tests >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Apache Spark > Fix For: 1.6.1, 2.0.0 > > > {code} > [info] ReplayListenerSuite: > [info] - Simple replay (58 milliseconds) > java.lang.NullPointerException > at > org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982) > at > org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980) > at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) > at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > at scala.concurrent.Promise$class.complete(Promise.scala:55) > at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > [info] - End-to-end replay (10 seconds, 755 milliseconds) > {code} > 
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull > caused by https://github.com/apache/spark/pull/10284 > Thanks to [~ted_yu] for reporting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12396) Once the driver client has registered successfully, it still retries connecting to the master.
[ https://issues.apache.org/jira/browse/SPARK-12396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12396: Assignee: Apache Spark > Once the driver client has registered successfully, it still retries connecting to the > master. > - > > Key: SPARK-12396 > URL: https://issues.apache.org/jira/browse/SPARK-12396 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2 >Reporter: echo >Assignee: Apache Spark >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > As described in AppClient.scala, once the driver connects to a master > successfully, all scheduling work and Futures will be cancelled. However, it > currently still tries to connect to the master, which should not happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12396) Once the driver client has registered successfully, it still retries connecting to the master.
[ https://issues.apache.org/jira/browse/SPARK-12396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12396: Assignee: (was: Apache Spark) > Once the driver client has registered successfully, it still retries connecting to the > master. > - > > Key: SPARK-12396 > URL: https://issues.apache.org/jira/browse/SPARK-12396 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2 >Reporter: echo >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > As described in AppClient.scala, once the driver connects to a master > successfully, all scheduling work and Futures will be cancelled. However, it > currently still tries to connect to the master, which should not happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12465: Assignee: Apache Spark > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067065#comment-15067065 ] Apache Spark commented on SPARK-12463: -- User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/10057 > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task >Reporter: Timothy Chen > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12465: Assignee: (was: Apache Spark) > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12456) Add ExpressionDescription to misc functions
Yin Huai created SPARK-12456: Summary: Add ExpressionDescription to misc functions Key: SPARK-12456 URL: https://issues.apache.org/jira/browse/SPARK-12456 Project: Spark Issue Type: Sub-task Reporter: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12454) Add ExpressionDescription to expressions that are registered in FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-12454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12454: - Description: ExpressionDescription is an annotation that contains the doc of a function; when users run {{describe function}}, they can see the doc defined in this annotation. You can take a look at {{Upper}} as an example. However, we still have lots of expressions that do not have an ExpressionDescription. It would be great to go through the expressions registered in FunctionRegistry and add ExpressionDescription to those that do not have it. A list of expressions (and their categories) registered in the function registry can be found at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L117-L296. was: ExpressionDescription is a annotation that contains doc of a function and when users use {{describe function}}, users can see the doc defined in this annotation. You can take a look at {{Upper}} as an example. However, we still have lots of expression that do not have ExpressionDescription. It will be great to take a look at expressions registered in FunctionRegistry and add ExpressionDescription to those expression that do not have it.. > Add ExpressionDescription to expressions that are registered in FunctionRegistry > --- > > Key: SPARK-12454 > URL: https://issues.apache.org/jira/browse/SPARK-12454 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai > > ExpressionDescription is an annotation that contains the doc of a function; > when users run {{describe function}}, they can see the doc defined in this > annotation. You can take a look at {{Upper}} as an example. > However, we still have lots of expressions that do not have an > ExpressionDescription. It would be great to go through the expressions > registered in FunctionRegistry and add ExpressionDescription to those > that do not have it. 
> A list of expressions (and their categories) registered in function registry > can be found at > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L117-L296. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12331) R^2 for regression through the origin
[ https://issues.apache.org/jira/browse/SPARK-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066898#comment-15066898 ] Apache Spark commented on SPARK-12331: -- User 'iyounus' has created a pull request for this issue: https://github.com/apache/spark/pull/10384 > R^2 for regression through the origin > - > > Key: SPARK-12331 > URL: https://issues.apache.org/jira/browse/SPARK-12331 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Imran Younus >Priority: Minor > > The value of R^2 (coefficient of determination) obtained from > LinearRegressionModel is not consistent with R and statsmodels when the > fitIntercept is false i.e., regression through the origin. In this case, both > R and statsmodels use the definition of R^2 given by eq(4') in the following > review paper: > https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf > Here is the definition from this paper: > R^2 = \sum(\hat{y}_i^2)/\sum(y_i^2) > The paper also describes why this should be the case. I've double checked > that the value of R^2 from statsmodels and R are consistent with this > definition. On the other hand, scikit-learn doesn't use the above definition. > I would recommend using the above definition in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
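[Editor's note] The definition cited in SPARK-12331 is simple to compute directly. A minimal sketch of it in plain Python (illustrative only; the function name and data are made up, and this is not Spark's implementation):

```python
# R^2 for regression through the origin, per the definition cited above:
# R^2 = sum(yhat_i^2) / sum(y_i^2), where yhat_i are the fitted values.
def r2_through_origin(fitted, observed):
    return sum(v * v for v in fitted) / sum(v * v for v in observed)

# A perfect fit through the origin gives R^2 = 1.0
r2 = r2_through_origin([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

Note there is no centering term: unlike the ordinary R^2, the sums are not taken around the mean, which is what makes this definition appropriate when no intercept is fit.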
[jira] [Assigned] (SPARK-12466) Harmless Master NPE in tests
[ https://issues.apache.org/jira/browse/SPARK-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12466: Assignee: Andrew Or (was: Apache Spark) > Harmless Master NPE in tests > > > Key: SPARK-12466 > URL: https://issues.apache.org/jira/browse/SPARK-12466 > Project: Spark > Issue Type: Bug > Components: Deploy, Tests >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 1.6.1, 2.0.0 > > > {code} > [info] ReplayListenerSuite: > [info] - Simple replay (58 milliseconds) > java.lang.NullPointerException > at > org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982) > at > org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980) > at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) > at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > at scala.concurrent.Promise$class.complete(Promise.scala:55) > at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > [info] - End-to-end replay (10 seconds, 755 milliseconds) > {code} > 
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull > caused by https://github.com/apache/spark/pull/10284 > Thanks to [~ted_yu] for reporting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12321) JSON format for logical/physical execution plans
[ https://issues.apache.org/jira/browse/SPARK-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-12321: - Assignee: Wenchen Fan > JSON format for logical/physical execution plans > > > Key: SPARK-12321 > URL: https://issues.apache.org/jira/browse/SPARK-12321 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12465: Assignee: (was: Apache Spark) > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12469) Consistent Accumulators for Spark
holdenk created SPARK-12469: --- Summary: Consistent Accumulators for Spark Key: SPARK-12469 URL: https://issues.apache.org/jira/browse/SPARK-12469 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: holdenk Tasks executed on Spark workers are unable to modify values from the driver, and accumulators are the one exception for this. Accumulators in Spark are implemented in such a way that when a stage is recomputed (say for cache eviction) the accumulator will be updated a second time. This makes accumulators inside of transformations more difficult to use for things like counting invalid records (one of the primary potential use cases of collecting side information during a transformation). However in some cases this counting during re-evaluation is exactly the behaviour we want (say in tracking total execution time for a particular function). Spark would benefit from a version of accumulators which did not double count even if stages were re-executed. Motivating example: {code} val parseTime = sc.accumulator(0L) val parseFailures = sc.accumulator(0L) val parsedData = sc.textFile(...).flatMap { line => val start = System.currentTimeMillis() val parsed = Try(parse(line)) if (parsed.isFailure) parseFailures += 1 parseTime += System.currentTimeMillis() - start parsed.toOption } parsedData.cache() val resultA = parsedData.map(...).filter(...).count() // some intervening code. Almost anything could happen here -- some of parsedData may // get kicked out of the cache, or an executor where data was cached might get lost val resultB = parsedData.filter(...).map(...).flatMap(...).count() // now we look at the accumulators {code} Here we would want parseFailures to only have been added to once for every line which failed to parse. Unfortunately, the current Spark accumulator API doesn’t support the current parseFailures use case since if some data had been evicted its possible that it will be double counted. 
See the full design document at https://docs.google.com/document/d/1lR_l1g3zMVctZXrcVjFusq2iQVpr4XvRK_UUDsDr6nk/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
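[Editor's note] The double-counting described in SPARK-12469 can be reproduced without Spark: any side-effecting counter inside a transformation is re-applied whenever the stage is re-evaluated. A minimal sketch in plain Python (names are illustrative, not Spark APIs; the second call stands in for a recompute after cache eviction):

```python
# A counter updated inside a "transformation", like the parseFailures
# accumulator in the motivating example above.
parse_failures = 0

def parse_stage(lines):
    global parse_failures
    out = []
    for line in lines:
        try:
            out.append(int(line))
        except ValueError:
            parse_failures += 1  # this side effect repeats on every re-evaluation
    return out

data = ["1", "oops", "3"]
parse_stage(data)  # first evaluation: parse_failures is 1
parse_stage(data)  # a recompute: now 2, even though only one line is invalid
```

A "consistent" accumulator, as proposed, would have to deduplicate updates per (stage, partition) so the second evaluation contributes nothing.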
[jira] [Commented] (SPARK-12451) Regexp functions don't support patterns containing '*/'
[ https://issues.apache.org/jira/browse/SPARK-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066840#comment-15066840 ] Xiao Li commented on SPARK-12451: - This is a duplicate of https://issues.apache.org/jira/browse/SPARK-11352 The problem has been resolved. You can get the fix in 1.5.3 and 1.6 Thanks! > Regexp functions don't support patterns containing '*/' > --- > > Key: SPARK-12451 > URL: https://issues.apache.org/jira/browse/SPARK-12451 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: William Dee > > When using the regexp functions in Spark SQL, patterns containing '*/' create > runtime errors in the auto generated code. This is due to the fact that the > code generator creates a multiline comment containing, amongst other things, > the pattern. > Here is an excerpt from my stacktrace to illustrate: (Helpfully, the stack > trace includes all of the auto-generated code) > {code} > Caused by: org.codehaus.commons.compiler.CompileException: Line 232, Column > 54: Unexpected token "," in primary > at org.codehaus.janino.Parser.compileException(Parser.java:3125) > at org.codehaus.janino.Parser.parsePrimary(Parser.java:2512) > at org.codehaus.janino.Parser.parseUnaryExpression(Parser.java:2252) > at > org.codehaus.janino.Parser.parseMultiplicativeExpression(Parser.java:2211) > at org.codehaus.janino.Parser.parseAdditiveExpression(Parser.java:2190) > at org.codehaus.janino.Parser.parseShiftExpression(Parser.java:2169) > at > org.codehaus.janino.Parser.parseRelationalExpression(Parser.java:2072) > at org.codehaus.janino.Parser.parseEqualityExpression(Parser.java:2046) > at org.codehaus.janino.Parser.parseAndExpression(Parser.java:2025) > at > org.codehaus.janino.Parser.parseExclusiveOrExpression(Parser.java:2004) > at > org.codehaus.janino.Parser.parseInclusiveOrExpression(Parser.java:1983) > at > org.codehaus.janino.Parser.parseConditionalAndExpression(Parser.java:1962) > at > 
org.codehaus.janino.Parser.parseConditionalOrExpression(Parser.java:1941) > at > org.codehaus.janino.Parser.parseConditionalExpression(Parser.java:1922) > at > org.codehaus.janino.Parser.parseAssignmentExpression(Parser.java:1901) > at org.codehaus.janino.Parser.parseExpression(Parser.java:1886) > at org.codehaus.janino.Parser.parseBlockStatement(Parser.java:1149) > at org.codehaus.janino.Parser.parseBlockStatements(Parser.java:1085) > at > org.codehaus.janino.Parser.parseMethodDeclarationRest(Parser.java:938) > at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:620) > at org.codehaus.janino.Parser.parseClassBody(Parser.java:515) > at org.codehaus.janino.Parser.parseClassDeclarationRest(Parser.java:481) > at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:577) > at > org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:192) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:84) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:77) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:387) > ... line 232 ... > /* regexp_replace(input[46, StringType],^.*/,) */ > > /* input[46, StringType] */ > > boolean isNull31 = i.isNullAt(46); > UTF8String primitive32 = isNull31 ? null : (i.getUTF8String(46)); > > boolean isNull24 = true; > UTF8String primitive25 = null; > if (!isNull31) { > /* ^.*/ */ > > /* expression: ^.*/ */ > Object obj35 = expressions[4].eval(i); > boolean isNull33 = obj35 == null; > UTF8String primitive34 = null; > if (!isNull33) { > primitive34 = (UTF8String) obj35; > } > ... 
> {code} > Note the multiple multiline comments, these obviously break when the regex > pattern contains the end-of-comment token '*/' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
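[Editor's note] The failure mode in SPARK-12451 is easy to see in isolation. A minimal sketch in plain Python of what the generated Java source looks like (illustrative only; `gen_comment` is a made-up stand-in, not the actual Spark code generator):

```python
# The code generator embeds the regex pattern inside a /* ... */ Java comment.
# Any pattern containing "*/" terminates that comment early, leaving the rest
# of the pattern behind as stray tokens that the Java parser then rejects.
def gen_comment(pattern):
    return "/* regexp_replace(input, %s) */" % pattern

generated = gen_comment("^.*/")
# The first "*/" now sits inside the pattern, not at the comment's end.
first_close = generated.index("*/")
intended_close = len(generated) - 2
```

This is why the stack trace above shows janino choking on an "Unexpected token" right where the pattern's `*/` prematurely closed the comment.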
[jira] [Assigned] (SPARK-12455) Add ExpressionDescription to window functions
[ https://issues.apache.org/jira/browse/SPARK-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12455: Assignee: Herman van Hovell (was: Apache Spark) > Add ExpressionDescription to window functions > - > > Key: SPARK-12455 > URL: https://issues.apache.org/jira/browse/SPARK-12455 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Herman van Hovell > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12466) Harmless Master NPE in tests
Andrew Or created SPARK-12466: - Summary: Harmless Master NPE in tests Key: SPARK-12466 URL: https://issues.apache.org/jira/browse/SPARK-12466 Project: Spark Issue Type: Bug Components: Deploy, Tests Affects Versions: 1.6.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.6.1, 2.0.0 {code} [info] ReplayListenerSuite: [info] - Simple replay (58 milliseconds) java.lang.NullPointerException at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982) at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) at scala.concurrent.Promise$class.complete(Promise.scala:55) at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) [info] - End-to-end replay (10 seconds, 755 milliseconds) {code} https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull caused by https://github.com/apache/spark/pull/10284 Thanks to [~ted_yu] for reporting. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12457) Add ExpressionDescription to collection functions
[ https://issues.apache.org/jira/browse/SPARK-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067022#comment-15067022 ] Apache Spark commented on SPARK-12457: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/10418 > Add ExpressionDescription to collection functions > - > > Key: SPARK-12457 > URL: https://issues.apache.org/jira/browse/SPARK-12457 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12247: Assignee: (was: Apache Spark) > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12247: Assignee: Apache Spark > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter >Assignee: Apache Spark > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
[ https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12468: Assignee: Apache Spark > getParamMap in Pyspark ML API returns empty dictionary in example for > Documentation > --- > > Key: SPARK-12468 > URL: https://issues.apache.org/jira/browse/SPARK-12468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Zachary Brown >Assignee: Apache Spark >Priority: Minor > > The `extractParamMap()` method for a model that has been fit returns an empty > dictionary, e.g. (from the [Pyspark ML API > Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)): > ```python > from pyspark.mllib.linalg import Vectors > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.param import Param, Params > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique IDs > for this > # LogisticRegression instance. 
> print "Model 1 was fit using parameters: " > print model1.extractParamMap() > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5882) Add a test for GraphLoader.edgeListFile
[ https://issues.apache.org/jira/browse/SPARK-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-5882. -- Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Add a test for GraphLoader.edgeListFile > --- > > Key: SPARK-5882 > URL: https://issues.apache.org/jira/browse/SPARK-5882 > Project: Spark > Issue Type: Test > Components: GraphX >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Trivial > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12470) Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pete Robbins updated SPARK-12470: - Component/s: SQL Summary: Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner (was: Incorrect calculation of row size in o.a.s.catalyst.expressions.codegen.GenerateUnsafeRowJoiner) > Incorrect calculation of row size in > o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner > --- > > Key: SPARK-12470 > URL: https://issues.apache.org/jira/browse/SPARK-12470 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Pete Robbins >Priority: Minor > > While looking into https://issues.apache.org/jira/browse/SPARK-12319 I > noticed that the row size is incorrectly calculated. > The "sizeReduction" value is calculated in words: >// The number of words we can reduce when we concat two rows together. > // The only reduction comes from merging the bitset portion of the two > rows, saving 1 word. > val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords > but then it is subtracted from the size of the row in bytes: >|out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - > $sizeReduction); > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
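[Editor's note] The units mismatch reported in SPARK-12470 can be illustrated numerically. A minimal sketch in plain Python (field counts and the row size are made up; the `* 8` conversion is the presumed fix, since the saving is one 8-byte word, not one byte):

```python
# UnsafeRow null-tracking bitsets are sized in 8-byte words: one per 64 fields.
def bitset_words(num_fields):
    return (num_fields + 63) // 64

bitset1_words = bitset_words(5)            # 1 word for row 1's bitset
bitset2_words = bitset_words(3)            # 1 word for row 2's bitset
output_bitset_words = bitset_words(5 + 3)  # the joined row still needs only 1 word

# The saving from merging the two bitsets, measured in WORDS
size_reduction = bitset1_words + bitset2_words - output_bitset_words  # = 1

size_in_bytes = 80
buggy = size_in_bytes - size_reduction      # subtracts 1 byte, as the issue describes
fixed = size_in_bytes - size_reduction * 8  # subtracts the full 8-byte word
```

The buggy form leaves the joined row 7 bytes too large per saved word, which matches the symptom of a row size that disagrees with the actual layout.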
[jira] [Commented] (SPARK-11823) HiveThriftBinaryServerSuite tests timing out, leaves hanging processes
[ https://issues.apache.org/jira/browse/SPARK-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067314#comment-15067314 ] Josh Rosen commented on SPARK-11823: It looks like this has caused a huge number of timeouts in the Master Maven Hadoop 2.4 builds this week: https://spark-tests.appspot.com/jobs/Spark-Master-Maven-with-YARN%20%C2%BB%20hadoop-2.4%2Cspark-test I'm going to pull some logs and take a look. > HiveThriftBinaryServerSuite tests timing out, leaves hanging processes > -- > > Key: SPARK-11823 > URL: https://issues.apache.org/jira/browse/SPARK-11823 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: shane knapp > Attachments: > spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-05.out, > stack.log > > > i've noticed on a few branches that the HiveThriftBinaryServerSuite tests > time out, and when that happens, the build is aborted but the tests leave > behind hanging processes that eat up cpu and ram. > most recently, i discovered this happening w/the 1.6 SBT build, specifically > w/the hadoop 2.0 profile: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/56/console > [~vanzin] grabbed the jstack log, which i've attached to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12414) Remove closure serializer
[ https://issues.apache.org/jira/browse/SPARK-12414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12414: -- Issue Type: Sub-task (was: Bug) Parent: SPARK-11806 > Remove closure serializer > - > > Key: SPARK-12414 > URL: https://issues.apache.org/jira/browse/SPARK-12414 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > There is a config `spark.closure.serializer` that accepts exactly one value: > the java serializer. This is because there are currently bugs in the Kryo > serializer that make it not a viable candidate. This was uncovered by an > unsuccessful attempt to make it work: SPARK-7708. > My high level point is that the Java serializer has worked well for at least > 6 Spark versions now, and it is an incredibly complicated task to get other > serializers (not just Kryo) to work with Spark's closures. IMO the effort is > not worth it and we should just remove this documentation and all the code > associated with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12374) Improve performance of Range APIs via adding logical/physical operators
[ https://issues.apache.org/jira/browse/SPARK-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12374. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10335 [https://github.com/apache/spark/pull/10335] > Improve performance of Range APIs via adding logical/physical operators > --- > > Key: SPARK-12374 > URL: https://issues.apache.org/jira/browse/SPARK-12374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Critical > Fix For: 2.0.0 > > > Creating an actual logical/physical operator for range for matching the > performance of RDD Range APIs. > Compared with the old Range API, the new version is 3 times faster than the > old version. > {code} > scala> val startTime = System.currentTimeMillis; sqlContext.oldRange(0, > 10, 1, 15).count(); val endTime = System.currentTimeMillis; val start > = new Timestamp(startTime); val end = new Timestamp(endTime); val elapsed = > (endTime - startTime)/ 1000.0 > startTime: Long = 1450416394240 > > endTime: Long = 1450416421199 > start: java.sql.Timestamp = 2015-12-17 21:26:34.24 > end: java.sql.Timestamp = 2015-12-17 21:27:01.199 > elapsed: Double = 26.959 > {code} > {code} > scala> val startTime = System.currentTimeMillis; sqlContext.range(0, > 10, 1, 15).count(); val endTime = System.currentTimeMillis; val start > = new Timestamp(startTime); val end = new Timestamp(endTime); val elapsed = > (endTime - startTime)/ 1000.0 > startTime: Long = 1450416360107 > > endTime: Long = 1450416368590 > start: java.sql.Timestamp = 2015-12-17 21:26:00.107 > end: java.sql.Timestamp = 2015-12-17 21:26:08.59 > elapsed: Double = 8.483 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12150) numPartitions argument to sqlContext.range() should be optional
[ https://issues.apache.org/jira/browse/SPARK-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12150. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10335 [https://github.com/apache/spark/pull/10335] > numPartitions argument to sqlContext.range() should be optional > > > Key: SPARK-12150 > URL: https://issues.apache.org/jira/browse/SPARK-12150 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Henri DF >Priority: Minor > Fix For: 2.0.0 > > > It's a little inconsistent that the first two sqlContext.range() methods > don't take a numPartitions arg, while the third one does. > And more importantly, it's a little inconvenient that the numPartitions arg > is mandatory for the third range() method - it means that if you want to > specify a step, you suddenly have to think about partitioning - an orthogonal > concern. > My suggestion would be to make numPartitions optional, like it is on the > sparkContext.range(..). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
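One way to get the suggested shape, sketched below with a hypothetical `range`/`RangeSpec` rather than the real SQLContext API, is a Scala default argument for numPartitions so a caller can pass a step without also choosing a partition count:

```scala
// Hypothetical stand-in for the result of range(); not actual Spark code.
final case class RangeSpec(start: Long, end: Long, step: Long, numPartitions: Int)

// numPartitions defaults like sparkContext.range(..) does with numSlices,
// so specifying a step no longer forces the caller to think about partitioning.
def range(start: Long,
          end: Long,
          step: Long = 1L,
          numPartitions: Int = 8 /* stand-in for the default parallelism */): RangeSpec =
  RangeSpec(start, end, step, numPartitions)
```

Callers can then write `range(0, 100, 5)` and still override partitioning explicitly with `range(0, 100, 5, 16)` when it matters.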
[jira] [Assigned] (SPARK-12363) PowerIterationClustering test case failed if we deprecated KMeans.setRuns
[ https://issues.apache.org/jira/browse/SPARK-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12363: Assignee: Apache Spark > PowerIterationClustering test case failed if we deprecated KMeans.setRuns > - > > Key: SPARK-12363 > URL: https://issues.apache.org/jira/browse/SPARK-12363 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > We plan to deprecate `runs` of KMeans; PowerIterationClustering will > leverage KMeans to train its model. > I removed `setRuns` used in PowerIterationClustering, but one of the test > cases failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-11823) HiveThriftBinaryServerSuite tests timing out, leaves hanging processes
[ https://issues.apache.org/jira/browse/SPARK-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067345#comment-15067345 ] Josh Rosen commented on SPARK-11823: I think I spotted the problem; it may be a bad use of Thread.sleep() in a test: https://github.com/apache/spark/pull/6207/files#r30935200 > HiveThriftBinaryServerSuite tests timing out, leaves hanging processes > -- > > Key: SPARK-11823 > URL: https://issues.apache.org/jira/browse/SPARK-11823 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: shane knapp >Assignee: Josh Rosen > Attachments: > spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-05.out, > stack.log > > > i've noticed on a few branches that the HiveThriftBinaryServerSuite tests > time out, and when that happens, the build is aborted but the tests leave > behind hanging processes that eat up cpu and ram. > most recently, i discovered this happening w/the 1.6 SBT build, specifically > w/the hadoop 2.0 profile: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/56/console > [~vanzin] grabbed the jstack log, which i've attached to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
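A common remedy for fixed sleeps in tests is to poll with a deadline instead. The helper below is a generic sketch of that pattern (not the actual suite code, which could equally use ScalaTest's `eventually`): it retries the body until it stops throwing or the deadline passes, so the test neither hangs nor sleeps longer than needed.

```scala
// Generic "eventually" helper: retry a block until it succeeds or a
// deadline expires, instead of a single fixed Thread.sleep.
def eventually[T](timeoutMs: Long, intervalMs: Long = 50L)(body: => T): T = {
  val deadline = System.currentTimeMillis + timeoutMs
  var lastError: Throwable = null
  while (System.currentTimeMillis < deadline) {
    try {
      return body // success: stop polling immediately
    } catch {
      case e: Throwable =>
        lastError = e            // remember why we are still waiting
        Thread.sleep(intervalMs) // brief pause before the next attempt
    }
  }
  // Fail fast with the last observed error instead of hanging the build.
  throw new RuntimeException(s"condition not met within ${timeoutMs}ms", lastError)
}
```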
[jira] [Updated] (SPARK-12473) Reuse serializer instances for performance
[ https://issues.apache.org/jira/browse/SPARK-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12473: -- Description: After commit de02782, the performance of page rank regressed from 242s to 260s, about 7%. Although currently it's only 7%, we will likely register more classes in the future so this will only increase. The commit added 26 types to register every time we create a Kryo serializer instance. I ran a small microbenchmark to prove that this is noticeably expensive: {code} import org.apache.spark.serializer._ import org.apache.spark.SparkConf def makeMany(num: Int): Long = { val start = System.currentTimeMillis (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } System.currentTimeMillis - start } // before commit de02782, averaged over multiple runs makeMany(5000) == 1500 // after commit de02782, averaged over multiple runs makeMany(5000) == 2750 {code} Since we create multiple serializer instances per partition, this means a 5000-partition stage will unconditionally see an increase of > 1s for the stage. In page rank, we may run many such stages. We should explore the alternative of reusing thread-local serializer instances, which would lead to much fewer calls to `kryo.register`. was: After commit de02782, the performance of page rank regressed from 242s to 260s, about 7%. Although currently it's only 7%, we will likely register more classes in the future so we should do this the right way. The commit added 26 types to register every time we create a Kryo serializer instance. 
I ran a small microbenchmark to prove that this is noticeably expensive: {code} import org.apache.spark.serializer._ import org.apache.spark.SparkConf def makeMany(num: Int): Long = { val start = System.currentTimeMillis (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } System.currentTimeMillis - start } // before commit de02782, averaged over multiple runs makeMany(5000) == 1500 // after commit de02782, averaged over multiple runs makeMany(5000) == 2750 {code} Since we create multiple serializer instances per partition, this means a 5000-partition stage will unconditionally see an increase of > 1s for the stage. In page rank, we may run many such stages. We should explore the alternative of reusing thread-local serializer instances, which would lead to much fewer calls to `kryo.register`. > Reuse serializer instances for performance > -- > > Key: SPARK-12473 > URL: https://issues.apache.org/jira/browse/SPARK-12473 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Andrew Or > > After commit de02782, the performance of page rank regressed from 242s to 260s, about 7%. > Although currently it's only 7%, we will likely register more classes in the > future so this will only increase. > The commit added 26 types to register every time we create a Kryo serializer > instance. 
I ran a small microbenchmark to prove that this is noticeably > expensive: > {code} > import org.apache.spark.serializer._ > import org.apache.spark.SparkConf > def makeMany(num: Int): Long = { > val start = System.currentTimeMillis > (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } > System.currentTimeMillis - start > } > // before commit de02782, averaged over multiple runs > makeMany(5000) == 1500 > // after commit de02782, averaged over multiple runs > makeMany(5000) == 2750 > {code} > Since we create multiple serializer instances per partition, this means a > 5000-partition stage will unconditionally see an increase of > 1s for the > stage. In page rank, we may run many such stages. > We should explore the alternative of reusing thread-local serializer > instances, which would lead to much fewer calls to `kryo.register`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
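The proposed reuse could look roughly like the sketch below. `ExpensiveSerializer` is a stand-in for `KryoSerializer.newKryo()` (whose cost here is the repeated `kryo.register` calls); the real change would also have to confirm that Kryo instances are safe to reuse within a thread.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Counts constructions; each one stands in for re-running all 26
// kryo.register calls when a fresh serializer instance is created.
val constructions = new AtomicInteger(0)

class ExpensiveSerializer {
  constructions.incrementAndGet()
}

// One instance per thread, created lazily on first use and then reused,
// so repeated lookups on the same thread pay the registration cost once.
val cachedSerializer = new ThreadLocal[ExpensiveSerializer] {
  override def initialValue(): ExpensiveSerializer = new ExpensiveSerializer
}

def serializerForThisThread(): ExpensiveSerializer = cachedSerializer.get()
```

Repeated calls on the same thread return the same instance, so a 5000-partition stage would pay the registration cost once per executor thread rather than once per partition.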
[jira] [Updated] (SPARK-12440) Avoid setCheckpointDir warning when filesystem is not local
[ https://issues.apache.org/jira/browse/SPARK-12440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12440: -- Priority: Trivial (was: Major) > Avoid setCheckpointDir warning when filesystem is not local > --- > > Key: SPARK-12440 > URL: https://issues.apache.org/jira/browse/SPARK-12440 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2, 1.6.0, 1.6.1 >Reporter: Pierre Borckmans >Priority: Trivial > > In SparkContext method `setCheckpointDir`, a warning is issued when spark > master is not local and the passed directory for the checkpoint dir appears > to be local. > In practice, when relying on hdfs configuration file and using relative path > (incomplete URI without hdfs scheme, ...), this warning should not be issued > and might be confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
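A scheme-aware check along these lines would avoid the spurious warning. This sketch uses `java.net.URI` to stay self-contained; the real SparkContext code would resolve the path through Hadoop's FileSystem, and `shouldWarn` is a hypothetical helper, not the actual method.

```scala
import java.net.URI

// Only warn when the checkpoint dir EXPLICITLY names the local filesystem
// while the master is non-local.
def shouldWarn(master: String, checkpointDir: String): Boolean = {
  val scheme = Option(new URI(checkpointDir).getScheme)
  val isLocalMaster = master.startsWith("local")
  // A bare relative path like "checkpoints" has no scheme: it resolves
  // against the default filesystem from the Hadoop configuration (often
  // HDFS), so we cannot assume it is local and should stay quiet.
  !isLocalMaster && scheme.contains("file")
}
```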
[jira] [Updated] (SPARK-12440) Avoid setCheckpointDir warning when filesystem is not local
[ https://issues.apache.org/jira/browse/SPARK-12440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12440: -- Summary: Avoid setCheckpointDir warning when filesystem is not local (was: [CORE] Avoid setCheckpointDir warning when filesystem is not local) > Avoid setCheckpointDir warning when filesystem is not local > --- > > Key: SPARK-12440 > URL: https://issues.apache.org/jira/browse/SPARK-12440 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2, 1.6.0, 1.6.1 >Reporter: Pierre Borckmans > > In SparkContext method `setCheckpointDir`, a warning is issued when spark > master is not local and the passed directory for the checkpoint dir appears > to be local. > In practice, when relying on hdfs configuration file and using relative path > (incomplete URI without hdfs scheme, ...), this warning should not be issued > and might be confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067125#comment-15067125 ] Martin Schade commented on SPARK-12453: --- Makes sense, thank you. Ideally it should be 1.9.37 instead of 1.9.40, though. Both KCL 1.4.0 and KPL 0.10.1 reference 1.9.37: https://github.com/awslabs/amazon-kinesis-producer/blob/v0.10.1/java/amazon-kinesis-producer/pom.xml https://github.com/awslabs/amazon-kinesis-client/blob/v1.4.0/pom.xml In the latest version of KPL (v0.10.2), 1.10.34 is referenced, and in the latest KCL (1.6.1) it is version 1.10.20, so the versions are not easy to keep in sync. It would take some testing to find which combination actually works. > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to wrong AWS > Java SDK version (1.9.16) referenced with the AWS KCL version (1.3.0). > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 does fail to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a Spark > related implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12440) [CORE] Avoid setCheckpointDir warning when filesystem is not local
[ https://issues.apache.org/jira/browse/SPARK-12440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12440: -- Component/s: Spark Core > [CORE] Avoid setCheckpointDir warning when filesystem is not local > -- > > Key: SPARK-12440 > URL: https://issues.apache.org/jira/browse/SPARK-12440 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2, 1.6.0, 1.6.1 >Reporter: Pierre Borckmans > > In SparkContext method `setCheckpointDir`, a warning is issued when spark > master is not local and the passed directory for the checkpoint dir appears > to be local. > In practice, when relying on hdfs configuration file and using relative path > (incomplete URI without hdfs scheme, ...), this warning should not be issued > and might be confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12339) NullPointerException on stage kill from web UI
[ https://issues.apache.org/jira/browse/SPARK-12339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067154#comment-15067154 ] Andrew Or commented on SPARK-12339: --- I've updated the affected version to 2.0 since SPARK-11206 was merged only there. Please let me know if this is not the case. > NullPointerException on stage kill from web UI > -- > > Key: SPARK-12339 > URL: https://issues.apache.org/jira/browse/SPARK-12339 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Alex Bozarth > Fix For: 2.0.0 > > > The following message is in the logs after killing a stage: > {code} > scala> INFO Executor: Executor killed task 1.0 in stage 7.0 (TID 33) > INFO Executor: Executor killed task 0.0 in stage 7.0 (TID 32) > WARN TaskSetManager: Lost task 1.0 in stage 7.0 (TID 33, localhost): > TaskKilled (killed intentionally) > WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 32, localhost): > TaskKilled (killed intentionally) > INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, > from pool > ERROR LiveListenerBus: Listener SQLListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) > ERROR LiveListenerBus: Listener SQLListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169) > at > 
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) > {code} > To reproduce, start a job and kill the stage from web UI, e.g.: > {code} > val rdd = sc.parallelize(0 to 9, 2) > rdd.mapPartitionsWithIndex { case (n, it) => Thread.sleep(10 * 1000); it > }.count > {code} > Go to web UI and in Stages tab click "kill" for the stage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067156#comment-15067156 ] Sean Owen commented on SPARK-12453: --- Ah, I misread the PR; it already just removes aws.java.sdk.version and the manual management of the dependency. Just deleting the version and the dependencyManagement entry does the trick, right? > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to wrong AWS > Java SDK version (1.9.16) referenced with the AWS KCL version (1.3.0). > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 does fail to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a Spark > related implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12339) NullPointerException on stage kill from web UI
[ https://issues.apache.org/jira/browse/SPARK-12339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12339: -- Affects Version/s: (was: 1.6.0) 2.0.0 > NullPointerException on stage kill from web UI > -- > > Key: SPARK-12339 > URL: https://issues.apache.org/jira/browse/SPARK-12339 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Alex Bozarth > Fix For: 2.0.0 > > > The following message is in the logs after killing a stage: > {code} > scala> INFO Executor: Executor killed task 1.0 in stage 7.0 (TID 33) > INFO Executor: Executor killed task 0.0 in stage 7.0 (TID 32) > WARN TaskSetManager: Lost task 1.0 in stage 7.0 (TID 33, localhost): > TaskKilled (killed intentionally) > WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 32, localhost): > TaskKilled (killed intentionally) > INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, > from pool > ERROR LiveListenerBus: Listener SQLListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) > ERROR LiveListenerBus: Listener SQLListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) > {code} > To reproduce, start a job and kill the stage from web UI, e.g.: > 
{code} > val rdd = sc.parallelize(0 to 9, 2) > rdd.mapPartitionsWithIndex { case (n, it) => Thread.sleep(10 * 1000); it > }.count > {code} > Go to web UI and in Stages tab click "kill" for the stage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12392) Optimize a location order of broadcast blocks by considering preferred local hosts
[ https://issues.apache.org/jira/browse/SPARK-12392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-12392. --- Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Optimize a location order of broadcast blocks by considering preferred local > hosts > -- > > Key: SPARK-12392 > URL: https://issues.apache.org/jira/browse/SPARK-12392 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro > Fix For: 2.0.0 > > > When multiple workers exist in a host, we can bypass unnecessary remote > access for broadcasts; block managers fetch broadcast blocks from the same > host instead of remote hosts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
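The optimization amounts to ordering candidate locations so that same-host replicas come first. A minimal sketch, with `BlockLocation` and `preferLocalHost` as hypothetical stand-ins for the BlockManager internals:

```scala
// Hypothetical model of a block replica's location.
case class BlockLocation(host: String, executorId: String)

// Put locations on the fetching executor's own host first, so block
// managers try same-host peers before any remote host.
def preferLocalHost(locations: Seq[BlockLocation], localHost: String): Seq[BlockLocation] = {
  val (local, remote) = locations.partition(_.host == localHost)
  local ++ remote
}
```

`partition` keeps the relative order within each group, so this only promotes same-host replicas without otherwise reshuffling the fetch order.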
[jira] [Assigned] (SPARK-12456) Add ExpressionDescription to misc functions
[ https://issues.apache.org/jira/browse/SPARK-12456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12456: Assignee: Apache Spark > Add ExpressionDescription to misc functions > --- > > Key: SPARK-12456 > URL: https://issues.apache.org/jira/browse/SPARK-12456 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12456) Add ExpressionDescription to misc functions
[ https://issues.apache.org/jira/browse/SPARK-12456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12456: Assignee: (was: Apache Spark) > Add ExpressionDescription to misc functions > --- > > Key: SPARK-12456 > URL: https://issues.apache.org/jira/browse/SPARK-12456 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12473) Reuse serializer instances for performance
[ https://issues.apache.org/jira/browse/SPARK-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12473: -- Description: After commit de02782, the performance of page rank regressed from 242s to 260s, about 7%. Although currently it's only 7%, we will likely register more classes in the future so we should do this the right way. The commit added 26 types to register every time we create a Kryo serializer instance. I ran a small microbenchmark to prove that this is noticeably expensive: {code} import org.apache.spark.serializer._ import org.apache.spark.SparkConf def makeMany(num: Int): Long = { val start = System.currentTimeMillis (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } System.currentTimeMillis - start } // before commit de02782, averaged over multiple runs makeMany(5000) == 1500 // after commit de02782, averaged over multiple runs makeMany(5000) == 2750 {code} Since we create multiple serializer instances per partition, this means a 5000-partition stage will unconditionally see an increase of > 1s for the stage. In page rank, we may run many such stages. We should explore the alternative of reusing thread-local serializer instances, which would lead to much fewer calls to `kryo.register`. was: After commit de02782, the performance of page rank regressed from 242s to 260s, about 7%. The commit added 26 types to register every time we create a Kryo serializer instance. 
I ran a small microbenchmark to prove that this is noticeably expensive: {code} import org.apache.spark.serializer._ import org.apache.spark.SparkConf def makeMany(num: Int): Long = { val start = System.currentTimeMillis (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } System.currentTimeMillis - start } // before commit de02782, averaged over multiple runs makeMany(5000) == 1500 // after commit de02782, averaged over multiple runs makeMany(5000) == 2750 {code} Since we create multiple serializer instances per partition, this means a 5000-partition stage will unconditionally see an increase of > 1s for the stage. In page rank, we may run many such stages. We should explore the alternative of reusing thread-local serializer instances, which would lead to much fewer calls to `kryo.register`. > Reuse serializer instances for performance > -- > > Key: SPARK-12473 > URL: https://issues.apache.org/jira/browse/SPARK-12473 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Andrew Or > > After commit de02782 of page rank regressed from 242s to 260s, about 7%. > Although currently it's only 7%, we will likely register more classes in the > future so we should do this the right way. > The commit added 26 types to register every time we create a Kryo serializer > instance. 
I ran a small microbenchmark to prove that this is noticeably > expensive: > {code} > import org.apache.spark.serializer._ > import org.apache.spark.SparkConf > def makeMany(num: Int): Long = { > val start = System.currentTimeMillis > (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } > System.currentTimeMillis - start > } > // before commit de02782, averaged over multiple runs > makeMany(5000) == 1500 > // after commit de02782, averaged over multiple runs > makeMany(5000) == 2750 > {code} > Since we create multiple serializer instances per partition, this means a > 5000-partition stage will unconditionally see an increase of > 1s for the > stage. In page rank, we may run many such stages. > We should explore the alternative of reusing thread-local serializer > instances, which would lead to much fewer calls to `kryo.register`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
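The thread-local reuse idea proposed above can be sketched roughly as follows. This is not Spark's actual implementation; `ExpensiveSerializer` and `SerializerCache` are hypothetical stand-ins for `KryoSerializer.newKryo()` (which needs Spark on the classpath), used only to show the caching pattern:

```scala
// Sketch of reusing one serializer instance per thread (hypothetical names).
// Constructing a serializer is expensive because it registers many classes,
// so each thread builds one lazily and reuses it across tasks.

// Stand-in for a Kryo serializer that registers ~26 types on construction.
class ExpensiveSerializer {
  // Simulate the costly per-instance registration work done once.
  val registered: Seq[String] = (1 to 26).map(i => s"class$i")
  def serialize(obj: Any): String = obj.toString
}

object SerializerCache {
  // One serializer per thread: instances are never shared across threads,
  // so no locking is needed, and registration runs once per thread rather
  // than once per serializer construction.
  private val local = new ThreadLocal[ExpensiveSerializer] {
    override def initialValue(): ExpensiveSerializer = new ExpensiveSerializer
  }
  def get: ExpensiveSerializer = local.get()
}
```

Repeated calls to `SerializerCache.get` on the same thread return the same instance, which is what makes the registration cost amortize away.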
[jira] [Assigned] (SPARK-12471) Spark daemons should log their pid in the log file
[ https://issues.apache.org/jira/browse/SPARK-12471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12471: Assignee: (was: Apache Spark) > Spark daemons should log their pid in the log file > -- > > Key: SPARK-12471 > URL: https://issues.apache.org/jira/browse/SPARK-12471 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Nong Li > > This is useful when debugging from the log files without the processes > running. This information makes it possible to combine the log files with > other system information (e.g. dmesg output)
[jira] [Assigned] (SPARK-12471) Spark daemons should log their pid in the log file
[ https://issues.apache.org/jira/browse/SPARK-12471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12471: Assignee: Apache Spark > Spark daemons should log their pid in the log file > -- > > Key: SPARK-12471 > URL: https://issues.apache.org/jira/browse/SPARK-12471 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Nong Li >Assignee: Apache Spark > > This is useful when debugging from the log files without the processes > running. This information makes it possible to combine the log files with > other system information (e.g. dmesg output)
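A common pre-Java-9 way for a daemon to discover its own pid so it can log it at startup is parsing the runtime MXBean name, which is conventionally `"pid@hostname"`. This is a JVM convention rather than a guaranteed API, so the sketch below parses defensively; it is an illustration of the technique, not Spark's actual change:

```scala
import java.lang.management.ManagementFactory

// Obtain the current JVM's pid by parsing the RuntimeMXBean name,
// which on common JVMs has the form "pid@hostname". Because the
// format is only a convention, return None if it doesn't parse.
def currentPid(): Option[Long] = {
  val name = ManagementFactory.getRuntimeMXBean.getName
  name.split("@").headOption.flatMap { s =>
    try Some(s.toLong)
    catch { case _: NumberFormatException => None }
  }
}
```

A daemon would log the result once at startup, e.g. alongside its startup banner, so the pid survives in the log file after the process exits.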
[jira] [Commented] (SPARK-12458) Add ExpressionDescription to datetime functions
[ https://issues.apache.org/jira/browse/SPARK-12458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067278#comment-15067278 ] Dilip Biswal commented on SPARK-12458: -- I would like to work on this one. > Add ExpressionDescription to datetime functions > --- > > Key: SPARK-12458 > URL: https://issues.apache.org/jira/browse/SPARK-12458 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >
[jira] [Commented] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067115#comment-15067115 ] Sean Owen commented on SPARK-12453: --- OK, I see what happened here: https://github.com/apache/spark/commit/87f82a5fb9c4350a97c761411069245f07aad46f How about updating to 1.9.40 for consistency? Really, it sounds like there's no point manually setting the SDK version here -- how about preemptively bringing those parts of SPARK-12269 back? Then really it should go into master first, be backported, and then be further updated by 12269. This is why I view it as sort of a duplicate, since it could as well come from back-porting just a subset of 12269. I don't know if a new 1.5.x release will happen. > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to wrong AWS > Java SDK version (1.9.16) referenced with the AWS KCL version (1.3.0). > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 does fail to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a Spark > related implementation.
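For a user application hitting this mismatch, one workaround is to pin the SDK version alongside the KCL in the build. The sbt fragment below is a hypothetical sketch (coordinates and the 1.9.37 version are taken from the issue description above; verify against the KCL 1.3.0 POM):

```scala
// Hypothetical sbt fragment pinning the AWS Java SDK to the version
// that KCL 1.3.0 was built against, overriding the transitive 1.9.16.
libraryDependencies ++= Seq(
  "com.amazonaws" % "amazon-kinesis-client" % "1.3.0",
  "com.amazonaws" % "aws-java-sdk" % "1.9.37" // KCL 1.3.0 expects 1.9.37
)
```

Pinning both coordinates explicitly avoids depending on which transitive version the resolver happens to pick.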
[jira] [Updated] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12463: -- Component/s: Mesos > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode.