[jira] [Assigned] (SPARK-12455) Add ExpressionDescription to window functions
[ https://issues.apache.org/jira/browse/SPARK-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12455: Assignee: Apache Spark (was: Herman van Hovell) > Add ExpressionDescription to window functions > - > > Key: SPARK-12455 > URL: https://issues.apache.org/jira/browse/SPARK-12455 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12465: Assignee: Apache Spark > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode.
[jira] [Assigned] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12465: Assignee: (was: Apache Spark) > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode.
[jira] [Assigned] (SPARK-12457) Add ExpressionDescription to collection functions
[ https://issues.apache.org/jira/browse/SPARK-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12457: Assignee: (was: Apache Spark) > Add ExpressionDescription to collection functions > - > > Key: SPARK-12457 > URL: https://issues.apache.org/jira/browse/SPARK-12457 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai
[jira] [Assigned] (SPARK-12457) Add ExpressionDescription to collection functions
[ https://issues.apache.org/jira/browse/SPARK-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12457: Assignee: Apache Spark > Add ExpressionDescription to collection functions > - > > Key: SPARK-12457 > URL: https://issues.apache.org/jira/browse/SPARK-12457 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark
[jira] [Commented] (SPARK-12464) Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url
[ https://issues.apache.org/jira/browse/SPARK-12464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067045#comment-15067045 ] Apache Spark commented on SPARK-12464: -- User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/10057 > Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url > -- > > Key: SPARK-12464 > URL: https://issues.apache.org/jira/browse/SPARK-12464 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.zookeeper.url and use existing configuration > spark.deploy.zookeeper.url for Mesos cluster mode.
[jira] [Assigned] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12463: Assignee: (was: Apache Spark) > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task >Reporter: Timothy Chen > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode.
[jira] [Resolved] (SPARK-12321) JSON format for logical/physical execution plans
[ https://issues.apache.org/jira/browse/SPARK-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12321. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10311 [https://github.com/apache/spark/pull/10311] > JSON format for logical/physical execution plans > > > Key: SPARK-12321 > URL: https://issues.apache.org/jira/browse/SPARK-12321 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan > Fix For: 2.0.0
[jira] [Resolved] (SPARK-12398) Smart truncation of DataFrame / Dataset toString
[ https://issues.apache.org/jira/browse/SPARK-12398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12398. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10373 [https://github.com/apache/spark/pull/10373] > Smart truncation of DataFrame / Dataset toString > > > Key: SPARK-12398 > URL: https://issues.apache.org/jira/browse/SPARK-12398 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > Labels: starter > Fix For: 2.0.0 > > > When a DataFrame or Dataset has a long schema, we should intelligently > truncate to avoid flooding the screen with unreadable information.
> {code}
> // Standard output
> [a: int, b: int]
> // Truncate many top level fields
> [a: int, b: string ... 10 more fields]
> // Truncate long inner structs
> [a: struct]
> {code}
[jira] [Assigned] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12463: Assignee: Apache Spark > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode.
[jira] [Commented] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067059#comment-15067059 ] Apache Spark commented on SPARK-12465: -- User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/10057 > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode.
[jira] [Assigned] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
[ https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12468: Assignee: (was: Apache Spark) > getParamMap in Pyspark ML API returns empty dictionary in example for > Documentation > --- > > Key: SPARK-12468 > URL: https://issues.apache.org/jira/browse/SPARK-12468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Zachary Brown >Priority: Minor > > The `extractParamMap()` method for a model that has been fit returns an empty > dictionary, e.g. (from the [Pyspark ML API > Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)):
> ```python
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.param import Param, Params
>
> # Prepare training data from a list of (label, features) tuples.
> training = sqlContext.createDataFrame([
>     (1.0, Vectors.dense([0.0, 1.1, 0.1])),
>     (0.0, Vectors.dense([2.0, 1.0, -1.0])),
>     (0.0, Vectors.dense([2.0, 1.3, 1.0])),
>     (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])
>
> # Create a LogisticRegression instance. This instance is an Estimator.
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> # Print out the parameters, documentation, and any default values.
> print "LogisticRegression parameters:\n" + lr.explainParams() + "\n"
>
> # Learn a LogisticRegression model. This uses the parameters stored in lr.
> model1 = lr.fit(training)
>
> # Since model1 is a Model (i.e., a transformer produced by an Estimator),
> # we can view the parameters it used during fit(). This prints the
> # parameter (name: value) pairs, where names are unique IDs for this
> # LogisticRegression instance.
> print "Model 1 was fit using parameters: "
> print model1.extractParamMap()
> ```
[jira] [Created] (SPARK-12454) Add ExpressionDescription to expressions registered in FunctionRegistry
Yin Huai created SPARK-12454: Summary: Add ExpressionDescription to expressions registered in FunctionRegistry Key: SPARK-12454 URL: https://issues.apache.org/jira/browse/SPARK-12454 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai ExpressionDescription is an annotation that contains the doc of a function; when users run {{describe function}}, they see the doc defined in this annotation. You can take a look at {{Upper}} as an example. However, we still have lots of expressions that do not have an ExpressionDescription. It would be great to go through the expressions registered in FunctionRegistry and add ExpressionDescription to those that do not have it.
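[Editor's sketch of what the ticket asks for. The annotation and its `usage`/`extended` fields follow the {{Upper}} example in `org.apache.spark.sql.catalyst.expressions`; the doc strings below are illustrative, not the actual Spark docs, and the snippet elides the expression's implementation.]

```scala
// Sketch only: ExpressionDescription is a Spark-internal annotation whose
// text is surfaced by `DESCRIBE FUNCTION` for expressions registered in
// FunctionRegistry. The strings here are made up for illustration.
@ExpressionDescription(
  usage = "_FUNC_(str) - Returns str with all characters changed to uppercase.",
  extended = "> SELECT _FUNC_('SparkSql');\n 'SPARKSQL'")
case class Upper(child: Expression)
  extends UnaryExpression { /* eval/codegen elided */ }
```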
[jira] [Created] (SPARK-12455) Add ExpressionDescription to window functions
Yin Huai created SPARK-12455: Summary: Add ExpressionDescription to window functions Key: SPARK-12455 URL: https://issues.apache.org/jira/browse/SPARK-12455 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Herman van Hovell
[jira] [Commented] (SPARK-12362) Create a full-fledged built-in SQL parser
[ https://issues.apache.org/jira/browse/SPARK-12362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066864#comment-15066864 ] Nong Li commented on SPARK-12362: - I think it makes sense to inline the hive ql parser into spark sql. This satisfies the requirements in a pretty good way. It is maximally HiveQL compatible and what the existing spark sql integration is built on. The parser uses antlr and looks to be easy to extend going forward. Inlining it would involve taking some of the existing code in the hive.ql.parse package, restricting it to the code that deals with parsing and not semantic analysis. > Create a full-fledged built-in SQL parser > - > > Key: SPARK-12362 > URL: https://issues.apache.org/jira/browse/SPARK-12362 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > Spark currently has two SQL parsers it is using: a simple one based on Scala > parser combinator, and another one based on Hive. > Neither is a good long term solution. The parser combinator one has bad error > messages for users and does not warn when there are conflicts in the defined > grammar. The Hive one depends directly on Hive itself, and as a result, it is > very difficult to introduce new grammar or fix bugs. > The goal of the ticket is to create a single SQL query parser that is > powerful enough to replace the existing ones. The requirements for the new > parser are: > 1. Can support almost all of HiveQL > 2. Can support all existing SQL parser built using Scala parser combinators > 3. Can be used for expression parsing in addition to SQL query parsing > 4. Can provide good error messages for incorrect syntax > Rather than building one from scratch, we should investigate whether we can > leverage existing open source projects such as Hive (by inlining the parser > part) or Calcite. 
[jira] [Commented] (SPARK-12457) Add ExpressionDescription to collection functions
[ https://issues.apache.org/jira/browse/SPARK-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066892#comment-15066892 ] Xiao Li commented on SPARK-12457: - Let me pick this one? : ) > Add ExpressionDescription to collection functions > - > > Key: SPARK-12457 > URL: https://issues.apache.org/jira/browse/SPARK-12457 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai
[jira] [Commented] (SPARK-12396) Once the driver client has registered successfully, it still retries connecting to the master
[ https://issues.apache.org/jira/browse/SPARK-12396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067023#comment-15067023 ] Apache Spark commented on SPARK-12396: -- User 'echoTomei' has created a pull request for this issue: https://github.com/apache/spark/pull/10407 > Once the driver client has registered successfully, it still retries connecting > to the master > - > > Key: SPARK-12396 > URL: https://issues.apache.org/jira/browse/SPARK-12396 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2 >Reporter: echo >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > As described in AppClient.scala, once the driver connects to a master > successfully, all scheduling work and Futures will be cancelled. But > currently it still tries to connect to the master, which should not happen.
[jira] [Updated] (SPARK-12430) Temporary folders do not get deleted after Task completes causing problems with disk space.
[ https://issues.apache.org/jira/browse/SPARK-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fede Bar updated SPARK-12430: - Component/s: Spark Core > Temporary folders do not get deleted after Task completes causing problems > with disk space. > --- > > Key: SPARK-12430 > URL: https://issues.apache.org/jira/browse/SPARK-12430 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2 > Environment: Ubuntu server >Reporter: Fede Bar > > We are experiencing an issue with automatic /tmp folder deletion after > framework completes. Completing a M/R job using Spark 1.5.2 (same behavior as > Spark 1.5.1) over Mesos will not delete some temporary folders causing free > disk space on server to exhaust. > Behavior of M/R job using Spark 1.4.1 over Mesos cluster: > - Launched using spark-submit on one cluster node. > - Following folders are created: */tmp/mesos/slaves/id#* , */tmp/spark-#/* , > */tmp/spark-#/blockmgr-#* > - When task is completed */tmp/spark-#/* gets deleted along with > */tmp/spark-#/blockmgr-#* sub-folder. > Behavior of M/R job using Spark 1.5.2 over Mesos cluster (same identical job): > - Launched using spark-submit on one cluster node. > - Following folders are created: */tmp/mesos/mesos/slaves/id** * , > */tmp/spark-***/ * ,{color:red} /tmp/blockmgr-***{color} > - When task is completed */tmp/spark-***/ * gets deleted but NOT shuffle > container folder {color:red} /tmp/blockmgr-***{color} > Unfortunately, {color:red} /tmp/blockmgr-***{color} can account for several > GB depending on the job that ran. Over time this causes disk space to become > full with consequences that we all know. > Running a shell script would probably work but it is difficult to identify > folders in use by a running M/R or stale folders. I did notice similar issues > opened by other users marked as "resolved", but none seems to exactly match > the above behavior. > I really hope someone has insights on how to fix it. 
> Thank you very much!
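[Editor's note: the report above says a cleanup script is hard to write because in-use folders are indistinguishable from stale ones. A minimal hedged sketch follows; it assumes `spark.local.dir` points at `/tmp` (the default) and uses directory age alone as a heuristic, so anything belonging to a still-running job must be excluded by hand before deleting.]

```shell
# stale_blockmgr DIR DAYS: print blockmgr-* directories directly under DIR
# whose mtime is older than DAYS. It only lists candidates; review the
# output before removing anything, since age is a heuristic.
stale_blockmgr() {
  find "$1" -maxdepth 1 -type d -name 'blockmgr-*' -mtime +"$2" -print
}

# Example: list candidates under the default spark.local.dir, older than 2 days
stale_blockmgr /tmp 2
```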
[jira] [Updated] (SPARK-12430) Temporary folders do not get deleted after Task completes causing problems with disk space.
[ https://issues.apache.org/jira/browse/SPARK-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fede Bar updated SPARK-12430: - Fix Version/s: (was: 1.4.1) > Temporary folders do not get deleted after Task completes causing problems > with disk space. > --- > > Key: SPARK-12430 > URL: https://issues.apache.org/jira/browse/SPARK-12430 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2 > Environment: Ubuntu server >Reporter: Fede Bar > > We are experiencing an issue with automatic /tmp folder deletion after > framework completes. Completing a M/R job using Spark 1.5.2 (same behavior as > Spark 1.5.1) over Mesos will not delete some temporary folders causing free > disk space on server to exhaust. > Behavior of M/R job using Spark 1.4.1 over Mesos cluster: > - Launched using spark-submit on one cluster node. > - Following folders are created: */tmp/mesos/slaves/id#* , */tmp/spark-#/* , > */tmp/spark-#/blockmgr-#* > - When task is completed */tmp/spark-#/* gets deleted along with > */tmp/spark-#/blockmgr-#* sub-folder. > Behavior of M/R job using Spark 1.5.2 over Mesos cluster (same identical job): > - Launched using spark-submit on one cluster node. > - Following folders are created: */tmp/mesos/mesos/slaves/id** * , > */tmp/spark-***/ * ,{color:red} /tmp/blockmgr-***{color} > - When task is completed */tmp/spark-***/ * gets deleted but NOT shuffle > container folder {color:red} /tmp/blockmgr-***{color} > Unfortunately, {color:red} /tmp/blockmgr-***{color} can account for several > GB depending on the job that ran. Over time this causes disk space to become > full with consequences that we all know. > Running a shell script would probably work but it is difficult to identify > folders in use by a running M/R or stale folders. I did notice similar issues > opened by other users marked as "resolved", but none seems to exactly match > the above behavior. 
> I really hope someone has insights on how to fix it. > Thank you very much!
[jira] [Resolved] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12453. --- Resolution: Duplicate > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to the wrong > AWS Java SDK version (1.9.16) referenced with the AWS KCL version (1.3.0). > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 fails to get data out of the stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a Spark > related implementation.
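[Editor's sketch of the workaround the report implies for a user's own build: pin the AWS SDK to the version the report says KCL 1.3.0 expects. The Maven coordinates below are assumptions, not verified against the kinesis-asl pom; adjust to your build.]

```xml
<!-- Hypothetical override for an application pom: force the aws-java-sdk
     version (1.9.37, per the report) that matches amazon-kinesis-client 1.3.0,
     instead of the 1.9.16 pulled in transitively. -->
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk</artifactId>
  <version>1.9.37</version>
</dependency>
```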
[jira] [Created] (SPARK-12462) Add ExpressionDescription to misc non-aggregate functions
Yin Huai created SPARK-12462: Summary: Add ExpressionDescription to misc non-aggregate functions Key: SPARK-12462 URL: https://issues.apache.org/jira/browse/SPARK-12462 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai
[jira] [Updated] (SPARK-12430) Temporary folders do not get deleted after Task completes causing problems with disk space.
[ https://issues.apache.org/jira/browse/SPARK-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fede Bar updated SPARK-12430: - Component/s: (was: Spark Submit) (was: Shuffle) (was: Block Manager) > Temporary folders do not get deleted after Task completes causing problems > with disk space. > --- > > Key: SPARK-12430 > URL: https://issues.apache.org/jira/browse/SPARK-12430 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2 > Environment: Ubuntu server >Reporter: Fede Bar > > We are experiencing an issue with automatic /tmp folder deletion after > framework completes. Completing a M/R job using Spark 1.5.2 (same behavior as > Spark 1.5.1) over Mesos will not delete some temporary folders causing free > disk space on server to exhaust. > Behavior of M/R job using Spark 1.4.1 over Mesos cluster: > - Launched using spark-submit on one cluster node. > - Following folders are created: */tmp/mesos/slaves/id#* , */tmp/spark-#/* , > */tmp/spark-#/blockmgr-#* > - When task is completed */tmp/spark-#/* gets deleted along with > */tmp/spark-#/blockmgr-#* sub-folder. > Behavior of M/R job using Spark 1.5.2 over Mesos cluster (same identical job): > - Launched using spark-submit on one cluster node. > - Following folders are created: */tmp/mesos/mesos/slaves/id** * , > */tmp/spark-***/ * ,{color:red} /tmp/blockmgr-***{color} > - When task is completed */tmp/spark-***/ * gets deleted but NOT shuffle > container folder {color:red} /tmp/blockmgr-***{color} > Unfortunately, {color:red} /tmp/blockmgr-***{color} can account for several > GB depending on the job that ran. Over time this causes disk space to become > full with consequences that we all know. > Running a shell script would probably work but it is difficult to identify > folders in use by a running M/R or stale folders. I did notice similar issues > opened by other users marked as "resolved", but none seems to exactly match > the above behavior. 
> I really hope someone has insights on how to fix it. > Thank you very much!
[jira] [Commented] (SPARK-12464) Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url
[ https://issues.apache.org/jira/browse/SPARK-12464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067062#comment-15067062 ] Andrew Or commented on SPARK-12464: --- By the way for future reference you probably don't need a separate issue for each config. Just have an issue that says `Remove spark.deploy.mesos.* and use spark.deploy.* instead`. Since you already opened these we can just keep them. > Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url > -- > > Key: SPARK-12464 > URL: https://issues.apache.org/jira/browse/SPARK-12464 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.zookeeper.url and use existing configuration > spark.deploy.zookeeper.url for Mesos cluster mode.
[jira] [Created] (SPARK-12457) Add ExpressionDescription to collection functions
Yin Huai created SPARK-12457: Summary: Add ExpressionDescription to collection functions Key: SPARK-12457 URL: https://issues.apache.org/jira/browse/SPARK-12457 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai
[jira] [Assigned] (SPARK-12464) Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url
[ https://issues.apache.org/jira/browse/SPARK-12464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12464: Assignee: Apache Spark > Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url > -- > > Key: SPARK-12464 > URL: https://issues.apache.org/jira/browse/SPARK-12464 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.zookeeper.url and use existing configuration > spark.deploy.zookeeper.url for Mesos cluster mode.
[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066857#comment-15066857 ] Timothy Hunter commented on SPARK-12247: Thanks for working on it, [~BenFradet]! > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS
[jira] [Commented] (SPARK-12362) Create a full-fledged built-in SQL parser
[ https://issues.apache.org/jira/browse/SPARK-12362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066883#comment-15066883 ] Reynold Xin commented on SPARK-12362: - +1 > Create a full-fledged built-in SQL parser > - > > Key: SPARK-12362 > URL: https://issues.apache.org/jira/browse/SPARK-12362 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > Spark currently has two SQL parsers it is using: a simple one based on Scala > parser combinator, and another one based on Hive. > Neither is a good long term solution. The parser combinator one has bad error > messages for users and does not warn when there are conflicts in the defined > grammar. The Hive one depends directly on Hive itself, and as a result, it is > very difficult to introduce new grammar or fix bugs. > The goal of the ticket is to create a single SQL query parser that is > powerful enough to replace the existing ones. The requirements for the new > parser are: > 1. Can support almost all of HiveQL > 2. Can support all existing SQL parser built using Scala parser combinators > 3. Can be used for expression parsing in addition to SQL query parsing > 4. Can provide good error messages for incorrect syntax > Rather than building one from scratch, we should investigate whether we can > leverage existing open source projects such as Hive (by inlining the parser > part) or Calcite.
[jira] [Created] (SPARK-12458) Add ExpressionDescription to datetime functions
Yin Huai created SPARK-12458: Summary: Add ExpressionDescription to datetime functions Key: SPARK-12458 URL: https://issues.apache.org/jira/browse/SPARK-12458 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12459) Add ExpressionDescription to string functions
Yin Huai created SPARK-12459: Summary: Add ExpressionDescription to string functions Key: SPARK-12459 URL: https://issues.apache.org/jira/browse/SPARK-12459 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12429) Update documentation to show how to use accumulators and broadcasts with Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12429: Assignee: Apache Spark (was: Shixiong Zhu) > Update documentation to show how to use accumulators and broadcasts with > Spark Streaming > > > Key: SPARK-12429 > URL: https://issues.apache.org/jira/browse/SPARK-12429 > Project: Spark > Issue Type: Documentation > Components: Documentation, Streaming >Reporter: Shixiong Zhu >Assignee: Apache Spark > > Accumulators and broadcasts used with Spark Streaming do not work reliably when > restarting after driver failures. We need to add examples to guide users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
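One common way to make such variables survive a driver restart from a checkpoint is a lazily-instantiated singleton that re-creates the accumulator on first use instead of capturing it in a checkpointed closure. A minimal plain-Python sketch of the pattern (the `FakeContext` class and `get_counter` helper are illustrative stand-ins, not Spark API):

```python
# Sketch of the lazily-instantiated singleton pattern for accumulators/
# broadcasts with checkpointed streaming. FakeContext stands in for a
# SparkContext so the example runs without Spark.

class FakeContext:
    """Stand-in for SparkContext; only here to show the pattern."""
    def accumulator(self, initial):
        return {"value": initial}

_counter = None

def get_counter(ctx):
    # Re-create the accumulator lazily after a (re)start; every caller
    # then shares the same instance instead of a checkpointed copy.
    global _counter
    if _counter is None:
        _counter = ctx.accumulator(0)
    return _counter

ctx = FakeContext()
assert get_counter(ctx) is get_counter(ctx)  # one shared instance
```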
[jira] [Created] (SPARK-12461) Add ExpressionDescription to math functions
Yin Huai created SPARK-12461: Summary: Add ExpressionDescription to math functions Key: SPARK-12461 URL: https://issues.apache.org/jira/browse/SPARK-12461 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12460) Add ExpressionDescription to aggregate functions
Yin Huai created SPARK-12460: Summary: Add ExpressionDescription to aggregate functions Key: SPARK-12460 URL: https://issues.apache.org/jira/browse/SPARK-12460 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12467) Get rid of sorting in Row's constructor in pyspark
Irakli Machabeli created SPARK-12467: Summary: Get rid of sorting in Row's constructor in pyspark Key: SPARK-12467 URL: https://issues.apache.org/jira/browse/SPARK-12467 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.2 Reporter: Irakli Machabeli Priority: Minor The current implementation of Row's __new__ sorts columns by name. First of all, there is no obvious reason to sort; second, if one converts a dataframe to an rdd and then back to a dataframe, the order of columns changes. While this is not a bug, it nevertheless makes looking at the data really inconvenient.
def __new__(self, *args, **kwargs):
    if args and kwargs:
        raise ValueError("Can not use both args "
                         "and kwargs to create Row")
    if args:
        # create row class or objects
        return tuple.__new__(self, args)
    elif kwargs:
        # create row objects
        names = sorted(kwargs.keys())  # just get rid of sorting here!!!
        row = tuple.__new__(self, [kwargs[n] for n in names])
        row.__fields__ = names
        return row
    else:
        raise ValueError("No args or kwargs")
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
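A standalone sketch of the Row.__new__ behavior described above (no pyspark needed; `MiniRow` is a toy stand-in for pyspark.sql.Row): because kwargs are sorted by name, the declared order (b, a) comes back as (a, b).

```python
# Toy reproduction of the sorting behavior the reporter objects to.
class MiniRow(tuple):
    def __new__(cls, **kwargs):
        names = sorted(kwargs.keys())  # the sort in question
        row = tuple.__new__(cls, [kwargs[n] for n in names])
        row.__fields__ = names
        return row

r = MiniRow(b=1, a=2)
print(r.__fields__)  # ['a', 'b'] -- declared order (b, a) was lost
print(tuple(r))      # (2, 1)
```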
[jira] [Commented] (SPARK-12231) Failed to generate predicate Error when using dropna
[ https://issues.apache.org/jira/browse/SPARK-12231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067033#comment-15067033 ] Apache Spark commented on SPARK-12231: -- User 'kevinyu98' has created a pull request for this issue: https://github.com/apache/spark/pull/10388 > Failed to generate predicate Error when using dropna > > > Key: SPARK-12231 > URL: https://issues.apache.org/jira/browse/SPARK-12231 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.2, 1.6.0 > Environment: python version: 2.7.9 > os: ubuntu 14.04 >Reporter: yahsuan, chang > > Code to reproduce the error: > # write.py > {code} > import pyspark > sc = pyspark.SparkContext() > sqlc = pyspark.SQLContext(sc) > df = sqlc.range(10) > df1 = df.withColumn('a', df['id'] * 2) > df1.write.partitionBy('id').parquet('./data') > {code} > # read.py > {code} > import pyspark > sc = pyspark.SparkContext() > sqlc = pyspark.SQLContext(sc) > df2 = sqlc.read.parquet('./data') > df2.dropna().count() > {code} > $ spark-submit write.py > $ spark-submit read.py > # error message > {code} > 15/12/08 17:20:34 ERROR Filter: Failed to generate predicate, fallback to > interpreted org.apache.spark.sql.catalyst.errors.package$TreeNodeException: > Binding attribute, tree: a#0L > ... > {code} > If the data is written without partitionBy, the error won't happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
Zachary Brown created SPARK-12468: - Summary: getParamMap in Pyspark ML API returns empty dictionary in example for Documentation Key: SPARK-12468 URL: https://issues.apache.org/jira/browse/SPARK-12468 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.5.2 Reporter: Zachary Brown Priority: Minor The `extractParamMap()` method for a model that has been fit returns an empty dictionary, e.g. (from the [Pyspark ML API Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)): ```python from pyspark.mllib.linalg import Vectors from pyspark.ml.classification import LogisticRegression from pyspark.ml.param import Param, Params # Prepare training data from a list of (label, features) tuples. training = sqlContext.createDataFrame([ (1.0, Vectors.dense([0.0, 1.1, 0.1])), (0.0, Vectors.dense([2.0, 1.0, -1.0])), (0.0, Vectors.dense([2.0, 1.3, 1.0])), (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) # Create a LogisticRegression instance. This instance is an Estimator. lr = LogisticRegression(maxIter=10, regParam=0.01) # Print out the parameters, documentation, and any default values. print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" # Learn a LogisticRegression model. This uses the parameters stored in lr. model1 = lr.fit(training) # Since model1 is a Model (i.e., a transformer produced by an Estimator), # we can view the parameters it used during fit(). # This prints the parameter (name: value) pairs, where names are unique IDs for this # LogisticRegression instance. print "Model 1 was fit using parameters: " print model1.extractParamMap() ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12463: Assignee: (was: Apache Spark) > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task >Reporter: Timothy Chen > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12463: Assignee: Apache Spark > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12455) Add ExpressionDescription to window functions
[ https://issues.apache.org/jira/browse/SPARK-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12455: Assignee: Herman van Hovell (was: Apache Spark) > Add ExpressionDescription to window functions > - > > Key: SPARK-12455 > URL: https://issues.apache.org/jira/browse/SPARK-12455 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Herman van Hovell > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
Timothy Chen created SPARK-12465: Summary: Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir Key: SPARK-12465 URL: https://issues.apache.org/jira/browse/SPARK-12465 Project: Spark Issue Type: Task Components: Mesos Reporter: Timothy Chen Remove spark.deploy.mesos.zookeeper.dir and use existing configuration spark.deploy.zookeeper.dir for Mesos cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12464) Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url
Timothy Chen created SPARK-12464: Summary: Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url Key: SPARK-12464 URL: https://issues.apache.org/jira/browse/SPARK-12464 Project: Spark Issue Type: Task Components: Mesos Reporter: Timothy Chen Remove spark.deploy.mesos.zookeeper.url and use existing configuration spark.deploy.zookeeper.url for Mesos cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12331) R^2 for regression through the origin
[ https://issues.apache.org/jira/browse/SPARK-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12331: Assignee: Apache Spark > R^2 for regression through the origin > - > > Key: SPARK-12331 > URL: https://issues.apache.org/jira/browse/SPARK-12331 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Imran Younus >Assignee: Apache Spark >Priority: Minor > > The value of R^2 (coefficient of determination) obtained from > LinearRegressionModel is not consistent with R and statsmodels when > fitIntercept is false, i.e., regression through the origin. In this case, both > R and statsmodels use the definition of R^2 given by eq(4') in the following > review paper: > https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf > Here is the definition from this paper: > R^2 = \sum \hat{y}_i^2 / \sum y_i^2 > The paper also describes why this should be the case. I've double checked > that the values of R^2 from statsmodels and R are consistent with this > definition. On the other hand, scikit-learn doesn't use the above definition. > I would recommend using the above definition in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
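A quick numeric check of the through-the-origin definition quoted above, R^2 = sum(yhat_i^2) / sum(y_i^2), for a no-intercept least-squares fit (pure Python; the data is made up for illustration):

```python
# Through-the-origin R^2 per the quoted definition. For a no-intercept
# least-squares line, the slope is b = sum(x*y) / sum(x*x).
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.9]

b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
yhat = [b * xi for xi in x]

# R^2 = sum(yhat^2) / sum(y^2), no mean-centering when there is no intercept.
r2 = sum(v * v for v in yhat) / sum(v * v for v in y)
print(round(r2, 4))  # -> 0.9977
```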
[jira] [Assigned] (SPARK-12331) R^2 for regression through the origin
[ https://issues.apache.org/jira/browse/SPARK-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12331: Assignee: (was: Apache Spark) > R^2 for regression through the origin > - > > Key: SPARK-12331 > URL: https://issues.apache.org/jira/browse/SPARK-12331 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Imran Younus >Priority: Minor > > The value of R^2 (coefficient of determination) obtained from > LinearRegressionModel is not consistent with R and statsmodels when > fitIntercept is false, i.e., regression through the origin. In this case, both > R and statsmodels use the definition of R^2 given by eq(4') in the following > review paper: > https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf > Here is the definition from this paper: > R^2 = \sum \hat{y}_i^2 / \sum y_i^2 > The paper also describes why this should be the case. I've double checked > that the values of R^2 from statsmodels and R are consistent with this > definition. On the other hand, scikit-learn doesn't use the above definition. > I would recommend using the above definition in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)
[ https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nong Li updated SPARK-12394: Attachment: BucketedTables.pdf Here is a design for how we can support bucketed tables. > Support writing out pre-hash-partitioned data and exploit that in join > optimizations to avoid shuffle (i.e. bucketing in Hive) > -- > > Key: SPARK-12394 > URL: https://issues.apache.org/jira/browse/SPARK-12394 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: BucketedTables.pdf > > > In many cases users know ahead of time the columns that they will be joining > or aggregating on. Ideally they should be able to leverage this information > and pre-shuffle the data so that subsequent queries do not require a shuffle. > Hive supports this functionality by allowing the user to define buckets, > which are hash partitions of the data based on some key. > - Allow the user to specify a set of columns when caching or writing out data > - Allow the user to specify some parallelism > - Shuffle the data when writing / caching such that it is distributed by these > columns > - When planning/executing a query, use this distribution to avoid another > shuffle when reading, assuming the join or aggregation is compatible with the > columns specified > - Should work with existing save modes: append, overwrite, etc. > - Should work at least with all Hadoop FS data sources > - Should work with any data source when caching -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
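The core idea behind bucketing can be sketched in plain Python (this is an illustration of the technique, not Spark or Hive API): if both sides of a join are pre-partitioned by hash(key) % n into the same number of buckets, the join only needs to pair bucket i with bucket i, so neither side is re-shuffled at read time.

```python
# Illustrative bucket-wise join: equal keys land in equal bucket indices
# on both sides, so no cross-bucket data movement is needed.
N_BUCKETS = 4

def bucketize(rows, n=N_BUCKETS):
    """Hash-partition (key, value) rows into n buckets by key."""
    buckets = [[] for _ in range(n)]
    for key, value in rows:
        buckets[hash(key) % n].append((key, value))
    return buckets

left = bucketize([("a", 1), ("b", 2), ("c", 3)])
right = bucketize([("a", 10), ("c", 30)])

joined = []
for lb, rb in zip(left, right):  # bucket i joins only bucket i
    rmap = dict(rb)
    joined.extend((k, (v, rmap[k])) for k, v in lb if k in rmap)

print(sorted(joined))  # [('a', (1, 10)), ('c', (3, 30))]
```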
[jira] [Commented] (SPARK-12279) Requesting a HBase table with kerberos is not working
[ https://issues.apache.org/jira/browse/SPARK-12279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067020#comment-15067020 ] Y Bodnar commented on SPARK-12279: -- Hi [~pbeauvois], it's odd that there are no messages related to HBase tokens. According to yarn.Client, the message "Attempting to fetch HBase security token." should appear. {code:title=Client.scala|borderStyle=solid} def obtainTokenForHBase(conf: Configuration, credentials: Credentials): Unit = { if (UserGroupInformation.isSecurityEnabled) { val mirror = universe.runtimeMirror(getClass.getClassLoader) try { val confCreate = mirror.classLoader. loadClass("org.apache.hadoop.hbase.HBaseConfiguration"). getMethod("create", classOf[Configuration]) val obtainToken = mirror.classLoader. loadClass("org.apache.hadoop.hbase.security.token.TokenUtil"). getMethod("obtainToken", classOf[Configuration]) logDebug("Attempting to fetch HBase security token.") {code} I would suggest trying two things: 1. Check UserGroupInformation.isSecurityEnabled from your code. If it's false, then no attempt to obtain a token is made. 2. Print the HBaseConfiguration and check security-related options (like hbase.security.authentication) to see whether they're properly set and the hbase-site.xml you provide is actually applied. > Requesting a HBase table with kerberos is not working > - > > Key: SPARK-12279 > URL: https://issues.apache.org/jira/browse/SPARK-12279 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.2 > Environment: Spark 1.5.2 / HBase 1.1.2 / Hadoop 2.7.1 / Zookeeper > 3.4.5 / Authentication done through Kerberos >Reporter: Pierre Beauvois > > I can't read an HBase table with Spark 1.5.2.
> I added the option "spark.driver.extraClassPath" in the spark-defaults.conf > which contains the HBASE_CONF_DIR as below: > spark.driver.extraClassPath = /opt/application/Hbase/current/conf/ > On the driver, I started spark-shell (I was running it in yarn-client mode) > {code} > [my_user@uabigspark01 ~]$ spark-shell -v --name HBaseTest --jars > /opt/application/Hbase/current/lib/hbase-common-1.1.2.jar,/opt/application/Hbase/current/lib/hbase-server-1.1.2.jar,/opt/application/Hbase/current/lib/hbase-client-1.1.2.jar,/opt/application/Hbase/current/lib/hbase-protocol-1.1.2.jar,/opt/application/Hbase/current/lib/protobuf-java-2.5.0.jar,/opt/application/Hbase/current/lib/htrace-core-3.1.0-incubating.jar,/opt/application/Hbase/current/lib/hbase-annotations-1.1.2.jar,/opt/application/Hbase/current/lib/guava-12.0.1.jar > {code} > Then I ran the following lines: > {code} > scala> import org.apache.spark._ > import org.apache.spark._ > scala> import org.apache.spark.rdd.NewHadoopRDD > import org.apache.spark.rdd.NewHadoopRDD > scala> import org.apache.hadoop.fs.Path > import org.apache.hadoop.fs.Path > scala> import org.apache.hadoop.hbase.util.Bytes > import org.apache.hadoop.hbase.util.Bytes > scala> import org.apache.hadoop.hbase.HColumnDescriptor > import org.apache.hadoop.hbase.HColumnDescriptor > scala> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor} > import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor} > scala> import org.apache.hadoop.hbase.client.{HBaseAdmin, Put, HTable, Result} > import org.apache.hadoop.hbase.client.{HBaseAdmin, Put, HTable, Result} > scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat > import org.apache.hadoop.hbase.mapreduce.TableInputFormat > scala> import org.apache.hadoop.hbase.io.ImmutableBytesWritable > import org.apache.hadoop.hbase.io.ImmutableBytesWritable > scala> val conf = HBaseConfiguration.create() > conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, > yarn-site.xml, hdfs-default.xml, hdfs-site.xml, hbase-default.xml, > hbase-site.xml > scala> conf.addResource(new > Path("/opt/application/Hbase/current/conf/hbase-site.xml")) > scala> conf.set("hbase.zookeeper.quorum", "FQDN1:2181,FQDN2:2181,FQDN3:2181") > scala> conf.set(TableInputFormat.INPUT_TABLE, "user:noheader") > scala> val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], > classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], > classOf[org.apache.hadoop.hbase.client.Result]) > 2015-12-09 15:17:58,890 INFO [main] storage.MemoryStore: > ensureFreeSpace(266248) called with curMem=0, maxMem=556038881 > 2015-12-09 15:17:58,892 INFO [main] storage.MemoryStore: Block broadcast_0 > stored as values in memory (estimated size 260.0 KB, free 530.0 MB) > 2015-12-09 15:17:59,196 INFO [main] storage.MemoryStore: > ensureFreeSpace(32808) called with curMem=266248, maxMem=556038881 > 2015-12-09 15:17:59,197 INFO [main]
[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067021#comment-15067021 ] Benjamin Fradet commented on SPARK-12247: - Ok thanks, I'll rework the examples accordingly. > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
[ https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067080#comment-15067080 ] Zachary Brown commented on SPARK-12468: --- Found a possible fix for this by modifying the `_fit()` method of the JavaEstimator class in `python/pyspark/ml/wrapper.py` to update the paramMap of the returned model. Created a pull request for it here: https://github.com/apache/spark/pull/10419 > getParamMap in Pyspark ML API returns empty dictionary in example for > Documentation > --- > > Key: SPARK-12468 > URL: https://issues.apache.org/jira/browse/SPARK-12468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Zachary Brown >Priority: Minor > > The `extractParamMap()` method for a model that has been fit returns an empty > dictionary, e.g. (from the [Pyspark ML API > Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)): > ```python > from pyspark.mllib.linalg import Vectors > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.param import Param, Params > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). 
> # This prints the parameter (name: value) pairs, where names are unique IDs > for this > # LogisticRegression instance. > print "Model 1 was fit using parameters: " > print model1.extractParamMap() > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12452) Add exception details to TaskCompletionListener/TaskContext
[ https://issues.apache.org/jira/browse/SPARK-12452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neelesh Shastry updated SPARK-12452: Component/s: (was: Streaming) Spark Core > Add exception details to TaskCompletionListener/TaskContext > --- > > Key: SPARK-12452 > URL: https://issues.apache.org/jira/browse/SPARK-12452 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Neelesh Shastry >Priority: Minor > > TaskCompletionListeners are called without success/failure details. > If we change this > {code} > trait TaskCompletionListener extends EventListener { > def onTaskCompletion(context: TaskContext) > } > class TaskContextImpl { > > private[spark] def markTaskCompleted(throwable: Option[Throwable]): Unit > > listener.onTaskCompletion(this, throwable) > } > {code} > to something like > {code} > trait TaskCompletionListener extends EventListener { > def onTaskCompletion(context: TaskContext, throwable: Option[Throwable] = None) > } > {code} > .. and in Task.scala > {code} > var throwable: Option[Throwable] = None > try { > runTask(context) > } catch { > case t: Throwable => throwable = Some(t) > } finally { > context.markTaskCompleted(throwable) > TaskContext.unset() > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
Timothy Chen created SPARK-12463: Summary: Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode Key: SPARK-12463 URL: https://issues.apache.org/jira/browse/SPARK-12463 Project: Spark Issue Type: Task Reporter: Timothy Chen Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode configuration for cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12453: Assignee: Apache Spark > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Assignee: Apache Spark >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to the wrong AWS > Java SDK version (1.9.16) being referenced with AWS KCL version 1.3.0. > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 fails to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a > Spark-related implementation issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066852#comment-15066852 ] Apache Spark commented on SPARK-12453: -- User 'Schadix' has created a pull request for this issue: https://github.com/apache/spark/pull/10416 > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to the wrong AWS > Java SDK version (1.9.16) being referenced with AWS KCL version 1.3.0. > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 fails to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a > Spark-related implementation issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066854#comment-15066854 ] Timothy Hunter commented on SPARK-12247: If we could import all the code that builds the ratings dataframe {{val ratings = sc.textFile(params.ratings).map(Rating.parseRating).cache()}}, that would be ideal. > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12453: Assignee: (was: Apache Spark) > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to the wrong AWS > Java SDK version (1.9.16) being referenced with AWS KCL version 1.3.0. > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 fails to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a > Spark-related implementation issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12466) Harmless Master NPE in tests
[ https://issues.apache.org/jira/browse/SPARK-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12466: Assignee: Apache Spark (was: Andrew Or) > Harmless Master NPE in tests > > > Key: SPARK-12466 > URL: https://issues.apache.org/jira/browse/SPARK-12466 > Project: Spark > Issue Type: Bug > Components: Deploy, Tests >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Apache Spark > Fix For: 1.6.1, 2.0.0 > > > {code} > [info] ReplayListenerSuite: > [info] - Simple replay (58 milliseconds) > java.lang.NullPointerException > at > org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982) > at > org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980) > at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) > at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > at scala.concurrent.Promise$class.complete(Promise.scala:55) > at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > [info] - End-to-end replay (10 seconds, 755 milliseconds) > {code} > 
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull > caused by https://github.com/apache/spark/pull/10284 > Thanks to [~ted_yu] for reporting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12396) Once the driver client has registered successfully, it still retries connecting to the master.
[ https://issues.apache.org/jira/browse/SPARK-12396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12396: Assignee: Apache Spark > Once the driver client has registered successfully, it still retries connecting to the > master. > - > > Key: SPARK-12396 > URL: https://issues.apache.org/jira/browse/SPARK-12396 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2 >Reporter: echo >Assignee: Apache Spark >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > As described in AppClient.scala, once the driver connects to a master > successfully, all scheduling work and Futures will be cancelled. However, it > currently still tries to connect to the master, which should not happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12396) Once the driver client has registered successfully, it still retries connecting to the master.
[ https://issues.apache.org/jira/browse/SPARK-12396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12396: Assignee: (was: Apache Spark) > Once the driver client has registered successfully, it still retries connecting to the > master. > - > > Key: SPARK-12396 > URL: https://issues.apache.org/jira/browse/SPARK-12396 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1, 1.5.2 >Reporter: echo >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > As described in AppClient.scala, once the driver connects to a master > successfully, all scheduling work and Futures will be cancelled. However, it > currently still tries to connect to the master, which should not happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12465: Assignee: Apache Spark > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067065#comment-15067065 ] Apache Spark commented on SPARK-12463: -- User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/10057 > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task >Reporter: Timothy Chen > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12465: Assignee: (was: Apache Spark) > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12456) Add ExpressionDescription to misc functions
Yin Huai created SPARK-12456: Summary: Add ExpressionDescription to misc functions Key: SPARK-12456 URL: https://issues.apache.org/jira/browse/SPARK-12456 Project: Spark Issue Type: Sub-task Reporter: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12454) Add ExpressionDescription to expressions that are registered in FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-12454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12454: - Description: ExpressionDescription is an annotation that contains the doc of a function; when users run {{describe function}}, they can see the doc defined in this annotation. You can take a look at {{Upper}} as an example. However, we still have lots of expressions that do not have an ExpressionDescription. It would be great to go through the expressions registered in FunctionRegistry and add ExpressionDescription to those that do not have it. A list of expressions (and their categories) registered in the function registry can be found at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L117-L296. was: ExpressionDescription is a annotation that contains doc of a function and when users use {{describe function}}, users can see the doc defined in this annotation. You can take a look at {{Upper}} as an example. However, we still have lots of expression that do not have ExpressionDescription. It will be great to take a look at expressions registered in FunctionRegistry and add ExpressionDescription to those expression that do not have it.. > Add ExpressionDescription to expressions that are registered in FunctionRegistry > --- > > Key: SPARK-12454 > URL: https://issues.apache.org/jira/browse/SPARK-12454 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai > > ExpressionDescription is an annotation that contains the doc of a function; > when users run {{describe function}}, they can see the doc defined in this > annotation. You can take a look at {{Upper}} as an example. > However, we still have lots of expressions that do not have an > ExpressionDescription. It would be great to go through the expressions > registered in FunctionRegistry and add ExpressionDescription to those > that do not have it. 
> A list of expressions (and their categories) registered in function registry > can be found at > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L117-L296. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12331) R^2 for regression through the origin
[ https://issues.apache.org/jira/browse/SPARK-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066898#comment-15066898 ] Apache Spark commented on SPARK-12331: -- User 'iyounus' has created a pull request for this issue: https://github.com/apache/spark/pull/10384 > R^2 for regression through the origin > - > > Key: SPARK-12331 > URL: https://issues.apache.org/jira/browse/SPARK-12331 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Imran Younus >Priority: Minor > > The value of R^2 (coefficient of determination) obtained from > LinearRegressionModel is not consistent with R and statsmodels when the > fitIntercept is false i.e., regression through the origin. In this case, both > R and statsmodels use the definition of R^2 given by eq(4') in the following > review paper: > https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf > Here is the definition from this paper: > R^2 = \sum(\hat{y}_i^2)/\sum(y_i^2) > The paper also describes why this should be the case. I've double checked > that the value of R^2 from statsmodels and R are consistent with this > definition. On the other hand, scikit-learn doesn't use the above definition. > I would recommend using the above definition in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
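[Editor's note] The definition cited in SPARK-12331 is simple to compute directly. A minimal sketch of it in plain Python (illustrative only; the function name and data are made up, and this is not Spark's implementation):

```python
# R^2 for regression through the origin, per the definition cited above:
# R^2 = sum(yhat_i^2) / sum(y_i^2), where yhat_i are the fitted values.
def r2_through_origin(fitted, observed):
    return sum(v * v for v in fitted) / sum(v * v for v in observed)

# A perfect fit through the origin gives R^2 = 1.0
r2 = r2_through_origin([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

Note there is no centering term: unlike the ordinary R^2, the sums are not taken around the mean, which is what makes this definition appropriate when no intercept is fit.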
[jira] [Assigned] (SPARK-12466) Harmless Master NPE in tests
[ https://issues.apache.org/jira/browse/SPARK-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12466: Assignee: Andrew Or (was: Apache Spark) > Harmless Master NPE in tests > > > Key: SPARK-12466 > URL: https://issues.apache.org/jira/browse/SPARK-12466 > Project: Spark > Issue Type: Bug > Components: Deploy, Tests >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 1.6.1, 2.0.0 > > > {code} > [info] ReplayListenerSuite: > [info] - Simple replay (58 milliseconds) > java.lang.NullPointerException > at > org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982) > at > org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980) > at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) > at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > at scala.concurrent.Promise$class.complete(Promise.scala:55) > at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > [info] - End-to-end replay (10 seconds, 755 milliseconds) > {code} > 
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull > caused by https://github.com/apache/spark/pull/10284 > Thanks to [~ted_yu] for reporting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12321) JSON format for logical/physical execution plans
[ https://issues.apache.org/jira/browse/SPARK-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-12321: - Assignee: Wenchen Fan > JSON format for logical/physical execution plans > > > Key: SPARK-12321 > URL: https://issues.apache.org/jira/browse/SPARK-12321 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12465: Assignee: (was: Apache Spark) > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12469) Consistent Accumulators for Spark
holdenk created SPARK-12469: --- Summary: Consistent Accumulators for Spark Key: SPARK-12469 URL: https://issues.apache.org/jira/browse/SPARK-12469 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: holdenk Tasks executed on Spark workers are unable to modify values from the driver, and accumulators are the one exception for this. Accumulators in Spark are implemented in such a way that when a stage is recomputed (say for cache eviction) the accumulator will be updated a second time. This makes accumulators inside of transformations more difficult to use for things like counting invalid records (one of the primary potential use cases of collecting side information during a transformation). However in some cases this counting during re-evaluation is exactly the behaviour we want (say in tracking total execution time for a particular function). Spark would benefit from a version of accumulators which did not double count even if stages were re-executed. Motivating example: {code} val parseTime = sc.accumulator(0L) val parseFailures = sc.accumulator(0L) val parsedData = sc.textFile(...).flatMap { line => val start = System.currentTimeMillis() val parsed = Try(parse(line)) if (parsed.isFailure) parseFailures += 1 parseTime += System.currentTimeMillis() - start parsed.toOption } parsedData.cache() val resultA = parsedData.map(...).filter(...).count() // some intervening code. Almost anything could happen here -- some of parsedData may // get kicked out of the cache, or an executor where data was cached might get lost val resultB = parsedData.filter(...).map(...).flatMap(...).count() // now we look at the accumulators {code} Here we would want parseFailures to only have been added to once for every line which failed to parse. Unfortunately, the current Spark accumulator API doesn’t support the current parseFailures use case since if some data had been evicted its possible that it will be double counted. 
See the full design document at https://docs.google.com/document/d/1lR_l1g3zMVctZXrcVjFusq2iQVpr4XvRK_UUDsDr6nk/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
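[Editor's note] The double-counting described in SPARK-12469 can be reproduced without Spark: any side-effecting counter inside a transformation is re-applied whenever the stage is re-evaluated. A minimal sketch in plain Python (names are illustrative, not Spark APIs; the second call stands in for a recompute after cache eviction):

```python
# A counter updated inside a "transformation", like the parseFailures
# accumulator in the motivating example above.
parse_failures = 0

def parse_stage(lines):
    global parse_failures
    out = []
    for line in lines:
        try:
            out.append(int(line))
        except ValueError:
            parse_failures += 1  # this side effect repeats on every re-evaluation
    return out

data = ["1", "oops", "3"]
parse_stage(data)  # first evaluation: parse_failures is 1
parse_stage(data)  # a recompute: now 2, even though only one line is invalid
```

A "consistent" accumulator, as proposed, would have to deduplicate updates per (stage, partition) so the second evaluation contributes nothing.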
[jira] [Commented] (SPARK-12451) Regexp functions don't support patterns containing '*/'
[ https://issues.apache.org/jira/browse/SPARK-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066840#comment-15066840 ] Xiao Li commented on SPARK-12451: - This is a duplicate of https://issues.apache.org/jira/browse/SPARK-11352 The problem has been resolved. You can get the fix in 1.5.3 and 1.6 Thanks! > Regexp functions don't support patterns containing '*/' > --- > > Key: SPARK-12451 > URL: https://issues.apache.org/jira/browse/SPARK-12451 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: William Dee > > When using the regexp functions in Spark SQL, patterns containing '*/' create > runtime errors in the auto generated code. This is due to the fact that the > code generator creates a multiline comment containing, amongst other things, > the pattern. > Here is an excerpt from my stacktrace to illustrate: (Helpfully, the stack > trace includes all of the auto-generated code) > {code} > Caused by: org.codehaus.commons.compiler.CompileException: Line 232, Column > 54: Unexpected token "," in primary > at org.codehaus.janino.Parser.compileException(Parser.java:3125) > at org.codehaus.janino.Parser.parsePrimary(Parser.java:2512) > at org.codehaus.janino.Parser.parseUnaryExpression(Parser.java:2252) > at > org.codehaus.janino.Parser.parseMultiplicativeExpression(Parser.java:2211) > at org.codehaus.janino.Parser.parseAdditiveExpression(Parser.java:2190) > at org.codehaus.janino.Parser.parseShiftExpression(Parser.java:2169) > at > org.codehaus.janino.Parser.parseRelationalExpression(Parser.java:2072) > at org.codehaus.janino.Parser.parseEqualityExpression(Parser.java:2046) > at org.codehaus.janino.Parser.parseAndExpression(Parser.java:2025) > at > org.codehaus.janino.Parser.parseExclusiveOrExpression(Parser.java:2004) > at > org.codehaus.janino.Parser.parseInclusiveOrExpression(Parser.java:1983) > at > org.codehaus.janino.Parser.parseConditionalAndExpression(Parser.java:1962) > at > 
org.codehaus.janino.Parser.parseConditionalOrExpression(Parser.java:1941) > at > org.codehaus.janino.Parser.parseConditionalExpression(Parser.java:1922) > at > org.codehaus.janino.Parser.parseAssignmentExpression(Parser.java:1901) > at org.codehaus.janino.Parser.parseExpression(Parser.java:1886) > at org.codehaus.janino.Parser.parseBlockStatement(Parser.java:1149) > at org.codehaus.janino.Parser.parseBlockStatements(Parser.java:1085) > at > org.codehaus.janino.Parser.parseMethodDeclarationRest(Parser.java:938) > at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:620) > at org.codehaus.janino.Parser.parseClassBody(Parser.java:515) > at org.codehaus.janino.Parser.parseClassDeclarationRest(Parser.java:481) > at org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:577) > at > org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:192) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:84) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:77) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:387) > ... line 232 ... > /* regexp_replace(input[46, StringType],^.*/,) */ > > /* input[46, StringType] */ > > boolean isNull31 = i.isNullAt(46); > UTF8String primitive32 = isNull31 ? null : (i.getUTF8String(46)); > > boolean isNull24 = true; > UTF8String primitive25 = null; > if (!isNull31) { > /* ^.*/ */ > > /* expression: ^.*/ */ > Object obj35 = expressions[4].eval(i); > boolean isNull33 = obj35 == null; > UTF8String primitive34 = null; > if (!isNull33) { > primitive34 = (UTF8String) obj35; > } > ... 
> {code} > Note the multiple multiline comments, these obviously break when the regex > pattern contains the end-of-comment token '*/' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
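[Editor's note] The failure mode in SPARK-12451 is easy to see in isolation. A minimal sketch in plain Python of what the generated Java source looks like (illustrative only; `gen_comment` is a made-up stand-in, not the actual Spark code generator):

```python
# The code generator embeds the regex pattern inside a /* ... */ Java comment.
# Any pattern containing "*/" terminates that comment early, leaving the rest
# of the pattern behind as stray tokens that the Java parser then rejects.
def gen_comment(pattern):
    return "/* regexp_replace(input, %s) */" % pattern

generated = gen_comment("^.*/")
# The first "*/" now sits inside the pattern, not at the comment's end.
first_close = generated.index("*/")
intended_close = len(generated) - 2
```

This is why the stack trace above shows janino choking on an "Unexpected token" right where the pattern's `*/` prematurely closed the comment.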
[jira] [Assigned] (SPARK-12455) Add ExpressionDescription to window functions
[ https://issues.apache.org/jira/browse/SPARK-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12455: Assignee: Herman van Hovell (was: Apache Spark) > Add ExpressionDescription to window functions > - > > Key: SPARK-12455 > URL: https://issues.apache.org/jira/browse/SPARK-12455 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Herman van Hovell > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12466) Harmless Master NPE in tests
Andrew Or created SPARK-12466: - Summary: Harmless Master NPE in tests Key: SPARK-12466 URL: https://issues.apache.org/jira/browse/SPARK-12466 Project: Spark Issue Type: Bug Components: Deploy, Tests Affects Versions: 1.6.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.6.1, 2.0.0 {code} [info] ReplayListenerSuite: [info] - Simple replay (58 milliseconds) java.lang.NullPointerException at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982) at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) at scala.concurrent.Promise$class.complete(Promise.scala:55) at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) [info] - End-to-end replay (10 seconds, 755 milliseconds) {code} https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull caused by https://github.com/apache/spark/pull/10284 Thanks to [~ted_yu] for reporting. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12457) Add ExpressionDescription to collection functions
[ https://issues.apache.org/jira/browse/SPARK-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067022#comment-15067022 ] Apache Spark commented on SPARK-12457: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/10418 > Add ExpressionDescription to collection functions > - > > Key: SPARK-12457 > URL: https://issues.apache.org/jira/browse/SPARK-12457 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12247: Assignee: (was: Apache Spark) > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12247: Assignee: Apache Spark > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter >Assignee: Apache Spark > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
[ https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12468: Assignee: Apache Spark > getParamMap in Pyspark ML API returns empty dictionary in example for > Documentation > --- > > Key: SPARK-12468 > URL: https://issues.apache.org/jira/browse/SPARK-12468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Zachary Brown >Assignee: Apache Spark >Priority: Minor > > The `extractParamMap()` method for a model that has been fit returns an empty > dictionary, e.g. (from the [Pyspark ML API > Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)): > ```python > from pyspark.mllib.linalg import Vectors > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.param import Param, Params > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique IDs > for this > # LogisticRegression instance. 
> print "Model 1 was fit using parameters: " > print model1.extractParamMap() > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5882) Add a test for GraphLoader.edgeListFile
[ https://issues.apache.org/jira/browse/SPARK-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-5882. -- Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Add a test for GraphLoader.edgeListFile > --- > > Key: SPARK-5882 > URL: https://issues.apache.org/jira/browse/SPARK-5882 > Project: Spark > Issue Type: Test > Components: GraphX >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Trivial > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12470) Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pete Robbins updated SPARK-12470: - Component/s: SQL Summary: Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner (was: Incorrect calculation of row size in o.a.s.catalyst.expressions.codegen.GenerateUnsafeRowJoiner) > Incorrect calculation of row size in > o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner > --- > > Key: SPARK-12470 > URL: https://issues.apache.org/jira/browse/SPARK-12470 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Pete Robbins >Priority: Minor > > While looking into https://issues.apache.org/jira/browse/SPARK-12319 I > noticed that the row size is incorrectly calculated. > The "sizeReduction" value is calculated in words: >// The number of words we can reduce when we concat two rows together. > // The only reduction comes from merging the bitset portion of the two > rows, saving 1 word. > val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords > but then it is subtracted from the size of the row in bytes: >|out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - > $sizeReduction); > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
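[Editor's note] The units mismatch reported in SPARK-12470 can be illustrated numerically. A minimal sketch in plain Python (field counts and the row size are made up; the `* 8` conversion is the presumed fix, since the saving is one 8-byte word, not one byte):

```python
# UnsafeRow null-tracking bitsets are sized in 8-byte words: one per 64 fields.
def bitset_words(num_fields):
    return (num_fields + 63) // 64

bitset1_words = bitset_words(5)            # 1 word for row 1's bitset
bitset2_words = bitset_words(3)            # 1 word for row 2's bitset
output_bitset_words = bitset_words(5 + 3)  # the joined row still needs only 1 word

# The saving from merging the two bitsets, measured in WORDS
size_reduction = bitset1_words + bitset2_words - output_bitset_words  # = 1

size_in_bytes = 80
buggy = size_in_bytes - size_reduction      # subtracts 1 byte, as the issue describes
fixed = size_in_bytes - size_reduction * 8  # subtracts the full 8-byte word
```

The buggy form leaves the joined row 7 bytes too large per saved word, which matches the symptom of a row size that disagrees with the actual layout.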
[jira] [Commented] (SPARK-11823) HiveThriftBinaryServerSuite tests timing out, leaves hanging processes
[ https://issues.apache.org/jira/browse/SPARK-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067314#comment-15067314 ] Josh Rosen commented on SPARK-11823: It looks like this has caused a huge number of timeouts in the Master Maven Hadoop 2.4 builds this week: https://spark-tests.appspot.com/jobs/Spark-Master-Maven-with-YARN%20%C2%BB%20hadoop-2.4%2Cspark-test I'm going to pull some logs and take a look. > HiveThriftBinaryServerSuite tests timing out, leaves hanging processes > -- > > Key: SPARK-11823 > URL: https://issues.apache.org/jira/browse/SPARK-11823 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: shane knapp > Attachments: > spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-05.out, > stack.log > > > i've noticed on a few branches that the HiveThriftBinaryServerSuite tests > time out, and when that happens, the build is aborted but the tests leave > behind hanging processes that eat up cpu and ram. > most recently, i discovered this happening w/the 1.6 SBT build, specifically > w/the hadoop 2.0 profile: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/56/console > [~vanzin] grabbed the jstack log, which i've attached to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12414) Remove closure serializer
[ https://issues.apache.org/jira/browse/SPARK-12414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12414: -- Issue Type: Sub-task (was: Bug) Parent: SPARK-11806 > Remove closure serializer > - > > Key: SPARK-12414 > URL: https://issues.apache.org/jira/browse/SPARK-12414 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > There is a config `spark.closure.serializer` that accepts exactly one value: > the java serializer. This is because there are currently bugs in the Kryo > serializer that make it not a viable candidate. This was uncovered by an > unsuccessful attempt to make it work: SPARK-7708. > My high level point is that the Java serializer has worked well for at least > 6 Spark versions now, and it is an incredibly complicated task to get other > serializers (not just Kryo) to work with Spark's closures. IMO the effort is > not worth it and we should just remove this documentation and all the code > associated with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12374) Improve performance of Range APIs via adding logical/physical operators
[ https://issues.apache.org/jira/browse/SPARK-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12374. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10335 [https://github.com/apache/spark/pull/10335] > Improve performance of Range APIs via adding logical/physical operators > --- > > Key: SPARK-12374 > URL: https://issues.apache.org/jira/browse/SPARK-12374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Critical > Fix For: 2.0.0 > > > Creating an actual logical/physical operator for range for matching the > performance of RDD Range APIs. > Compared with the old Range API, the new version is 3 times faster than the > old version. > {code} > scala> val startTime = System.currentTimeMillis; sqlContext.oldRange(0, > 10, 1, 15).count(); val endTime = System.currentTimeMillis; val start > = new Timestamp(startTime); val end = new Timestamp(endTime); val elapsed = > (endTime - startTime)/ 1000.0 > startTime: Long = 1450416394240 > > endTime: Long = 1450416421199 > start: java.sql.Timestamp = 2015-12-17 21:26:34.24 > end: java.sql.Timestamp = 2015-12-17 21:27:01.199 > elapsed: Double = 26.959 > {code} > {code} > scala> val startTime = System.currentTimeMillis; sqlContext.range(0, > 10, 1, 15).count(); val endTime = System.currentTimeMillis; val start > = new Timestamp(startTime); val end = new Timestamp(endTime); val elapsed = > (endTime - startTime)/ 1000.0 > startTime: Long = 1450416360107 > > endTime: Long = 1450416368590 > start: java.sql.Timestamp = 2015-12-17 21:26:00.107 > end: java.sql.Timestamp = 2015-12-17 21:26:08.59 > elapsed: Double = 8.483 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12150) numPartitions argument to sqlContext.range() should be optional
[ https://issues.apache.org/jira/browse/SPARK-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12150. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10335 [https://github.com/apache/spark/pull/10335] > numPartitions argument to sqlContext.range() should be optional > > > Key: SPARK-12150 > URL: https://issues.apache.org/jira/browse/SPARK-12150 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Henri DF >Priority: Minor > Fix For: 2.0.0 > > > It's a little inconsistent that the first two sqlContext.range() methods > don't take a numPartitions arg, while the third one does. > And more importantly, it's a little inconvenient that the numPartitions arg > is mandatory for the third range() method - it means that if you want to > specify a step, you suddenly have to think about partitioning - an orthogonal > concern. > My suggestion would be to make numPartitions optional, like it is on the > sparkContext.range(..). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
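One way to get the suggested shape, sketched below with a hypothetical `range`/`RangeSpec` rather than the real SQLContext API, is a Scala default argument for numPartitions so a caller can pass a step without also choosing a partition count:

```scala
// Hypothetical stand-in for the result of range(); not actual Spark code.
final case class RangeSpec(start: Long, end: Long, step: Long, numPartitions: Int)

// numPartitions defaults like sparkContext.range(..) does with numSlices,
// so specifying a step no longer forces the caller to think about partitioning.
def range(start: Long,
          end: Long,
          step: Long = 1L,
          numPartitions: Int = 8 /* stand-in for the default parallelism */): RangeSpec =
  RangeSpec(start, end, step, numPartitions)
```

Callers can then write `range(0, 100, 5)` and still override partitioning explicitly with `range(0, 100, 5, 16)` when it matters.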
[jira] [Assigned] (SPARK-12363) PowerIterationClustering test case failed if we deprecated KMeans.setRuns
[ https://issues.apache.org/jira/browse/SPARK-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12363: Assignee: Apache Spark > PowerIterationClustering test case failed if we deprecated KMeans.setRuns > - > > Key: SPARK-12363 > URL: https://issues.apache.org/jira/browse/SPARK-12363 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > We plan to deprecate `runs` of KMeans; PowerIterationClustering will > leverage KMeans to train its model. > I removed `setRuns` used in PowerIterationClustering, but one of the test > cases failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-11823) HiveThriftBinaryServerSuite tests timing out, leaves hanging processes
[ https://issues.apache.org/jira/browse/SPARK-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067345#comment-15067345 ] Josh Rosen commented on SPARK-11823: I think I spotted the problem; it may be a bad use of Thread.sleep() in a test: https://github.com/apache/spark/pull/6207/files#r30935200 > HiveThriftBinaryServerSuite tests timing out, leaves hanging processes > -- > > Key: SPARK-11823 > URL: https://issues.apache.org/jira/browse/SPARK-11823 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: shane knapp >Assignee: Josh Rosen > Attachments: > spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-05.out, > stack.log > > > i've noticed on a few branches that the HiveThriftBinaryServerSuite tests > time out, and when that happens, the build is aborted but the tests leave > behind hanging processes that eat up cpu and ram. > most recently, i discovered this happening w/the 1.6 SBT build, specifically > w/the hadoop 2.0 profile: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/56/console > [~vanzin] grabbed the jstack log, which i've attached to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
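A common remedy for fixed sleeps in tests is to poll with a deadline instead. The helper below is a generic sketch of that pattern (not the actual suite code, which could equally use ScalaTest's `eventually`): it retries the body until it stops throwing or the deadline passes, so the test neither hangs nor sleeps longer than needed.

```scala
// Generic "eventually" helper: retry a block until it succeeds or a
// deadline expires, instead of a single fixed Thread.sleep.
def eventually[T](timeoutMs: Long, intervalMs: Long = 50L)(body: => T): T = {
  val deadline = System.currentTimeMillis + timeoutMs
  var lastError: Throwable = null
  while (System.currentTimeMillis < deadline) {
    try {
      return body // success: stop polling immediately
    } catch {
      case e: Throwable =>
        lastError = e            // remember why we are still waiting
        Thread.sleep(intervalMs) // brief pause before the next attempt
    }
  }
  // Fail fast with the last observed error instead of hanging the build.
  throw new RuntimeException(s"condition not met within ${timeoutMs}ms", lastError)
}
```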
[jira] [Updated] (SPARK-12473) Reuse serializer instances for performance
[ https://issues.apache.org/jira/browse/SPARK-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12473: -- Description: After commit de02782, the performance of page rank regressed from 242s to 260s, about 7%. Although currently it's only 7%, we will likely register more classes in the future so this will only increase. The commit added 26 types to register every time we create a Kryo serializer instance. I ran a small microbenchmark to prove that this is noticeably expensive: {code} import org.apache.spark.serializer._ import org.apache.spark.SparkConf def makeMany(num: Int): Long = { val start = System.currentTimeMillis (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } System.currentTimeMillis - start } // before commit de02782, averaged over multiple runs makeMany(5000) == 1500 // after commit de02782, averaged over multiple runs makeMany(5000) == 2750 {code} Since we create multiple serializer instances per partition, this means a 5000-partition stage will unconditionally see an increase of > 1s for the stage. In page rank, we may run many such stages. We should explore the alternative of reusing thread-local serializer instances, which would lead to much fewer calls to `kryo.register`. was: After commit de02782, the performance of page rank regressed from 242s to 260s, about 7%. Although currently it's only 7%, we will likely register more classes in the future so we should do this the right way. The commit added 26 types to register every time we create a Kryo serializer instance. 
I ran a small microbenchmark to prove that this is noticeably expensive: {code} import org.apache.spark.serializer._ import org.apache.spark.SparkConf def makeMany(num: Int): Long = { val start = System.currentTimeMillis (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } System.currentTimeMillis - start } // before commit de02782, averaged over multiple runs makeMany(5000) == 1500 // after commit de02782, averaged over multiple runs makeMany(5000) == 2750 {code} Since we create multiple serializer instances per partition, this means a 5000-partition stage will unconditionally see an increase of > 1s for the stage. In page rank, we may run many such stages. We should explore the alternative of reusing thread-local serializer instances, which would lead to much fewer calls to `kryo.register`. > Reuse serializer instances for performance > -- > > Key: SPARK-12473 > URL: https://issues.apache.org/jira/browse/SPARK-12473 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Andrew Or > > After commit de02782, the performance of page rank regressed from 242s to 260s, about 7%. > Although currently it's only 7%, we will likely register more classes in the > future so this will only increase. > The commit added 26 types to register every time we create a Kryo serializer > instance. 
I ran a small microbenchmark to prove that this is noticeably > expensive: > {code} > import org.apache.spark.serializer._ > import org.apache.spark.SparkConf > def makeMany(num: Int): Long = { > val start = System.currentTimeMillis > (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } > System.currentTimeMillis - start > } > // before commit de02782, averaged over multiple runs > makeMany(5000) == 1500 > // after commit de02782, averaged over multiple runs > makeMany(5000) == 2750 > {code} > Since we create multiple serializer instances per partition, this means a > 5000-partition stage will unconditionally see an increase of > 1s for the > stage. In page rank, we may run many such stages. > We should explore the alternative of reusing thread-local serializer > instances, which would lead to much fewer calls to `kryo.register`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
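The proposed reuse could look roughly like the sketch below. `ExpensiveSerializer` is a stand-in for `KryoSerializer.newKryo()` (whose cost here is the repeated `kryo.register` calls); the real change would also have to confirm that Kryo instances are safe to reuse within a thread.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Counts constructions; each one stands in for re-running all 26
// kryo.register calls when a fresh serializer instance is created.
val constructions = new AtomicInteger(0)

class ExpensiveSerializer {
  constructions.incrementAndGet()
}

// One instance per thread, created lazily on first use and then reused,
// so repeated lookups on the same thread pay the registration cost once.
val cachedSerializer = new ThreadLocal[ExpensiveSerializer] {
  override def initialValue(): ExpensiveSerializer = new ExpensiveSerializer
}

def serializerForThisThread(): ExpensiveSerializer = cachedSerializer.get()
```

Repeated calls on the same thread return the same instance, so a 5000-partition stage would pay the registration cost once per executor thread rather than once per partition.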
[jira] [Updated] (SPARK-12440) Avoid setCheckpointDir warning when filesystem is not local
[ https://issues.apache.org/jira/browse/SPARK-12440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12440: -- Priority: Trivial (was: Major) > Avoid setCheckpointDir warning when filesystem is not local > --- > > Key: SPARK-12440 > URL: https://issues.apache.org/jira/browse/SPARK-12440 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2, 1.6.0, 1.6.1 >Reporter: Pierre Borckmans >Priority: Trivial > > In SparkContext method `setCheckpointDir`, a warning is issued when spark > master is not local and the passed directory for the checkpoint dir appears > to be local. > In practice, when relying on hdfs configuration file and using relative path > (incomplete URI without hdfs scheme, ...), this warning should not be issued > and might be confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
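A scheme-aware check along these lines would avoid the spurious warning. This sketch uses `java.net.URI` to stay self-contained; the real SparkContext code would resolve the path through Hadoop's FileSystem, and `shouldWarn` is a hypothetical helper, not the actual method.

```scala
import java.net.URI

// Only warn when the checkpoint dir EXPLICITLY names the local filesystem
// while the master is non-local.
def shouldWarn(master: String, checkpointDir: String): Boolean = {
  val scheme = Option(new URI(checkpointDir).getScheme)
  val isLocalMaster = master.startsWith("local")
  // A bare relative path like "checkpoints" has no scheme: it resolves
  // against the default filesystem from the Hadoop configuration (often
  // HDFS), so we cannot assume it is local and should stay quiet.
  !isLocalMaster && scheme.contains("file")
}
```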
[jira] [Updated] (SPARK-12440) Avoid setCheckpointDir warning when filesystem is not local
[ https://issues.apache.org/jira/browse/SPARK-12440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12440: -- Summary: Avoid setCheckpointDir warning when filesystem is not local (was: [CORE] Avoid setCheckpointDir warning when filesystem is not local) > Avoid setCheckpointDir warning when filesystem is not local > --- > > Key: SPARK-12440 > URL: https://issues.apache.org/jira/browse/SPARK-12440 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2, 1.6.0, 1.6.1 >Reporter: Pierre Borckmans > > In SparkContext method `setCheckpointDir`, a warning is issued when spark > master is not local and the passed directory for the checkpoint dir appears > to be local. > In practice, when relying on hdfs configuration file and using relative path > (incomplete URI without hdfs scheme, ...), this warning should not be issued > and might be confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067125#comment-15067125 ] Martin Schade commented on SPARK-12453: --- Makes sense, thank you. Ideally it should be 1.9.37 instead of 1.9.40, though. Both KCL 1.4.0 and KPL 0.10.1 reference 1.9.37: https://github.com/awslabs/amazon-kinesis-producer/blob/v0.10.1/java/amazon-kinesis-producer/pom.xml https://github.com/awslabs/amazon-kinesis-client/blob/v1.4.0/pom.xml In the latest version of KPL (v0.10.2), 1.10.34 is referenced, and in the latest KCL (1.6.1) it is version 1.10.20, so the versions are not easy to keep in sync. It would take some testing to find which combination actually works. > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to wrong AWS > Java SDK version (1.9.16) referenced with the AWS KCL version (1.3.0). > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 does fail to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a Spark > related implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12440) [CORE] Avoid setCheckpointDir warning when filesystem is not local
[ https://issues.apache.org/jira/browse/SPARK-12440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12440: -- Component/s: Spark Core > [CORE] Avoid setCheckpointDir warning when filesystem is not local > -- > > Key: SPARK-12440 > URL: https://issues.apache.org/jira/browse/SPARK-12440 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2, 1.6.0, 1.6.1 >Reporter: Pierre Borckmans > > In SparkContext method `setCheckpointDir`, a warning is issued when spark > master is not local and the passed directory for the checkpoint dir appears > to be local. > In practice, when relying on hdfs configuration file and using relative path > (incomplete URI without hdfs scheme, ...), this warning should not be issued > and might be confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12339) NullPointerException on stage kill from web UI
[ https://issues.apache.org/jira/browse/SPARK-12339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067154#comment-15067154 ] Andrew Or commented on SPARK-12339: --- I've updated the affected version to 2.0 since SPARK-11206 was merged only there. Please let me know if this is not the case. > NullPointerException on stage kill from web UI > -- > > Key: SPARK-12339 > URL: https://issues.apache.org/jira/browse/SPARK-12339 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Alex Bozarth > Fix For: 2.0.0 > > > The following message is in the logs after killing a stage: > {code} > scala> INFO Executor: Executor killed task 1.0 in stage 7.0 (TID 33) > INFO Executor: Executor killed task 0.0 in stage 7.0 (TID 32) > WARN TaskSetManager: Lost task 1.0 in stage 7.0 (TID 33, localhost): > TaskKilled (killed intentionally) > WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 32, localhost): > TaskKilled (killed intentionally) > INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, > from pool > ERROR LiveListenerBus: Listener SQLListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) > ERROR LiveListenerBus: Listener SQLListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169) > at > 
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) > {code} > To reproduce, start a job and kill the stage from web UI, e.g.: > {code} > val rdd = sc.parallelize(0 to 9, 2) > rdd.mapPartitionsWithIndex { case (n, it) => Thread.sleep(10 * 1000); it > }.count > {code} > Go to web UI and in Stages tab click "kill" for the stage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067156#comment-15067156 ] Sean Owen commented on SPARK-12453: --- Ah, I misread the PR; it already just removes aws.java.sdk.version and the manual management of the dependency. Just deleting the version and the dependencyManagement entry does the trick, right? > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to wrong AWS > Java SDK version (1.9.16) referenced with the AWS KCL version (1.3.0). > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 does fail to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a Spark > related implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12339) NullPointerException on stage kill from web UI
[ https://issues.apache.org/jira/browse/SPARK-12339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12339: -- Affects Version/s: (was: 1.6.0) 2.0.0 > NullPointerException on stage kill from web UI > -- > > Key: SPARK-12339 > URL: https://issues.apache.org/jira/browse/SPARK-12339 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Alex Bozarth > Fix For: 2.0.0 > > > The following message is in the logs after killing a stage: > {code} > scala> INFO Executor: Executor killed task 1.0 in stage 7.0 (TID 33) > INFO Executor: Executor killed task 0.0 in stage 7.0 (TID 32) > WARN TaskSetManager: Lost task 1.0 in stage 7.0 (TID 33, localhost): > TaskKilled (killed intentionally) > WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 32, localhost): > TaskKilled (killed intentionally) > INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, > from pool > ERROR LiveListenerBus: Listener SQLListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) > ERROR LiveListenerBus: Listener SQLListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) > {code} > To reproduce, start a job and kill the stage from web UI, e.g.: > 
{code} > val rdd = sc.parallelize(0 to 9, 2) > rdd.mapPartitionsWithIndex { case (n, it) => Thread.sleep(10 * 1000); it > }.count > {code} > Go to web UI and in Stages tab click "kill" for the stage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12392) Optimize a location order of broadcast blocks by considering preferred local hosts
[ https://issues.apache.org/jira/browse/SPARK-12392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-12392. --- Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Optimize a location order of broadcast blocks by considering preferred local > hosts > -- > > Key: SPARK-12392 > URL: https://issues.apache.org/jira/browse/SPARK-12392 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro > Fix For: 2.0.0 > > > When multiple workers exist in a host, we can bypass unnecessary remote > access for broadcasts; block managers fetch broadcast blocks from the same > host instead of remote hosts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
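The optimization amounts to ordering candidate locations so that same-host replicas come first. A minimal sketch, with `BlockLocation` and `preferLocalHost` as hypothetical stand-ins for the BlockManager internals:

```scala
// Hypothetical model of a block replica's location.
case class BlockLocation(host: String, executorId: String)

// Put locations on the fetching executor's own host first, so block
// managers try same-host peers before any remote host.
def preferLocalHost(locations: Seq[BlockLocation], localHost: String): Seq[BlockLocation] = {
  val (local, remote) = locations.partition(_.host == localHost)
  local ++ remote
}
```

`partition` keeps the relative order within each group, so this only promotes same-host replicas without otherwise reshuffling the fetch order.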
[jira] [Assigned] (SPARK-12456) Add ExpressionDescription to misc functions
[ https://issues.apache.org/jira/browse/SPARK-12456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12456: Assignee: Apache Spark > Add ExpressionDescription to misc functions > --- > > Key: SPARK-12456 > URL: https://issues.apache.org/jira/browse/SPARK-12456 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12456) Add ExpressionDescription to misc functions
[ https://issues.apache.org/jira/browse/SPARK-12456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12456: Assignee: (was: Apache Spark) > Add ExpressionDescription to misc functions > --- > > Key: SPARK-12456 > URL: https://issues.apache.org/jira/browse/SPARK-12456 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12473) Reuse serializer instances for performance
[ https://issues.apache.org/jira/browse/SPARK-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12473: -- Description: After commit de02782, the performance of page rank regressed from 242s to 260s, about 7%. Although currently it's only 7%, we will likely register more classes in the future so we should do this the right way. The commit added 26 types to register every time we create a Kryo serializer instance. I ran a small microbenchmark to prove that this is noticeably expensive: {code} import org.apache.spark.serializer._ import org.apache.spark.SparkConf def makeMany(num: Int): Long = { val start = System.currentTimeMillis (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } System.currentTimeMillis - start } // before commit de02782, averaged over multiple runs makeMany(5000) == 1500 // after commit de02782, averaged over multiple runs makeMany(5000) == 2750 {code} Since we create multiple serializer instances per partition, this means a 5000-partition stage will unconditionally see an increase of > 1s for the stage. In page rank, we may run many such stages. We should explore the alternative of reusing thread-local serializer instances, which would lead to much fewer calls to `kryo.register`. was: After commit de02782, the performance of page rank regressed from 242s to 260s, about 7%. The commit added 26 types to register every time we create a Kryo serializer instance. 
I ran a small microbenchmark to prove that this is noticeably expensive: {code} import org.apache.spark.serializer._ import org.apache.spark.SparkConf def makeMany(num: Int): Long = { val start = System.currentTimeMillis (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } System.currentTimeMillis - start } // before commit de02782, averaged over multiple runs makeMany(5000) == 1500 // after commit de02782, averaged over multiple runs makeMany(5000) == 2750 {code} Since we create multiple serializer instances per partition, this means a 5000-partition stage will unconditionally see an increase of > 1s for the stage. In page rank, we may run many such stages. We should explore the alternative of reusing thread-local serializer instances, which would lead to much fewer calls to `kryo.register`. > Reuse serializer instances for performance > -- > > Key: SPARK-12473 > URL: https://issues.apache.org/jira/browse/SPARK-12473 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Andrew Or > > After commit de02782 of page rank regressed from 242s to 260s, about 7%. > Although currently it's only 7%, we will likely register more classes in the > future so we should do this the right way. > The commit added 26 types to register every time we create a Kryo serializer > instance. 
I ran a small microbenchmark to prove that this is noticeably > expensive: > {code} > import org.apache.spark.serializer._ > import org.apache.spark.SparkConf > def makeMany(num: Int): Long = { > val start = System.currentTimeMillis > (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } > System.currentTimeMillis - start > } > // before commit de02782, averaged over multiple runs > makeMany(5000) == 1500 > // after commit de02782, averaged over multiple runs > makeMany(5000) == 2750 > {code} > Since we create multiple serializer instances per partition, this means a > 5000-partition stage will unconditionally see an increase of > 1s for the > stage. In page rank, we may run many such stages. > We should explore the alternative of reusing thread-local serializer > instances, which would lead to much fewer calls to `kryo.register`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
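The thread-local reuse idea proposed above can be sketched roughly as follows. This is not Spark's actual implementation; `ExpensiveSerializer` and `SerializerCache` are hypothetical stand-ins for `KryoSerializer.newKryo()` (which needs Spark on the classpath), used only to show the caching pattern:

```scala
// Sketch of reusing one serializer instance per thread (hypothetical names).
// Constructing a serializer is expensive because it registers many classes,
// so each thread builds one lazily and reuses it across tasks.

// Stand-in for a Kryo serializer that registers ~26 types on construction.
class ExpensiveSerializer {
  // Simulate the costly per-instance registration work done once.
  val registered: Seq[String] = (1 to 26).map(i => s"class$i")
  def serialize(obj: Any): String = obj.toString
}

object SerializerCache {
  // One serializer per thread: instances are never shared across threads,
  // so no locking is needed, and registration runs once per thread rather
  // than once per serializer construction.
  private val local = new ThreadLocal[ExpensiveSerializer] {
    override def initialValue(): ExpensiveSerializer = new ExpensiveSerializer
  }
  def get: ExpensiveSerializer = local.get()
}
```

Repeated calls to `SerializerCache.get` on the same thread return the same instance, which is what makes the registration cost amortize away.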
[jira] [Assigned] (SPARK-12471) Spark daemons should log their pid in the log file
[ https://issues.apache.org/jira/browse/SPARK-12471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12471: Assignee: (was: Apache Spark) > Spark daemons should log their pid in the log file > -- > > Key: SPARK-12471 > URL: https://issues.apache.org/jira/browse/SPARK-12471 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Nong Li > > This is useful when debugging from the log files without the processes > running. This information makes it possible to combine the log files with > other system information (e.g. dmesg output)
[jira] [Assigned] (SPARK-12471) Spark daemons should log their pid in the log file
[ https://issues.apache.org/jira/browse/SPARK-12471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12471: Assignee: Apache Spark > Spark daemons should log their pid in the log file > -- > > Key: SPARK-12471 > URL: https://issues.apache.org/jira/browse/SPARK-12471 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Nong Li >Assignee: Apache Spark > > This is useful when debugging from the log files without the processes > running. This information makes it possible to combine the log files with > other system information (e.g. dmesg output)
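A common pre-Java-9 way for a daemon to discover its own pid so it can log it at startup is parsing the runtime MXBean name, which is conventionally `"pid@hostname"`. This is a JVM convention rather than a guaranteed API, so the sketch below parses defensively; it is an illustration of the technique, not Spark's actual change:

```scala
import java.lang.management.ManagementFactory

// Obtain the current JVM's pid by parsing the RuntimeMXBean name,
// which on common JVMs has the form "pid@hostname". Because the
// format is only a convention, return None if it doesn't parse.
def currentPid(): Option[Long] = {
  val name = ManagementFactory.getRuntimeMXBean.getName
  name.split("@").headOption.flatMap { s =>
    try Some(s.toLong)
    catch { case _: NumberFormatException => None }
  }
}
```

A daemon would log the result once at startup, e.g. alongside its startup banner, so the pid survives in the log file after the process exits.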
[jira] [Commented] (SPARK-12458) Add ExpressionDescription to datetime functions
[ https://issues.apache.org/jira/browse/SPARK-12458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067278#comment-15067278 ] Dilip Biswal commented on SPARK-12458: -- I would like to work on this one. > Add ExpressionDescription to datetime functions > --- > > Key: SPARK-12458 > URL: https://issues.apache.org/jira/browse/SPARK-12458 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >
[jira] [Commented] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067115#comment-15067115 ] Sean Owen commented on SPARK-12453: --- OK, I see what happened here: https://github.com/apache/spark/commit/87f82a5fb9c4350a97c761411069245f07aad46f How about updating to 1.9.40 for consistency? Really, it sounds like there's no point manually setting the SDK version here -- how about preemptively bringing those parts of SPARK-12269 back? Then really it should go into master first, be backported, and then be further updated by 12269. This is why I view it as sort of a duplicate, since it could as well come from back-porting just a subset of 12269. I don't know if a new 1.5.x release will happen. > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to wrong AWS > Java SDK version (1.9.16) referenced with the AWS KCL version (1.3.0). > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 does fail to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a Spark > related implementation.
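For a user application hitting this mismatch, one workaround is to pin the SDK version alongside the KCL in the build. The sbt fragment below is a hypothetical sketch (coordinates and the 1.9.37 version are taken from the issue description above; verify against the KCL 1.3.0 POM):

```scala
// Hypothetical sbt fragment pinning the AWS Java SDK to the version
// that KCL 1.3.0 was built against, overriding the transitive 1.9.16.
libraryDependencies ++= Seq(
  "com.amazonaws" % "amazon-kinesis-client" % "1.3.0",
  "com.amazonaws" % "aws-java-sdk" % "1.9.37" // KCL 1.3.0 expects 1.9.37
)
```

Pinning both coordinates explicitly avoids depending on which transitive version the resolver happens to pick.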
[jira] [Updated] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12463: -- Component/s: Mesos > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode.