[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2015-10-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956367#comment-14956367
 ] 

Xiao Li commented on SPARK-10925:
-

I also hit the same problem, and I'm trying to narrow down the root cause inside 
the analyzer.

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply a UDF on column "name"
> # apply a UDF on column "surname"
> # apply a UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
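> A minimal sketch of the failing pattern (the column names, UDF, and source file 
> below are hypothetical; the attached TestCase2.scala is the authoritative 
> reproducer):
> {code}
> import org.apache.spark.sql.functions.{col, udf}
>
> // hypothetical cleaning UDF
> val clean = udf((s: String) => if (s == null) s else s.trim.toLowerCase)
>
> val df = sqlContext.read.json("people.json")   // hypothetical source
>   .withColumn("name_cleaned", clean(col("name")))
>   .withColumn("surname_cleaned", clean(col("surname")))
>   .withColumn("birthDate_cleaned", clean(col("birthDate")))
>
> // aggregate on name and re-join with the DataFrame
> val byName = df.groupBy("name_cleaned").count().withColumnRenamed("count", "name_count")
> val step1 = df.join(byName, "name_cleaned")
>
> // aggregate on surname and re-join again; this second self-join is where the
> // "resolved attribute(s) ... missing" AnalysisException appears on 1.5.x
> val bySurname = step1.groupBy("surname_cleaned").count().withColumnRenamed("count", "surname_count")
> val step2 = step1.join(bySurname, "surname_cleaned")
> {code}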
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
>   at TestCase2$.main(TestCase2.scala:51)
>   at TestCase2.main(TestCase2.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:4

[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-13 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956341#comment-14956341
 ] 

Sandy Ryza commented on SPARK-9999:
---

Thanks for the explanation [~rxin] and [~marmbrus].  I understand the problem 
and don't have any great ideas for an alternative workable solution.

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result it is harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. UDFs are 
> harder to use, and there is no strong typing in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
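> As a rough, purely illustrative sketch of the kind of usage being proposed (the 
> names and encoder mechanics here are hypothetical; the linked proposal below is 
> the authoritative outline):
> {code}
> // Illustrative sketch only, not the final API.
> case class Person(name: String, age: Int)
>
> import sqlContext.implicits._   // assumed mechanism for providing encoders
>
> // Typed, RDD-like transformations on domain objects...
> val people: Dataset[Person] = sqlContext.read.json("people.json").as[Person]
> val adults: Dataset[Person] = people.filter(_.age >= 18)
>
> // ...that still interoperate with DataFrames and the SQL optimizer.
> val df: DataFrame = adults.toDF()
> {code}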
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags, should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames without writing conversion 
> boilerplate.  When names used in the input schema line up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11095) Simplify Netty RPC implementation by using a separate thread pool for each endpoint

2015-10-13 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11095:
---

 Summary: Simplify Netty RPC implementation by using a separate 
thread pool for each endpoint
 Key: SPARK-11095
 URL: https://issues.apache.org/jira/browse/SPARK-11095
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Shixiong Zhu


The dispatcher class and the inbox class of the current Netty-based RPC 
implementation are fairly complicated. The implementation uses a single, shared 
thread pool to execute all the endpoints. This is similar to how Akka does actor 
message dispatching. The benefit of this design is that the RPC implementation 
can support a very large number of endpoints, as they are all multiplexed into a 
single thread pool for execution. The downside is the complexity resulting from 
synchronization and coordination.

An alternative implementation is to have a separate message queue and thread 
pool for each endpoint. The dispatcher simply routes the messages to the 
appropriate message queue, and the threads poll the queue for messages to 
process.

If the endpoint is single threaded, then the thread pool should contain only a 
single thread. If the endpoint supports concurrent execution, then the thread 
pool should contain more threads.

Two additional things we need to be careful with are:

1. An endpoint should only process normal messages after OnStart is called. 
This can be done by having the thread that starts the endpoint process OnStart 
first.

2. An endpoint should process OnStop after all normal messages have been 
processed. I think this can be done by having a busy loop to spin until the 
size of the message queue is 0.
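
A minimal sketch of this alternative (hypothetical class and method names, not 
the actual Spark code):

{code}
import java.util.concurrent.{ConcurrentHashMap, Executors, LinkedBlockingQueue}

sealed trait Message
case object OnStart extends Message
case object OnStop extends Message
case class Normal(body: Any) extends Message

// One queue plus one (here single-threaded) pool per endpoint. With a single
// consumer thread and FIFO ordering, posting OnStart first guarantees it is
// handled before any normal message, and OnStop posted last is handled only
// after the queue has drained.
class EndpointInbox(handle: Message => Unit) {
  private val queue = new LinkedBlockingQueue[Message]()
  private val pool = Executors.newSingleThreadExecutor()

  pool.execute(new Runnable {
    override def run(): Unit = {
      var running = true
      while (running) {
        queue.take() match {
          case OnStop => handle(OnStop); running = false
          case m      => handle(m)
        }
      }
    }
  })

  def post(m: Message): Unit = queue.put(m)
  def stop(): Unit = { queue.put(OnStop); pool.shutdown() }
}

// The dispatcher only routes: look up the endpoint's inbox and enqueue.
class Dispatcher {
  private val inboxes = new ConcurrentHashMap[String, EndpointInbox]()
  def register(name: String, inbox: EndpointInbox): Unit = {
    inbox.post(OnStart)          // the registering thread queues OnStart first
    inboxes.put(name, inbox)
  }
  def route(name: String, m: Message): Unit =
    Option(inboxes.get(name)).foreach(_.post(m))
}
{code}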







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10996) Implement sampleBy() in DataFrameStatFunctions

2015-10-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-10996.
---
   Resolution: Fixed
 Assignee: Sun Rui
Fix Version/s: 1.6.0

Resolved by https://github.com/apache/spark/pull/9023

> Implement sampleBy() in DataFrameStatFunctions
> --
>
> Key: SPARK-10996
> URL: https://issues.apache.org/jira/browse/SPARK-10996
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Sun Rui
>Assignee: Sun Rui
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10981) R semijoin leads to Java errors, R leftsemi leads to Spark errors

2015-10-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-10981:
--
Assignee: Monica Liu

> R semijoin leads to Java errors, R leftsemi leads to Spark errors
> -
>
> Key: SPARK-10981
> URL: https://issues.apache.org/jira/browse/SPARK-10981
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.5.0
> Environment: SparkR from RStudio on Macbook
>Reporter: Monica Liu
>Assignee: Monica Liu
>Priority: Minor
>  Labels: easyfix, newbie
> Fix For: 1.5.2, 1.6.0
>
>
> I am using SparkR from RStudio, and I ran into an error with the join 
> function that I recreated with a smaller example:
> {code:title=joinTest.R|borderStyle=solid}
> Sys.setenv(SPARK_HOME="/Users/liumo1/Applications/spark/")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> sc <- sparkR.init("local[4]")
> sqlContext <- sparkRSQL.init(sc) 
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)
> df1= createDataFrame(sqlContext, df)
> showDF(df1)
> x = c(2, 3, 10)
> t = c("dd", "ee", "ff")
> c = c(FALSE, FALSE, TRUE)
> dff = data.frame(x, t, c)
> df2 = createDataFrame(sqlContext, dff)
> showDF(df2)
> res = join(df1, df2, df1$n == df2$x, "semijoin")
> showDF(res)
> {code}
> Running this code, I encountered the error:
> {panel}
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
>   java.lang.IllegalArgumentException: Unsupported join type 'semijoin'. 
> Supported join types include: 'inner', 'outer', 'full', 'fullouter', 
> 'leftouter', 'left', 'rightouter', 'right', 'leftsemi'.
> {panel}
> However, if I changed the joinType to "leftsemi", 
> {code}
> res = join(df1, df2, df1$n == df2$x, "leftsemi")
> {code}
> I would get the error:
> {panel}
> Error in .local(x, y, ...) : 
>   joinType must be one of the following types: 'inner', 'outer', 
> 'left_outer', 'right_outer', 'semijoin'
> {panel}
> Since the join function in R appears to invoke a Java method, I went into 
> DataFrame.R and changed the code on line 1374 and line 1378, replacing 
> "semijoin" with "leftsemi" to match the Java function's parameters. This also 
> makes the accepted R joinType values match Scala's.
> semijoin:
> {code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
> if (joinType %in% c("inner", "outer", "left_outer", "right_outer", 
> "semijoin")) {
> sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
> } 
> else {
>  stop("joinType must be one of the following types: ",
>  "'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'")
> }
> {code}
> leftsemi:
> {code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
> if (joinType %in% c("inner", "outer", "left_outer", "right_outer", 
> "leftsemi")) {
> sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
> } 
> else {
>  stop("joinType must be one of the following types: ",
>  "'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'")
> }
> {code}
> This fixed the issue. I'm not sure whether this solution breaks Hive 
> compatibility or causes other issues, but I can submit a pull request to 
> change this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10981) R semijoin leads to Java errors, R leftsemi leads to Spark errors

2015-10-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-10981.
---
   Resolution: Fixed
Fix Version/s: 1.6.0
   1.5.2

Resolved by https://github.com/apache/spark/pull/9029

> R semijoin leads to Java errors, R leftsemi leads to Spark errors
> -
>
> Key: SPARK-10981
> URL: https://issues.apache.org/jira/browse/SPARK-10981
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.5.0
> Environment: SparkR from RStudio on Macbook
>Reporter: Monica Liu
>Priority: Minor
>  Labels: easyfix, newbie
> Fix For: 1.5.2, 1.6.0
>
>
> I am using SparkR from RStudio, and I ran into an error with the join 
> function that I recreated with a smaller example:
> {code:title=joinTest.R|borderStyle=solid}
> Sys.setenv(SPARK_HOME="/Users/liumo1/Applications/spark/")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> sc <- sparkR.init("local[4]")
> sqlContext <- sparkRSQL.init(sc) 
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)
> df1= createDataFrame(sqlContext, df)
> showDF(df1)
> x = c(2, 3, 10)
> t = c("dd", "ee", "ff")
> c = c(FALSE, FALSE, TRUE)
> dff = data.frame(x, t, c)
> df2 = createDataFrame(sqlContext, dff)
> showDF(df2)
> res = join(df1, df2, df1$n == df2$x, "semijoin")
> showDF(res)
> {code}
> Running this code, I encountered the error:
> {panel}
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
>   java.lang.IllegalArgumentException: Unsupported join type 'semijoin'. 
> Supported join types include: 'inner', 'outer', 'full', 'fullouter', 
> 'leftouter', 'left', 'rightouter', 'right', 'leftsemi'.
> {panel}
> However, if I changed the joinType to "leftsemi", 
> {code}
> res = join(df1, df2, df1$n == df2$x, "leftsemi")
> {code}
> I would get the error:
> {panel}
> Error in .local(x, y, ...) : 
>   joinType must be one of the following types: 'inner', 'outer', 
> 'left_outer', 'right_outer', 'semijoin'
> {panel}
> Since the join function in R appears to invoke a Java method, I went into 
> DataFrame.R and changed the code on line 1374 and line 1378, replacing 
> "semijoin" with "leftsemi" to match the Java function's parameters. This also 
> makes the accepted R joinType values match Scala's.
> semijoin:
> {code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
> if (joinType %in% c("inner", "outer", "left_outer", "right_outer", 
> "semijoin")) {
> sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
> } 
> else {
>  stop("joinType must be one of the following types: ",
>  "'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'")
> }
> {code}
> leftsemi:
> {code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
> if (joinType %in% c("inner", "outer", "left_outer", "right_outer", 
> "leftsemi")) {
> sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
> } 
> else {
>  stop("joinType must be one of the following types: ",
>  "'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'")
> }
> {code}
> This fixed the issue. I'm not sure whether this solution breaks Hive 
> compatibility or causes other issues, but I can submit a pull request to 
> change this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11094) Test runner script fails to parse Java version.

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11094:


Assignee: Apache Spark

> Test runner script fails to parse Java version.
> ---
>
> Key: SPARK-11094
> URL: https://issues.apache.org/jira/browse/SPARK-11094
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
> Environment: Debian testing
>Reporter: Jakob Odersky
>Assignee: Apache Spark
>Priority: Minor
>
> Running {{dev/run-tests}} fails when the local Java version has an extra 
> string appended to it.
> For example, in Debian Stretch (currently the testing distribution), {{java 
> -version}} yields "1.8.0_66-internal", where the extra "-internal" part causes 
> the script to fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11094) Test runner script fails to parse Java version.

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11094:


Assignee: (was: Apache Spark)

> Test runner script fails to parse Java version.
> ---
>
> Key: SPARK-11094
> URL: https://issues.apache.org/jira/browse/SPARK-11094
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
> Environment: Debian testing
>Reporter: Jakob Odersky
>Priority: Minor
>
> Running {{dev/run-tests}} fails when the local Java version has an extra 
> string appended to it.
> For example, in Debian Stretch (currently the testing distribution), {{java 
> -version}} yields "1.8.0_66-internal", where the extra "-internal" part causes 
> the script to fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11094) Test runner script fails to parse Java version.

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956288#comment-14956288
 ] 

Apache Spark commented on SPARK-11094:
--

User 'jodersky' has created a pull request for this issue:
https://github.com/apache/spark/pull/9111

> Test runner script fails to parse Java version.
> ---
>
> Key: SPARK-11094
> URL: https://issues.apache.org/jira/browse/SPARK-11094
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
> Environment: Debian testing
>Reporter: Jakob Odersky
>Priority: Minor
>
> Running {{dev/run-tests}} fails when the local Java version has an extra 
> string appended to it.
> For example, in Debian Stretch (currently the testing distribution), {{java 
> -version}} yields "1.8.0_66-internal", where the extra "-internal" part causes 
> the script to fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10935) Avito Context Ad Clicks

2015-10-13 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956279#comment-14956279
 ] 

Xusen Yin commented on SPARK-10935:
---

[~mengxr] [~kpl...@gmail.com] Are you still interested in working on this? I 
can split this up if you want.

> Avito Context Ad Clicks
> ---
>
> Key: SPARK-10935
> URL: https://issues.apache.org/jira/browse/SPARK-10935
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>
> From [~kpl...@gmail.com]:
> I would love to do Avito Context Ad Clicks - 
> https://www.kaggle.com/c/avito-context-ad-clicks - but it involves a lot of 
> feature engineering and preprocessing. I would love to split this with 
> somebody else if anybody is interested in working on this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11094) Test runner script fails to parse Java version.

2015-10-13 Thread Jakob Odersky (JIRA)
Jakob Odersky created SPARK-11094:
-

 Summary: Test runner script fails to parse Java version.
 Key: SPARK-11094
 URL: https://issues.apache.org/jira/browse/SPARK-11094
 Project: Spark
  Issue Type: Bug
  Components: Tests
 Environment: Debian testing
Reporter: Jakob Odersky
Priority: Minor


Running {{dev/run-tests}} fails when the local Java version has an extra string 
appended to it.
For example, in Debian Stretch (currently the testing distribution), {{java 
-version}} yields "1.8.0_66-internal", where the extra "-internal" part causes 
the script to fail.
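
The script itself is Python, but the tolerant parse it needs is easy to sketch 
(hypothetical helper, shown here in Scala): drop any vendor suffix such as 
"-internal" before splitting the version string.

{code}
// Sketch only: tolerate version strings like "1.8.0_66-internal".
def parseJavaVersion(raw: String): (Int, Int, Int, Int) = {
  val cleaned = raw.split("-")(0)            // "1.8.0_66-internal" -> "1.8.0_66"
  val Array(major, minor, rest) = cleaned.split("\\.", 3)
  val Array(patch, update) = rest.split("_", 2)
  (major.toInt, minor.toInt, patch.toInt, update.toInt)
}

// parseJavaVersion("1.8.0_66-internal") == (1, 8, 0, 66)
{code}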



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10055) San Francisco Crime Classification

2015-10-13 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956276#comment-14956276
 ] 

Xusen Yin commented on SPARK-10055:
---

Yes, I will find a new dataset soon and ping you on JIRA.

> San Francisco Crime Classification
> --
>
> Key: SPARK-10055
> URL: https://issues.apache.org/jira/browse/SPARK-10055
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Kai Sasaki
>
> Apply ML pipeline API to San Francisco Crime Classification 
> (https://www.kaggle.com/c/sf-crime).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11092) Add source URLs to API documentation.

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11092:


Assignee: (was: Apache Spark)

> Add source URLs to API documentation.
> -
>
> Key: SPARK-11092
> URL: https://issues.apache.org/jira/browse/SPARK-11092
> Project: Spark
>  Issue Type: Documentation
>  Components: Build, Documentation
>Reporter: Jakob Odersky
>Priority: Trivial
>
> It would be nice to have source URLs in the Spark scaladoc, similar to the 
> standard library (e.g. 
> http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).
> The fix should be really simple, just adding a line to the sbt unidoc 
> settings.
> I'll use the github repo url 
> bq. https://github.com/apache/spark/tree/v${version}/${FILE_PATH}
> Feel free to tell me if I should use something else as base url.
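> A hedged sketch of what that single setting could look like (€{FILE_PATH} is 
> scaladoc's own placeholder; exactly where this lands in the unidoc settings is 
> up to the PR):
> {code}
> // sketch only, not necessarily the final form
> scalacOptions in (Compile, doc) ++= Seq(
>   "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath,
>   "-doc-source-url",
>   "https://github.com/apache/spark/tree/v" + version.value + "/€{FILE_PATH}.scala"
> )
> {code}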



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11092) Add source URLs to API documentation.

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956270#comment-14956270
 ] 

Apache Spark commented on SPARK-11092:
--

User 'jodersky' has created a pull request for this issue:
https://github.com/apache/spark/pull/9110

> Add source URLs to API documentation.
> -
>
> Key: SPARK-11092
> URL: https://issues.apache.org/jira/browse/SPARK-11092
> Project: Spark
>  Issue Type: Documentation
>  Components: Build, Documentation
>Reporter: Jakob Odersky
>Priority: Trivial
>
> It would be nice to have source URLs in the Spark scaladoc, similar to the 
> standard library (e.g. 
> http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).
> The fix should be really simple, just adding a line to the sbt unidoc 
> settings.
> I'll use the github repo url 
> bq. https://github.com/apache/spark/tree/v${version}/${FILE_PATH}
> Feel free to tell me if I should use something else as base url.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11092) Add source URLs to API documentation.

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11092:


Assignee: Apache Spark

> Add source URLs to API documentation.
> -
>
> Key: SPARK-11092
> URL: https://issues.apache.org/jira/browse/SPARK-11092
> Project: Spark
>  Issue Type: Documentation
>  Components: Build, Documentation
>Reporter: Jakob Odersky
>Assignee: Apache Spark
>Priority: Trivial
>
> It would be nice to have source URLs in the Spark scaladoc, similar to the 
> standard library (e.g. 
> http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).
> The fix should be really simple, just adding a line to the sbt unidoc 
> settings.
> I'll use the github repo url 
> bq. https://github.com/apache/spark/tree/v${version}/${FILE_PATH}
> Feel free to tell me if I should use something else as base url.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10382) Make example code in user guide testable

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10382:


Assignee: Apache Spark  (was: Xusen Yin)

> Make example code in user guide testable
> 
>
> Key: SPARK-10382
> URL: https://issues.apache.org/jira/browse/SPARK-10382
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence is 
> not easy to test. It would be nice to test it automatically. This JIRA is to 
> discuss options for automating example code testing and to see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "guide" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Just one way to implement this. It would be nice to hear more ideas.
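> Purely as an illustration, the marked example file could look something like 
> this (the marker-comment syntax is hypothetical and still open for discussion):
> {code}
> // examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala (sketch)
> package org.apache.spark.examples.ml
>
> object KMeansExample {
>   def main(args: Array[String]): Unit = {
>     // guide:begin  <- only the block between these markers lands in the guide
>     val k = 2
>     val points = Seq((0.0, 0.0), (0.1, 0.1), (9.0, 9.0), (9.1, 9.1))
>     println(s"would cluster ${points.size} points into $k groups")
>     // guide:end
>   }
> }
> {code}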



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10382) Make example code in user guide testable

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956250#comment-14956250
 ] 

Apache Spark commented on SPARK-10382:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9109

> Make example code in user guide testable
> 
>
> Key: SPARK-10382
> URL: https://issues.apache.org/jira/browse/SPARK-10382
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence is 
> not easy to test. It would be nice to test it automatically. This JIRA is to 
> discuss options for automating example code testing and to see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "guide" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Just one way to implement this. It would be nice to hear more ideas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10382) Make example code in user guide testable

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10382:


Assignee: Xusen Yin  (was: Apache Spark)

> Make example code in user guide testable
> 
>
> Key: SPARK-10382
> URL: https://issues.apache.org/jira/browse/SPARK-10382
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence is 
> not easy to test. It would be nice to test it automatically. This JIRA is to 
> discuss options for automating example code testing and to see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "guide" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Just one way to implement this. It would be nice to hear more ideas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9302) Handle complex JSON types in collect()/head()

2015-10-13 Thread Sun Rui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sun Rui closed SPARK-9302.
--
Resolution: Fixed

> Handle complex JSON types in collect()/head()
> -
>
> Key: SPARK-9302
> URL: https://issues.apache.org/jira/browse/SPARK-9302
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Sun Rui
>
> Reported in the mailing list by Exie :
> {noformat}
> A sample record in raw JSON looks like this:
> {"version": 1,"event": "view","timestamp": 1427846422377,"system":
> "DCDS","asset": "6404476","assetType": "myType","assetCategory":
> "myCategory","extras": [{"name": "videoSource","value": "mySource"},{"name":
> "playerType","value": "Article"},{"name": "duration","value":
> "202088"}],"trackingId": "155629a0-d802-11e4-13ee-6884e43d6000","ipAddress":
> "165.69.2.4","title": "myTitle"}
> > head(mydf)
> Error in as.data.frame.default(x[[i]], optional = TRUE) : 
>   cannot coerce class ""jobj"" to a data.frame
> >
> > show(mydf)
> DataFrame[localEventDtTm:timestamp, asset:string, assetCategory:string, 
> assetType:string, event:string, 
> extras:array<struct<name:string,value:string>>, ipAddress:string, 
> memberId:string, system:string, timestamp:bigint, title:string, 
> trackingId:string, version:bigint]
> >
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9302) Handle complex JSON types in collect()/head()

2015-10-13 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956237#comment-14956237
 ] 

Sun Rui commented on SPARK-9302:


This was fixed after support for complex types in DataFrame was added.

> Handle complex JSON types in collect()/head()
> -
>
> Key: SPARK-9302
> URL: https://issues.apache.org/jira/browse/SPARK-9302
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Sun Rui
>
> Reported in the mailing list by Exie :
> {noformat}
> A sample record in raw JSON looks like this:
> {"version": 1,"event": "view","timestamp": 1427846422377,"system":
> "DCDS","asset": "6404476","assetType": "myType","assetCategory":
> "myCategory","extras": [{"name": "videoSource","value": "mySource"},{"name":
> "playerType","value": "Article"},{"name": "duration","value":
> "202088"}],"trackingId": "155629a0-d802-11e4-13ee-6884e43d6000","ipAddress":
> "165.69.2.4","title": "myTitle"}
> > head(mydf)
> Error in as.data.frame.default(x[[i]], optional = TRUE) : 
>   cannot coerce class ""jobj"" to a data.frame
> >
> > show(mydf)
> DataFrame[localEventDtTm:timestamp, asset:string, assetCategory:string, 
> assetType:string, event:string, 
> extras:array<struct<name:string,value:string>>, ipAddress:string, 
> memberId:string, system:string, timestamp:bigint, title:string, 
> trackingId:string, version:bigint]
> >
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9694) Add random seed Param to Scala CrossValidator

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9694:
---

Assignee: Apache Spark

> Add random seed Param to Scala CrossValidator
> -
>
> Key: SPARK-9694
> URL: https://issues.apache.org/jira/browse/SPARK-9694
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9694) Add random seed Param to Scala CrossValidator

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956234#comment-14956234
 ] 

Apache Spark commented on SPARK-9694:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9108

> Add random seed Param to Scala CrossValidator
> -
>
> Key: SPARK-9694
> URL: https://issues.apache.org/jira/browse/SPARK-9694
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9694) Add random seed Param to Scala CrossValidator

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9694:
---

Assignee: (was: Apache Spark)

> Add random seed Param to Scala CrossValidator
> -
>
> Key: SPARK-9694
> URL: https://issues.apache.org/jira/browse/SPARK-9694
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9338) Aliases from SELECT not available in GROUP BY

2015-10-13 Thread fang fang chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956196#comment-14956196
 ] 

fang fang chen commented on SPARK-9338:
---

I also encountered this issue. The SQL is very simple:
select id as id_test from user group by id_test limit 100
The error is:
cannot resolve 'id_test' given input columns ..., id,  ...;

> Aliases from SELECT not available in GROUP BY
> -
>
> Key: SPARK-9338
> URL: https://issues.apache.org/jira/browse/SPARK-9338
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Observed on Mac OS X and Ubuntu 14.04
>Reporter: James Aley
>
> It feels like this should really be a known issue, but I've not been able to 
> find any mailing list or JIRA tickets for exactly this. There are a few 
> closed/resolved tickets about specific types of exceptions, but I couldn't 
> find this exact problem, so apologies if this is a dupe!
> Spark SQL doesn't appear to support referencing aliases from a SELECT in the 
> GROUP BY part of the query. This is confusing our analysts, as it works in 
> most other tools they use. Here's an example to reproduce:
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val schema =
>   StructType(
> StructField("x", IntegerType, nullable=false) ::
> StructField("y",
>   StructType(StructField("a", DoubleType, nullable=false) :: Nil),
>   nullable=false) :: Nil)
> val rdd = sc.parallelize(
>   Row(1, Row(1.0)) :: Row(2, Row(1.34)) :: Row(3, Row(2.3)) :: Row(4, 
> Row(2.5)) :: Nil)
> val df = sqlContext.createDataFrame(rdd, schema)
> // DataFrame content looks like this:
> // x   y
> // 1   {a: 1.0}
> // 2   {a: 1.34}
> // 3   {a: 2.3}
> // 4   {a: 2.5}
> df.registerTempTable("test_data")
> sqlContext.udf.register("roundToInt", (x: Double) => x.toInt)
> sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(x) as s FROM test_data 
> GROUP BY grp").show()
> // => org.apache.spark.sql.AnalysisException: cannot resolve 'grp' given 
> input columns x, y
> sqlContext.sql("SELECT y.a as grp, SUM(x) as s FROM test_data GROUP BY 
> grp").show()
> // => org.apache.spark.sql.AnalysisException: cannot resolve 'grp' given 
> input columns x, y;
> sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(y.a) as s FROM test_data 
> GROUP BY roundToInt(y.a)").show()
> // =>
> // +---+----+
> // |grp|   s|
> // +---+----+
> // |  1|2.34|
> // |  2| 4.8|
> // +---+----+
> {code}
> As you can see, it's particularly inconvenient when using UDFs on nested 
> fields, as it means repeating some potentially complex expressions. It's very 
> common for us to want to make a date type conversion (from epoch milliseconds 
> or something) from some nested field, then reference it in multiple places in 
> the query. With this issue, it makes for quite verbose queries. 
> Might it also mean that we're mapping these functions over the data twice? I 
> can't quite tell from the explain output whether that's been optimised out or 
> not, but here it is for somebody who understands :-)
> {code}
> sqlContext.sql("SELECT roundToInt(y.a) as grp, SUM(x) as s FROM test_data 
> GROUP BY roundToInt(y.a)").explain()
> // == Physical Plan ==
> // Aggregate false, [PartialGroup#126], [PartialGroup#126 AS 
> grp#116,CombineSum(PartialSum#125L) AS s#117L]
> // Exchange (HashPartitioning 200)
> // Aggregate true, [scalaUDF(y#7.a)], [scalaUDF(y#7.a) AS 
> PartialGroup#126,SUM(CAST(x#6, LongType)) AS PartialSum#125L]
> // PhysicalRDD [x#6,y#7], MapPartitionsRDD[10] at createDataFrame at 
> <console>:31
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9443) Expose sampleByKey in SparkR

2015-10-13 Thread Sun Rui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sun Rui closed SPARK-9443.
--
Resolution: Duplicate

> Expose sampleByKey in SparkR
> 
>
> Key: SPARK-9443
> URL: https://issues.apache.org/jira/browse/SPARK-9443
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Hossein Falaki
>
> There is pull request for DataFrames (I believe close to merging) that adds 
> sampleByKey. It would be great to expose it in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9443) Expose sampleByKey in SparkR

2015-10-13 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956179#comment-14956179
 ] 

Sun Rui commented on SPARK-9443:


Closing it as a duplicate of SPARK-10996.

> Expose sampleByKey in SparkR
> 
>
> Key: SPARK-9443
> URL: https://issues.apache.org/jira/browse/SPARK-9443
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Hossein Falaki
>
> There is pull request for DataFrames (I believe close to merging) that adds 
> sampleByKey. It would be great to expose it in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11067) Spark SQL thrift server fails to handle decimal value

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11067:


Assignee: (was: Apache Spark)

> Spark SQL thrift server fails to handle decimal value
> -
>
> Key: SPARK-11067
> URL: https://issues.apache.org/jira/browse/SPARK-11067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Alex Liu
> Attachments: SPARK-11067.1.patch.txt
>
>
> When executing the following query through beeline connected to the Spark SQL 
> thrift server, it errors out for a decimal column:
> {code}
> Select decimal_column from table
> WARN  2015-10-09 15:04:00 
> org.apache.hive.service.cli.thrift.ThriftCLIService: Error fetching results: 
> java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal
>   at 
> org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:174) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:144)
>  ~[spark-hive-thriftserver_2.10-1.4.1.1.jar:1.4.1.1]
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>  [hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>  [hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) 
> [libthrift-0.9.2.jar:0.9.2]
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
> [libthrift-0.9.2.jar:0.9.2]
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
>  [hive-service-0.13.1a.jar:4.8.1-SNAPSHOT]
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
>  [libthrift-0.9.2.jar:0.9.2]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_55]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_55]
>   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11067) Spark SQL thrift server fails to handle decimal value

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956129#comment-14956129
 ] 

Apache Spark commented on SPARK-11067:
--

User 'navis' has created a pull request for this issue:
https://github.com/apache/spark/pull/9107

> Spark SQL thrift server fails to handle decimal value
> -
>
> Key: SPARK-11067
> URL: https://issues.apache.org/jira/browse/SPARK-11067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Alex Liu
> Attachments: SPARK-11067.1.patch.txt
>
>
> When executing the following query through beeline connected to the Spark SQL 
> thrift server, it errors out for a decimal column:
> {code}
> Select decimal_column from table
> WARN  2015-10-09 15:04:00 
> org.apache.hive.service.cli.thrift.ThriftCLIService: Error fetching results: 
> java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal
>   at 
> org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:174) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:144)
>  ~[spark-hive-thriftserver_2.10-1.4.1.1.jar:1.4.1.1]
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>  [hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>  [hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) 
> [libthrift-0.9.2.jar:0.9.2]
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
> [libthrift-0.9.2.jar:0.9.2]
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
>  [hive-service-0.13.1a.jar:4.8.1-SNAPSHOT]
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
>  [libthrift-0.9.2.jar:0.9.2]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_55]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_55]
>   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11067) Spark SQL thrift server fails to handle decimal value

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11067:


Assignee: Apache Spark

> Spark SQL thrift server fails to handle decimal value
> -
>
> Key: SPARK-11067
> URL: https://issues.apache.org/jira/browse/SPARK-11067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Alex Liu
>Assignee: Apache Spark
> Attachments: SPARK-11067.1.patch.txt
>
>
> When executing the following query through beeline connected to the Spark SQL 
> thrift server, it errors out for a decimal column:
> {code}
> Select decimal_column from table
> WARN  2015-10-09 15:04:00 
> org.apache.hive.service.cli.thrift.ThriftCLIService: Error fetching results: 
> java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal
>   at 
> org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:174) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:144)
>  ~[spark-hive-thriftserver_2.10-1.4.1.1.jar:1.4.1.1]
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>  [hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>  [hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) 
> [libthrift-0.9.2.jar:0.9.2]
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
> [libthrift-0.9.2.jar:0.9.2]
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
>  [hive-service-0.13.1a.jar:4.8.1-SNAPSHOT]
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
>  [libthrift-0.9.2.jar:0.9.2]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_55]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_55]
>   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11067) Spark SQL thrift server fails to handle decimal value

2015-10-13 Thread Navis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956121#comment-14956121
 ] 

Navis commented on SPARK-11067:
---

[~alexliu68] Seeing that "RowBasedSet" is in the stack trace, this is an older 
version of the Hive JDBC layer. Anyway, with the attached patch, decimals are 
serialized to a string on the server via
{code}
HiveDecimal.create(from.getDecimal(ordinal)).bigDecimalValue().toPlainString()
{code}
and deserialized to a BigDecimal on the client via
{code}
new BigDecimal(string)
{code}

First, this is a heavy computation and could affect performance.
Second, it does not seem exact to use toPlainString(), which removes trailing 
zeros; toString() should be used instead.

> Spark SQL thrift server fails to handle decimal value
> -
>
> Key: SPARK-11067
> URL: https://issues.apache.org/jira/browse/SPARK-11067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Alex Liu
> Attachments: SPARK-11067.1.patch.txt
>
>
> When executing the following query through beeline connected to the Spark SQL 
> thrift server, it errors out for a decimal column:
> {code}
> Select decimal_column from table
> WARN  2015-10-09 15:04:00 
> org.apache.hive.service.cli.thrift.ThriftCLIService: Error fetching results: 
> java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal
>   at 
> org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:174) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(Shim13.scala:144)
>  ~[spark-hive-thriftserver_2.10-1.4.1.1.jar:1.4.1.1]
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:192)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:471)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:405) 
> ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:530)
>  ~[hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>  [hive-service-0.13.1a.jar:0.13.1a]
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>  [hive-service-0.13.1a.jar:0.13.1a]
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) 
> [libthrift-0.9.2.jar:0.9.2]
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
> [libthrift-0.9.2.jar:0.9.2]
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
>  [hive-service-0.13.1a.jar:4.8.1-SNAPSHOT]
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
>  [libthrift-0.9.2.jar:0.9.2]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_55]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_55]
>   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11093) ChildFirstURLClassLoader#getResources should return all found resources, not just those in the child classloader

2015-10-13 Thread Adam Lewandowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Lewandowski updated SPARK-11093:
-
Description: 
Currently when using a child-first classloader 
(spark.driver|executor.userClassPathFirst = true), the getResources method does 
not return any matching resources from the parent classloader if the child 
classloader contains any. This is not child-first, it's child-only and is 
inconsistent with how the default parent-first classloaders work in the JDK 
(all found resources are returned from both classloaders). It is also 
inconsistent with how child-first classloaders work in other environments 
(Servlet containers, for example). 
ChildFirstURLClassLoader#getResources() should return resources found from both 
the child and the parent classloaders, placing any found from the child 
classloader first. 

For reference, the specific use case where I encountered this problem was 
running Spark on AWS EMR in a child-first arrangement (due to guava version 
conflicts), where Akka's configuration file (reference.conf) was made available 
in the parent classloader, but was not visible to the Typesafe config library 
which uses Classloader.getResources() on the Thread's context classloader to 
find them. This resulted in a fatal error from the Config library: 
"com.typesafe.config.ConfigException$Missing: No configuration setting found 
for key 'akka.version'" .
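
For illustration, here is a minimal sketch of the child-then-parent behaviour 
suggested above (the class name and the explicit parent field are assumptions 
for the example, not Spark's actual implementation):
{code}
import java.net.{URL, URLClassLoader}
import java.util.{Collections, Enumeration => JEnumeration}
import scala.collection.JavaConverters._

// Resources found on the child classpath come first, followed by everything
// the parent can see, instead of returning the child's matches only.
class ChildFirstLoaderSketch(urls: Array[URL], realParent: ClassLoader)
  extends URLClassLoader(urls, null) {

  override def getResources(name: String): JEnumeration[URL] = {
    val childUrls  = findResources(name).asScala             // this loader's URLs only
    val parentUrls = realParent.getResources(name).asScala   // the parent's matches
    Collections.enumeration((childUrls ++ parentUrls).toSeq.asJava)
  }
}
{code}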


  was:
Currently when using a child-first classloader 
(spark.{driver|executor}.userClassPathFirst = true), the getResources method 
does not return any matching resources from the parent classloader if the child 
classloader contains any. This is not child-first, it's child-only and is 
inconsistent with how the default parent-first classloaders work in the JDK 
(all found resources are returned from both classloaders). It is also 
inconsistent with how child-first classloaders work in other environments 
(Servlet containers, for example). 
ChildFirstURLClassLoader#getResources() should return resources found from both 
the child and the parent classloaders, placing any found from the child 
classloader first. 

For reference, the specific use case where I encountered this problem was 
running Spark on AWS EMR in a child-first arrangement (due to guava version 
conflicts), where Akka's configuration file (reference.conf) was made available 
in the parent classloader, but was not visible to the Typesafe config library 
which uses Classloader.getResources() on the Thread's context classloader to 
find them. This resulted in a fatal error from the Config library: 
"com.typesafe.config.ConfigException$Missing: No configuration setting found 
for key 'akka.version'" .



> ChildFirstURLClassLoader#getResources should return all found resources, not 
> just those in the child classloader
> 
>
> Key: SPARK-11093
> URL: https://issues.apache.org/jira/browse/SPARK-11093
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Adam Lewandowski
>
> Currently when using a child-first classloader 
> (spark.driver|executor.userClassPathFirst = true), the getResources method 
> does not return any matching resources from the parent classloader if the 
> child classloader contains any. This is not child-first, it's child-only and 
> is inconsistent with how the default parent-first classloaders work in the 
> JDK (all found resources are returned from both classloaders). It is also 
> inconsistent with how child-first classloaders work in other environments 
> (Servlet containers, for example). 
> ChildFirstURLClassLoader#getResources() should return resources found from 
> both the child and the parent classloaders, placing any found from the child 
> classloader first. 
> For reference, the specific use case where I encountered this problem was 
> running Spark on AWS EMR in a child-first arrangement (due to guava version 
> conflicts), where Akka's configuration file (reference.conf) was made 
> available in the parent classloader, but was not visible to the Typesafe 
> config library which uses Classloader.getResources() on the Thread's context 
> classloader to find them. This resulted in a fatal error from the Config 
> library: "com.typesafe.config.ConfigException$Missing: No configuration 
> setting found for key 'akka.version'" .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11093) ChildFirstURLClassLoader#getResources should return all found resources, not just those in the child classloader

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11093:


Assignee: (was: Apache Spark)

> ChildFirstURLClassLoader#getResources should return all found resources, not 
> just those in the child classloader
> 
>
> Key: SPARK-11093
> URL: https://issues.apache.org/jira/browse/SPARK-11093
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Adam Lewandowski
>
> Currently when using a child-first classloader 
> (spark.{driver|executor}.userClassPathFirst = true), the getResources method 
> does not return any matching resources from the parent classloader if the 
> child classloader contains any. This is not child-first, it's child-only and 
> is inconsistent with how the default parent-first classloaders work in the 
> JDK (all found resources are returned from both classloaders). It is also 
> inconsistent with how child-first classloaders work in other environments 
> (Servlet containers, for example). 
> ChildFirstURLClassLoader#getResources() should return resources found from 
> both the child and the parent classloaders, placing any found from the child 
> classloader first. 
> For reference, the specific use case where I encountered this problem was 
> running Spark on AWS EMR in a child-first arrangement (due to guava version 
> conflicts), where Akka's configuration file (reference.conf) was made 
> available in the parent classloader, but was not visible to the Typesafe 
> config library which uses Classloader.getResources() on the Thread's context 
> classloader to find them. This resulted in a fatal error from the Config 
> library: "com.typesafe.config.ConfigException$Missing: No configuration 
> setting found for key 'akka.version'" .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11093) ChildFirstURLClassLoader#getResources should return all found resources, not just those in the child classloader

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956099#comment-14956099
 ] 

Apache Spark commented on SPARK-11093:
--

User 'alewando' has created a pull request for this issue:
https://github.com/apache/spark/pull/9106

> ChildFirstURLClassLoader#getResources should return all found resources, not 
> just those in the child classloader
> 
>
> Key: SPARK-11093
> URL: https://issues.apache.org/jira/browse/SPARK-11093
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Adam Lewandowski
>
> Currently when using a child-first classloader 
> (spark.{driver|executor}.userClassPathFirst = true), the getResources method 
> does not return any matching resources from the parent classloader if the 
> child classloader contains any. This is not child-first, it's child-only and 
> is inconsistent with how the default parent-first classloaders work in the 
> JDK (all found resources are returned from both classloaders). It is also 
> inconsistent with how child-first classloaders work in other environments 
> (Servlet containers, for example). 
> ChildFirstURLClassLoader#getResources() should return resources found from 
> both the child and the parent classloaders, placing any found from the child 
> classloader first. 
> For reference, the specific use case where I encountered this problem was 
> running Spark on AWS EMR in a child-first arrangement (due to guava version 
> conflicts), where Akka's configuration file (reference.conf) was made 
> available in the parent classloader, but was not visible to the Typesafe 
> config library which uses Classloader.getResources() on the Thread's context 
> classloader to find them. This resulted in a fatal error from the Config 
> library: "com.typesafe.config.ConfigException$Missing: No configuration 
> setting found for key 'akka.version'" .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11093) ChildFirstURLClassLoader#getResources should return all found resources, not just those in the child classloader

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11093:


Assignee: Apache Spark

> ChildFirstURLClassLoader#getResources should return all found resources, not 
> just those in the child classloader
> 
>
> Key: SPARK-11093
> URL: https://issues.apache.org/jira/browse/SPARK-11093
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Adam Lewandowski
>Assignee: Apache Spark
>
> Currently when using a child-first classloader 
> (spark.{driver|executor}.userClassPathFirst = true), the getResources method 
> does not return any matching resources from the parent classloader if the 
> child classloader contains any. This is not child-first, it's child-only and 
> is inconsistent with how the default parent-first classloaders work in the 
> JDK (all found resources are returned from both classloaders). It is also 
> inconsistent with how child-first classloaders work in other environments 
> (Servlet containers, for example). 
> ChildFirstURLClassLoader#getResources() should return resources found from 
> both the child and the parent classloaders, placing any found from the child 
> classloader first. 
> For reference, the specific use case where I encountered this problem was 
> running Spark on AWS EMR in a child-first arrangement (due to guava version 
> conflicts), where Akka's configuration file (reference.conf) was made 
> available in the parent classloader, but was not visible to the Typesafe 
> config library which uses Classloader.getResources() on the Thread's context 
> classloader to find them. This resulted in a fatal error from the Config 
> library: "com.typesafe.config.ConfigException$Missing: No configuration 
> setting found for key 'akka.version'" .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10747) add support for window specification to include how NULLS are ordered

2015-10-13 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956094#comment-14956094
 ] 

Xin Wu commented on SPARK-10747:


I ran this query on the released Hive 1.2.1, and it is not supported yet:
{code}
hive> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 
desc nulls last) from tolap;
FAILED: ParseException line 1:76 missing ) at 'nulls' near 'nulls'
line 1:82 missing EOF at 'last' near 'nulls'
{code}

Spark SQL uses the Hive QL parser to parse the query, so it fails as well:

{code}
scala> sqlContext.sql("select rnum, c1, c2, c3, dense_rank() over(partition by 
c1 order by c3 desc nulls last) from tolap")
org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' near 
'nulls'
line 1:82 missing EOF at 'last' near 'nulls';
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:298)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:276)
at org.apache.spark.sql.hive.HiveQLDialect.parse(HiveContext.scala:62)
at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:173)
at 
org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:173)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:115)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)

{code}

In HiveQl.scala, you can see the following, where getAst(sql) throws the 
org.apache.hadoop.hive.ql.parse.ParseException:

{code}
def createPlan(sql: String): LogicalPlan = {
  try {
    val tree = getAst(sql)
    if (nativeCommands contains tree.getText) {
      HiveNativeCommand(sql)
    } else {
      nodeToPlan(tree) match {
        case NativePlaceholder => HiveNativeCommand(sql)
        case other => other
      }
    }
  } catch {
    case pe: org.apache.hadoop.hive.ql.parse.ParseException =>
      pe.getMessage match {
        case errorRegEx(line, start, message) =>
          throw new AnalysisException(message, Some(line.toInt), Some(start.toInt))
        case otherMessage =>
          throw new AnalysisException(otherMessage)
      }
  }
}
{code}

which is thrown by org.apache.hadoop.hive.ql.parse.ParseDriver.java

{code}
public ASTNode parse(String command) throws ParseException {
  return this.parse(command, (Context)null);
}
{code}

So I think this needs to wait for HIVE-9535 to be resolved.
I am new to the Spark code and still learning it, so I hope my understanding is 
correct here. 


> add support f

[jira] [Closed] (SPARK-11082) Cores per executor is wrong when response vcore number is less than requested number

2015-10-13 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao closed SPARK-11082.
---
Resolution: Invalid

> Cores per executor is wrong when response vcore number is less than requested 
> number
> 
>
> Key: SPARK-11082
> URL: https://issues.apache.org/jira/browse/SPARK-11082
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>
> When DefaultResourceCalculator is set (by default) for Yarn capacity 
> scheduler, the response container resource vcore number is always 1, which 
> may be less than the requested vcore number, ExecutorRunnable should honor 
> this returned vcore number (not the requested number) to pass to each 
> executor. Otherwise, actual allocated vcore number is different from Spark's 
> managed CPU cores per executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11093) ChildFirstURLClassLoader#getResources should return all found resources, not just those in the child classloader

2015-10-13 Thread Adam Lewandowski (JIRA)
Adam Lewandowski created SPARK-11093:


 Summary: ChildFirstURLClassLoader#getResources should return all 
found resources, not just those in the child classloader
 Key: SPARK-11093
 URL: https://issues.apache.org/jira/browse/SPARK-11093
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.1
Reporter: Adam Lewandowski


Currently when using a child-first classloader 
(spark.{driver|executor}.userClassPathFirst = true), the getResources method 
does not return any matching resources from the parent classloader if the child 
classloader contains any. This is not child-first, it's child-only and is 
inconsistent with how the default parent-first classloaders work in the JDK 
(all found resources are returned from both classloaders). It is also 
inconsistent with how child-first classloaders work in other environments 
(Servlet containers, for example). 
ChildFirstURLClassLoader#getResources() should return resources found from both 
the child and the parent classloaders, placing any found from the child 
classloader first. 

For reference, the specific use case where I encountered this problem was 
running Spark on AWS EMR in a child-first arrangement (due to guava version 
conflicts), where Akka's configuration file (reference.conf) was made available 
in the parent classloader, but was not visible to the Typesafe config library 
which uses Classloader.getResources() on the Thread's context classloader to 
find them. This resulted in a fatal error from the Config library: 
"com.typesafe.config.ConfigException$Missing: No configuration setting found 
for key 'akka.version'" .




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11091) Change the flag of spark.sql.canonicalizeView to spark.sql.nativeView

2015-10-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-11091.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9103
[https://github.com/apache/spark/pull/9103]

> Change the flag of spark.sql.canonicalizeView to spark.sql.nativeView
> -
>
> Key: SPARK-11091
> URL: https://issues.apache.org/jira/browse/SPARK-11091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.6.0
>
>
> The meaning of this flag is exactly the opposite. Let's change it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11068) Add callback to query execution

2015-10-13 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11068.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9078
[https://github.com/apache/spark/pull/9078]

> Add callback to query execution
> ---
>
> Key: SPARK-11068
> URL: https://issues.apache.org/jira/browse/SPARK-11068
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11092) Add source URLs to API documentation.

2015-10-13 Thread Jakob Odersky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Odersky updated SPARK-11092:
--
Description: 
It would be nice to have source URLs in the Spark scaladoc, similar to the 
standard library (e.g. 
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).

The fix should be really simple: just adding a line to the sbt unidoc settings.
I'll use the GitHub repo URL 
bq. https://github.com/apache/spark/tree/v${version}/${FILE_PATH}
Feel free to tell me if I should use something else as the base URL.
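
For concreteness, a rough sketch of the kind of line in question (assuming 
sbt-unidoc's ScalaUnidoc/unidoc scope and scaladoc's -doc-source-url / 
-sourcepath options; the exact setting keys and the €{FILE_PATH} placeholder 
are assumptions here, not taken from the Spark build):
{code}
// Assumes the sbt-unidoc plugin is already enabled for the build.
scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
  "-doc-source-url",
  s"https://github.com/apache/spark/tree/v${version.value}/€{FILE_PATH}.scala",
  "-sourcepath",
  (baseDirectory in ThisBuild).value.getAbsolutePath
)
{code}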

  was:
It would be nice to have source URLs in the Spark scaladoc, similar to the 
standard library (e.g. 
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).

The fix should be really simple, just adding a line to the sbt unidoc settings.
I'll use the github repo url  
"https://github.com/apache/spark/tree/v${version}/${FILE_PATH}";). Feel free to 
tell me if I should use something else as base url.


> Add source URLs to API documentation.
> -
>
> Key: SPARK-11092
> URL: https://issues.apache.org/jira/browse/SPARK-11092
> Project: Spark
>  Issue Type: Documentation
>  Components: Build, Documentation
>Reporter: Jakob Odersky
>Priority: Trivial
>
> It would be nice to have source URLs in the Spark scaladoc, similar to the 
> standard library (e.g. 
> http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).
> The fix should be really simple, just adding a line to the sbt unidoc 
> settings.
> I'll use the github repo url 
> bq. https://github.com/apache/spark/tree/v${version}/${FILE_PATH}
> Feel free to tell me if I should use something else as base url.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11092) Add source URLs to API documentation.

2015-10-13 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955996#comment-14955996
 ] 

Jakob Odersky commented on SPARK-11092:
---

I can't set the assignee field, though I'd like to resolve this issue.

> Add source URLs to API documentation.
> -
>
> Key: SPARK-11092
> URL: https://issues.apache.org/jira/browse/SPARK-11092
> Project: Spark
>  Issue Type: Documentation
>  Components: Build, Documentation
>Reporter: Jakob Odersky
>Priority: Trivial
>
> It would be nice to have source URLs in the Spark scaladoc, similar to the 
> standard library (e.g. 
> http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).
> The fix should be really simple, just adding a line to the sbt unidoc 
> settings.
> I'll use the github repo url  
> "https://github.com/apache/spark/tree/v${version}/${FILE_PATH}";). Feel free 
> to tell me if I should use something else as base url.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11092) Add source URLs to API documentation.

2015-10-13 Thread Jakob Odersky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Odersky updated SPARK-11092:
--
Description: 
It would be nice to have source URLs in the Spark scaladoc, similar to the 
standard library (e.g. 
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).

The fix should be really simple, just adding a line to the sbt unidoc settings.
I'll use the github repo url  
"https://github.com/apache/spark/tree/v${version}/${FILE_PATH}";). Feel free to 
tell me if I should use something else as base url.

  was:
It would be nice to have source URLs in the Spark scaladoc, similar to the 
standard library (e.g. 
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).

The fix should be really simple, just adding a line to the sbt unidoc settings.
I'll use the github repo url  
"https://github.com/apache/spark/tree/v${version}/${FILE_PATH";). Feel free to 
tell me if I should use something else as base url.


> Add source URLs to API documentation.
> -
>
> Key: SPARK-11092
> URL: https://issues.apache.org/jira/browse/SPARK-11092
> Project: Spark
>  Issue Type: Documentation
>  Components: Build, Documentation
>Reporter: Jakob Odersky
>Priority: Trivial
>
> It would be nice to have source URLs in the Spark scaladoc, similar to the 
> standard library (e.g. 
> http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).
> The fix should be really simple, just adding a line to the sbt unidoc 
> settings.
> I'll use the github repo url  
> "https://github.com/apache/spark/tree/v${version}/${FILE_PATH}";). Feel free 
> to tell me if I should use something else as base url.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11092) Add source URLs to API documentation.

2015-10-13 Thread Jakob Odersky (JIRA)
Jakob Odersky created SPARK-11092:
-

 Summary: Add source URLs to API documentation.
 Key: SPARK-11092
 URL: https://issues.apache.org/jira/browse/SPARK-11092
 Project: Spark
  Issue Type: Documentation
  Components: Build, Documentation
Reporter: Jakob Odersky
Priority: Trivial


It would be nice to have source URLs in the Spark scaladoc, similar to the 
standard library (e.g. 
http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List).

The fix should be really simple, just adding a line to the sbt unidoc settings.
I'll use the github repo url  
"https://github.com/apache/spark/tree/v${version}/${FILE_PATH";). Feel free to 
tell me if I should use something else as base url.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11032) Failure to resolve having correctly

2015-10-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11032.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9105
[https://github.com/apache/spark/pull/9105]

> Failure to resolve having correctly
> ---
>
> Key: SPARK-11032
> URL: https://issues.apache.org/jira/browse/SPARK-11032
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Michael Armbrust
>Priority: Blocker
> Fix For: 1.6.0
>
>
> This is a regression from Spark 1.4
> {code}
> Seq(("michael", 30)).toDF("name", "age").registerTempTable("people")
> sql("SELECT MIN(t0.age) FROM (SELECT * FROM PEOPLE WHERE age > 0) t0 
> HAVING(COUNT(1) > 0)").explain(true)
> == Parsed Logical Plan ==
> 'Filter cast(('COUNT(1) > 0) as boolean)
>  'Project [unresolvedalias('MIN('t0.age))]
>   'Subquery t0
>'Project [unresolvedalias(*)]
> 'Filter ('age > 0)
>  'UnresolvedRelation [PEOPLE], None
> == Analyzed Logical Plan ==
> _c0: int
> Filter cast((count(1) > cast(0 as bigint)) as boolean)
>  Aggregate [min(age#6) AS _c0#9]
>   Subquery t0
>Project [name#5,age#6]
> Filter (age#6 > 0)
>  Subquery people
>   Project [_1#3 AS name#5,_2#4 AS age#6]
>LocalRelation [_1#3,_2#4], [[michael,30]]
> == Optimized Logical Plan ==
> Filter (count(1) > 0)
>  Aggregate [min(age#6) AS _c0#9]
>   Project [_2#4 AS age#6]
>Filter (_2#4 > 0)
> LocalRelation [_1#3,_2#4], [[michael,30]]
> == Physical Plan ==
> Filter (count(1) > 0)
>  TungstenAggregate(key=[], 
> functions=[(min(age#6),mode=Final,isDistinct=false)], output=[_c0#9])
>   TungstenExchange SinglePartition
>TungstenAggregate(key=[], 
> functions=[(min(age#6),mode=Partial,isDistinct=false)], output=[min#12])
> TungstenProject [_2#4 AS age#6]
>  Filter (_2#4 > 0)
>   LocalTableScan [_1#3,_2#4], [[michael,30]]
> Code Generation: true
> {code}
> {code}
> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: count(1)
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:188)
>   at 
> org.apache.spark.sql.catalyst.expressions.Count.eval(aggregates.scala:156)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:327)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11090) Initial code generated construction of Product classes from InternalRow

2015-10-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11090.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9100
[https://github.com/apache/spark/pull/9100]

> Initial code generated construction of Product classes from InternalRow
> ---
>
> Key: SPARK-11090
> URL: https://issues.apache.org/jira/browse/SPARK-11090
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7099) Floating point literals cannot be specified using exponent

2015-10-13 Thread Ryan Pham (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955962#comment-14955962
 ] 

Ryan Pham commented on SPARK-7099:
--

Hi Kevin. It seems like I can't close this JIRA since I'm not its reporter. 
Unfortunately, Peter is no longer with IBM, so I can't contact him to close it.

> Floating point literals cannot be specified using exponent
> --
>
> Key: SPARK-7099
> URL: https://issues.apache.org/jira/browse/SPARK-7099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
> Environment: Windows, Linux, Mac OS X
>Reporter: Peter Hagelund
>Priority: Minor
>
> Floating point literals cannot be expressed in scientific notation using an 
> exponent, like e.g. 1.23E4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10389) support order by non-attribute grouping expression on Aggregate

2015-10-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10389:
-
Fix Version/s: 1.5.2

> support order by non-attribute grouping expression on Aggregate
> ---
>
> Key: SPARK-10389
> URL: https://issues.apache.org/jira/browse/SPARK-10389
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.5.2, 1.6.0
>
>
> For example, we should support "SELECT MAX(value) FROM src GROUP BY key + 1 
> ORDER BY key + 1".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10959) PySpark StreamingLogisticRegressionWithSGD does not train with given regParam and convergenceTol parameters

2015-10-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10959.
---
   Resolution: Fixed
Fix Version/s: (was: 1.6.0)
   1.5.2

Issue resolved by pull request 9087
[https://github.com/apache/spark/pull/9087]

> PySpark StreamingLogisticRegressionWithSGD does not train with given regParam 
> and convergenceTol parameters
> ---
>
> Key: SPARK-10959
> URL: https://issues.apache.org/jira/browse/SPARK-10959
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Critical
>  Labels: backport-needed
> Fix For: 1.5.2
>
>
> These parameters are passed into the StreamingLogisticRegressionWithSGD 
> constructor, but do not get transferred to the model used for training.  
> The same problem exists with StreamingLinearRegressionWithSGD, and the 
> intercept param is in the wrong argument position, where it is being used as 
> the regularization value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-13 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955878#comment-14955878
 ] 

Michael Armbrust commented on SPARK-:
-

I think improving Java compatibility and getting rid of the ClassTags is more 
than a _nice to have_.  Having a separate class hierarchy for Java/Scala makes 
it very hard for people to build higher level libraries that work with both 
Scala and Java.  As a result, I think Java adoption suffers.  ClassTags are 
burdensome for [both Scala and 
Java|https://twitter.com/posco/status/633505168747687936] users.

In order to make encoders work the way we want, nearly every function that 
takes a ClassTag today will need to be changed to take an encoder.  As [~rxin] 
points out, I think that kind of compatibility breaking is actually more 
damaging for a project of Spark's maturity than providing a higher-level API 
parallel to RDDs.

That said, I think source compatibility for common code between RDDs -> 
Datasets would be great to make sure users can make the transition with as 
little pain as possible.
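
To make the signature change concrete, a purely illustrative sketch (the 
Encoder trait below is a stand-in for the example, not the actual Spark SQL 
encoder API):
{code}
import scala.reflect.ClassTag

// Stand-in encoder type class for illustration only.
trait Encoder[T] {
  def toRow(value: T): Array[Any]
  def fromRow(row: Array[Any]): T
}

object ApiSketch {
  // Today: a ClassTag is threaded through operations like this one.
  def mapWithClassTag[T, U: ClassTag](data: Seq[T])(f: T => U): Seq[U] = data.map(f)

  // With encoders, the same operation needs an Encoder for the result type,
  // so nearly every ClassTag-taking signature has to change.
  def mapWithEncoder[T, U](data: Seq[T])(f: T => U)(implicit enc: Encoder[U]): Seq[U] =
    data.map(f)
}
{code}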

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10959) PySpark StreamingLogisticRegressionWithSGD does not train with given regParam and convergenceTol parameters

2015-10-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10959:
--
Fix Version/s: 1.6.0

> PySpark StreamingLogisticRegressionWithSGD does not train with given regParam 
> and convergenceTol parameters
> ---
>
> Key: SPARK-10959
> URL: https://issues.apache.org/jira/browse/SPARK-10959
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Critical
>  Labels: backport-needed
> Fix For: 1.5.2, 1.6.0
>
>
> These parameters are passed into the StreamingLogisticRegressionWithSGD 
> constructor, but do not get transferred to the model used for training.  
> The same problem exists with StreamingLinearRegressionWithSGD, and the 
> intercept param is in the wrong argument position, where it is being used as 
> the regularization value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955877#comment-14955877
 ] 

Reynold Xin commented on SPARK-:


BTW another possible approach that we haven't discussed is that we can start 
with an experimental new API, and in Spark 2.0 rename it to RDD. I'm less in 
favor of this because it still means applications can't update to Spark 2.0 
without rewriting.


> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955868#comment-14955868
 ] 

Reynold Xin edited comment on SPARK- at 10/13/15 11:00 PM:
---

[~sandyr] Your concern is absolutely valid, but I don't think your EncodedRDD 
proposal works. For one, the map function (every other function that returns a 
type different from RDD's own T) will break. For two, the whole concept of 
PairRDDFunctions should go away with this new API.

As I said, it's actually my preference to just use the RDD API. But if you take 
a look at what's needed here, it'd break too many functions. So we have the 
following choices:

1. Don't create a new API, and break the RDD API. People then can't update to 
newer versions of Spark unless they rewrite their apps. We did this with the 
SchemaRDD -> DataFrame change, which went well -- but SchemaRDD wasn't really 
an advertised API back then.

2. Create a new API, and keep RDD API intact. People can update to new versions 
of Spark, but they can't take full advantage of all the Tungsten/DataFrame work 
immediately unless they rewrite their apps. Maybe we can implement the RDD API 
later in some cases using the new API so legacy apps can still take advantage 
whenever possible (e.g. inferring encoder based on classtags when possible). 

Also the RDD API as I see it today is actually a pretty good way for developers 
to provide data (i.e. used for data sources). If we break it, we'd still need 
to come up with a new data input API.





was (Author: rxin):
[~sandyr] Your concern is absolutely valid, but I don't think your EncodedRDD 
proposal works. For one, the map function (every other function that returns a 
type different from RDD's own T) will break. For two, the whole concept of 
PairRDDFunctions should go away with this new API.

As I said, it's actually my preference to just use the RDD API. But if you take 
a look at what's needed here, it'd break too many functions. So we have the 
following choices:

1. Don't create a new API, and break the RDD API. People then can't update to 
newer versions of Spark unless they rewrite their apps. We did this with the 
SchemaRDD -> DataFrame change, which went well -- but SchemaRDD wasn't really 
an advertised API back then.

2. Create a new API, and keep RDD API intact. People can update to new versions 
of Spark, but it can't take full advantage of all the Tungsten/DataFrame work 
immediately unless they rewrite their apps. Maybe we can implement the RDD API 
later in some cases using the new API so legacy apps can still take advantage 
whenever possible (e.g. inferring encoder based on classtags when possible). 

Also the RDD API as I see it today is actually a pretty good way for developers 
to provide data (i.e. used for data sources). If we break it, we'd still need 
to come up with a new data input API.




> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 

[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955868#comment-14955868
 ] 

Reynold Xin commented on SPARK-:


[~sandyr] Your concern is absolutely valid, but I don't think your EncodedRDD 
proposal works. For one, the map function (every other function that returns a 
type different from RDD's own T) will break. For two, the whole concept of 
PairRDDFunctions should go away with this new API.

As I said, it's actually my preference to just use the RDD API. But if you take 
a look at what's needed here, it'd break too many functions. So we have the 
following choices:

1. Don't create a new API, and break the RDD API. People then can't update to 
newer versions of Spark unless they rewrite their apps. We did this with the 
SchemaRDD -> DataFrame change, which went well -- but SchemaRDD wasn't really 
an advertised API back then.

2. Create a new API, and keep RDD API intact. People can update to new versions 
of Spark, but it can't take full advantage of all the Tungsten/DataFrame work 
immediately unless they rewrite their apps. Maybe we can implement the RDD API 
later in some cases using the new API so legacy apps can still take advantage 
whenever possible (e.g. inferring encoder based on classtags when possible). 

Also the RDD API as I see it today is actually a pretty good way for developers 
to provide data (i.e. used for data sources). If we break it, we'd still need 
to come up with a new data input API.




> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11059) ML: change range of quantile probabilities in AFTSurvivalRegression

2015-10-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11059.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9083
[https://github.com/apache/spark/pull/9083]

> ML: change range of quantile probabilities in AFTSurvivalRegression
> ---
>
> Key: SPARK-11059
> URL: https://issues.apache.org/jira/browse/SPARK-11059
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Kai Jiang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Values of the quantile probabilities array should be in the range (0, 1) 
> instead of \[0,1\]
> [AFTSurvivalRegression.scala#L62|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala#L62]
>  according to [Discussion | 
> https://github.com/apache/spark/pull/8926#discussion-diff-40698242]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11059) ML: change range of quantile probabilities in AFTSurvivalRegression

2015-10-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11059:
--
Assignee: Kai Jiang

> ML: change range of quantile probabilities in AFTSurvivalRegression
> ---
>
> Key: SPARK-11059
> URL: https://issues.apache.org/jira/browse/SPARK-11059
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Kai Jiang
>Assignee: Kai Jiang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Values of the quantile probabilities array should be in the range (0, 1) 
> instead of \[0,1\]
> [AFTSurvivalRegression.scala#L62|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala#L62]
>  according to [Discussion | 
> https://github.com/apache/spark/pull/8926#discussion-diff-40698242]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5328) Update PySpark MLlib NaiveBayes API to take model type parameter for Bernoulli fit

2015-10-13 Thread Bhargav Mangipudi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955855#comment-14955855
 ] 

Bhargav Mangipudi commented on SPARK-5328:
--

I can take up this work item.

> Update PySpark MLlib NaiveBayes API to take model type parameter for 
> Bernoulli fit
> --
>
> Key: SPARK-5328
> URL: https://issues.apache.org/jira/browse/SPARK-5328
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Leah McGuire
>Priority: Minor
>  Labels: mllib
>
> [SPARK-4894] (Add Bernoulli-variant of Naive Bayes) added Bernoulli fitting to 
> NaiveBayes.scala; the Python API needs to be updated to accept a model type parameter.
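For reference, a minimal sketch of the existing Scala MLlib call that the Python API
would mirror (the tiny dataset and app name here are made up for illustration):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object BernoulliNaiveBayesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "nb-model-type-sketch")
    // Bernoulli fitting expects 0/1 feature values.
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.0))))
    // Scala already takes a model type ("multinomial" or "bernoulli");
    // the ask here is to expose the same parameter in PySpark.
    val model = NaiveBayes.train(training, lambda = 1.0, modelType = "bernoulli")
    println(model.predict(Vectors.dense(0.0, 1.0, 0.0)))
    sc.stop()
  }
}
{code}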



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-10-13 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955840#comment-14955840
 ] 

Sandy Ryza commented on SPARK-9999:
---

If I understand correctly, it seems like there are ways to work around each of 
these issues that, necessarily, make the API dirtier, but avoid the need for a 
whole new public API.

* groupBy: deprecate the old groupBy and add a groupWith or groupby method that 
returns a GroupedRDD (a rough sketch follows this list).
* partitions: have -1 be a special value that means "determined by the planner".
* encoders: what are the main obstacles to addressing this with an EncodedRDD 
that extends RDD?
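A rough, hypothetical sketch of the GroupedRDD idea (GroupedRDD and its agg method are
invented names here, not an existing Spark API): the grouped handle stays lazy and the
aggregation runs through aggregateByKey with combiners, instead of materializing an
Iterable of values per key the way RDD.groupBy does.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical: a lazy grouping handle, analogous to DataFrame.groupBy returning GroupedData.
class GroupedRDD[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) {
  def agg[U: ClassTag](zero: U)(seq: (U, V) => U, comb: (U, U) => U): RDD[(K, U)] =
    self.aggregateByKey(zero)(seq, comb)
}

object GroupedRDDSketch {
  def grouped[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): GroupedRDD[K, V] =
    new GroupedRDD(rdd)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "grouped-rdd-sketch")
    val counts = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3))
    // Sum per key without ever building an Iterable of all values for a key.
    grouped(counts).agg(0)(_ + _, _ + _).collect().foreach(println)
    sc.stop()
  }
}
{code}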

Regarding the issues Michael brought up:
I'd love to get rid of class tags from the public API as well as take out 
JavaRDD, but these seem more like "nice to have" than core to the proposal.  Am 
I misunderstanding?

All of these of course add ugliness, but I think it's really easy to 
underestimate the cost of introducing a new API.  Applications everywhere 
become legacy and need to be rewritten to take advantage of new features.  Code 
examples and training materials everywhere become invalidated.  Can we point to 
systems that have successfully made a transition like this at this point in 
their maturity?

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
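As a sketch of the "variety of object models" point above: case classes and tuples
would get encoders out of the box, with a generic Kryo-backed encoder as a fallback
(createDataset and Encoders below follow the proposal; treat the exact names as
assumptions):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.{Encoders, SQLContext}

case class Event(id: Long, kind: String)

object EncoderSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "encoder-sketch")
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Case classes and tuples are covered by the default encoders.
    val events = sqlContext.createDataset(Seq(Event(1L, "click"), Event(2L, "view")))
    val pairs  = sqlContext.createDataset(Seq(("a", 1), ("b", 2)))

    // Anything else can fall back to a generic, slower Kryo-backed encoder.
    val blobs = sqlContext.createDataset(
      Seq(new java.util.BitSet()))(Encoders.kryo[java.util.BitSet])

    println(events.count() + pairs.count() + blobs.count())
    sc.stop()
  }
}
{code}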



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10932) Port two minor changes to release packaging scripts back into Spark repo

2015-10-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10932:
---
Fix Version/s: 1.5.2

> Port two minor changes to release packaging scripts back into Spark repo
> 
>
> Key: SPARK-10932
> URL: https://issues.apache.org/jira/browse/SPARK-10932
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.5.2, 1.6.0
>
>
> Spark's release packaging scripts used to live in separate repositories. 
> Although these scripts are now part of the Spark repo, there are some patches 
> against the old repos that are missing in Spark's copy of the script. As part 
> of the deprecation of those other repos, we should port those changes into 
> Spark's copy of the script. I'll open a PR to do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10932) Port two minor changes to release packaging scripts back into Spark repo

2015-10-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10932.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8986
[https://github.com/apache/spark/pull/8986]

> Port two minor changes to release packaging scripts back into Spark repo
> 
>
> Key: SPARK-10932
> URL: https://issues.apache.org/jira/browse/SPARK-10932
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> Spark's release packaging scripts used to live in separate repositories. 
> Although these scripts are now part of the Spark repo, there are some patches 
> against the old repos that are missing in Spark's copy of the script. As part 
> of the deprecation of those other repos, we should port those changes into 
> Spark's copy of the script. I'll open a PR to do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11080) Incorporate per-JVM id into ExprId to prevent unsafe cross-JVM comparisons

2015-10-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11080:
---
Summary: Incorporate per-JVM id into ExprId to prevent unsafe cross-JVM 
comparisons  (was: NamedExpression.newExprId should only be called on driver)

> Incorporate per-JVM id into ExprId to prevent unsafe cross-JVM comparisons
> ---
>
> Key: SPARK-11080
> URL: https://issues.apache.org/jira/browse/SPARK-11080
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> My understanding of {{NamedExpression.newExprId}} is that it is only intended 
> to be called on the driver. If it is called on executors, then this may lead 
> to scenarios where the same expression id is re-used in two different 
> NamedExpressions.
> More generally, I think that calling {{NamedExpression.newExprId}} within 
> tasks may be an indicator of potential attribute binding bugs. Therefore, I 
> think that we should prevent {{NamedExpression.newExprId}} from being called 
> inside of tasks by throwing an exception when such calls occur. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11080) Incorporate per-JVM id into ExprId to prevent unsafe cross-JVM comparisons

2015-10-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11080:
---
Description: 
In the current implementation of named expressions' ExprIds, we rely on a 
per-JVM AtomicLong to ensure that expression ids are unique within a JVM. 
However, these expression ids will not be globally unique. This opens the 
potential for id collisions if new expression ids happen to be created inside 
of tasks rather than on the driver.

There are currently a few cases where tasks allocate expression ids, which 
happen to be safe because those expressions are never compared to expressions 
created on the driver. In order to guard against the introduction of invalid 
comparisons between driver-created and executor-created expression ids, this 
patch extends ExprId to incorporate a UUID to identify the JVM that created the 
id, which prevents collisions.

  was:
My understanding of {{NamedExpression.newExprId}} is that it is only intended 
to be called on the driver. If it is called on executors, then this may lead to 
scenarios where the same expression id is re-used in two different 
NamedExpressions.

More generally, I think that calling {{NamedExpression.newExprId}} within tasks 
may be an indicator of potential attribute binding bugs. Therefore, I think 
that we should prevent {{NamedExpression.newExprId}} from being called inside 
of tasks by throwing an exception when such calls occur. 


> Incorporate per-JVM id into ExprId to prevent unsafe cross-JVM comparisons
> ---
>
> Key: SPARK-11080
> URL: https://issues.apache.org/jira/browse/SPARK-11080
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> In the current implementation of named expressions' ExprIds, we rely on a 
> per-JVM AtomicLong to ensure that expression ids are unique within a JVM. 
> However, these expression ids will not be globally unique. This opens the 
> potential for id collisions if new expression ids happen to be created inside 
> of tasks rather than on the driver.
> There are currently a few cases where tasks allocate expression ids, which 
> happen to be safe because those expressions are never compared to expressions 
> created on the driver. In order to guard against the introduction of invalid 
> comparisons between driver-created and executor-created expression ids, this 
> patch extends ExprId to incorporate a UUID to identify the JVM that created 
> the id, which prevents collisions.
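An illustrative sketch of the scheme described above (not the actual Catalyst code):
each JVM tags the ids it mints with its own UUID, so ids created in different JVMs can
never compare equal even if their counters collide.

{code}
import java.util.UUID
import java.util.concurrent.atomic.AtomicLong

object ExprIdSketch {
  // One UUID per JVM: the driver and each executor get their own.
  private val jvmId: UUID = UUID.randomUUID()
  private val curId = new AtomicLong()

  case class ExprId(id: Long, jvmId: UUID)

  def newExprId: ExprId = ExprId(curId.getAndIncrement(), jvmId)

  def main(args: Array[String]): Unit = {
    val a = newExprId
    val b = newExprId
    // Within one JVM the counter differs; across JVMs the UUID differs.
    println(a == b) // false
  }
}
{code}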



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11080) NamedExpression.newExprId should only be called on driver

2015-10-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11080.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9093
[https://github.com/apache/spark/pull/9093]

> NamedExpression.newExprId should only be called on driver
> -
>
> Key: SPARK-11080
> URL: https://issues.apache.org/jira/browse/SPARK-11080
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> My understanding of {{NamedExpression.newExprId}} is that it is only intended 
> to be called on the driver. If it is called on executors, then this may lead 
> to scenarios where the same expression id is re-used in two different 
> NamedExpressions.
> More generally, I think that calling {{NamedExpression.newExprId}} within 
> tasks may be an indicator of potential attribute binding bugs. Therefore, I 
> think that we should prevent {{NamedExpression.newExprId}} from being called 
> inside of tasks by throwing an exception when such calls occur. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11032) Failure to resolve having correctly

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11032:


Assignee: (was: Apache Spark)

> Failure to resolve having correctly
> ---
>
> Key: SPARK-11032
> URL: https://issues.apache.org/jira/browse/SPARK-11032
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Michael Armbrust
>Priority: Blocker
>
> This is a regression from Spark 1.4
> {code}
> Seq(("michael", 30)).toDF("name", "age").registerTempTable("people")
> sql("SELECT MIN(t0.age) FROM (SELECT * FROM PEOPLE WHERE age > 0) t0 
> HAVING(COUNT(1) > 0)").explain(true)
> == Parsed Logical Plan ==
> 'Filter cast(('COUNT(1) > 0) as boolean)
>  'Project [unresolvedalias('MIN('t0.age))]
>   'Subquery t0
>'Project [unresolvedalias(*)]
> 'Filter ('age > 0)
>  'UnresolvedRelation [PEOPLE], None
> == Analyzed Logical Plan ==
> _c0: int
> Filter cast((count(1) > cast(0 as bigint)) as boolean)
>  Aggregate [min(age#6) AS _c0#9]
>   Subquery t0
>Project [name#5,age#6]
> Filter (age#6 > 0)
>  Subquery people
>   Project [_1#3 AS name#5,_2#4 AS age#6]
>LocalRelation [_1#3,_2#4], [[michael,30]]
> == Optimized Logical Plan ==
> Filter (count(1) > 0)
>  Aggregate [min(age#6) AS _c0#9]
>   Project [_2#4 AS age#6]
>Filter (_2#4 > 0)
> LocalRelation [_1#3,_2#4], [[michael,30]]
> == Physical Plan ==
> Filter (count(1) > 0)
>  TungstenAggregate(key=[], 
> functions=[(min(age#6),mode=Final,isDistinct=false)], output=[_c0#9])
>   TungstenExchange SinglePartition
>TungstenAggregate(key=[], 
> functions=[(min(age#6),mode=Partial,isDistinct=false)], output=[min#12])
> TungstenProject [_2#4 AS age#6]
>  Filter (_2#4 > 0)
>   LocalTableScan [_1#3,_2#4], [[michael,30]]
> Code Generation: true
> {code}
> {code}
> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: count(1)
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:188)
>   at 
> org.apache.spark.sql.catalyst.expressions.Count.eval(aggregates.scala:156)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:327)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11032) Failure to resolve having correctly

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11032:


Assignee: Apache Spark

> Failure to resolve having correctly
> ---
>
> Key: SPARK-11032
> URL: https://issues.apache.org/jira/browse/SPARK-11032
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Blocker
>
> This is a regression from Spark 1.4
> {code}
> Seq(("michael", 30)).toDF("name", "age").registerTempTable("people")
> sql("SELECT MIN(t0.age) FROM (SELECT * FROM PEOPLE WHERE age > 0) t0 
> HAVING(COUNT(1) > 0)").explain(true)
> == Parsed Logical Plan ==
> 'Filter cast(('COUNT(1) > 0) as boolean)
>  'Project [unresolvedalias('MIN('t0.age))]
>   'Subquery t0
>'Project [unresolvedalias(*)]
> 'Filter ('age > 0)
>  'UnresolvedRelation [PEOPLE], None
> == Analyzed Logical Plan ==
> _c0: int
> Filter cast((count(1) > cast(0 as bigint)) as boolean)
>  Aggregate [min(age#6) AS _c0#9]
>   Subquery t0
>Project [name#5,age#6]
> Filter (age#6 > 0)
>  Subquery people
>   Project [_1#3 AS name#5,_2#4 AS age#6]
>LocalRelation [_1#3,_2#4], [[michael,30]]
> == Optimized Logical Plan ==
> Filter (count(1) > 0)
>  Aggregate [min(age#6) AS _c0#9]
>   Project [_2#4 AS age#6]
>Filter (_2#4 > 0)
> LocalRelation [_1#3,_2#4], [[michael,30]]
> == Physical Plan ==
> Filter (count(1) > 0)
>  TungstenAggregate(key=[], 
> functions=[(min(age#6),mode=Final,isDistinct=false)], output=[_c0#9])
>   TungstenExchange SinglePartition
>TungstenAggregate(key=[], 
> functions=[(min(age#6),mode=Partial,isDistinct=false)], output=[min#12])
> TungstenProject [_2#4 AS age#6]
>  Filter (_2#4 > 0)
>   LocalTableScan [_1#3,_2#4], [[michael,30]]
> Code Generation: true
> {code}
> {code}
> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: count(1)
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:188)
>   at 
> org.apache.spark.sql.catalyst.expressions.Count.eval(aggregates.scala:156)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:327)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11032) Failure to resolve having correctly

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955796#comment-14955796
 ] 

Apache Spark commented on SPARK-11032:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9105

> Failure to resolve having correctly
> ---
>
> Key: SPARK-11032
> URL: https://issues.apache.org/jira/browse/SPARK-11032
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Michael Armbrust
>Priority: Blocker
>
> This is a regression from Spark 1.4
> {code}
> Seq(("michael", 30)).toDF("name", "age").registerTempTable("people")
> sql("SELECT MIN(t0.age) FROM (SELECT * FROM PEOPLE WHERE age > 0) t0 
> HAVING(COUNT(1) > 0)").explain(true)
> == Parsed Logical Plan ==
> 'Filter cast(('COUNT(1) > 0) as boolean)
>  'Project [unresolvedalias('MIN('t0.age))]
>   'Subquery t0
>'Project [unresolvedalias(*)]
> 'Filter ('age > 0)
>  'UnresolvedRelation [PEOPLE], None
> == Analyzed Logical Plan ==
> _c0: int
> Filter cast((count(1) > cast(0 as bigint)) as boolean)
>  Aggregate [min(age#6) AS _c0#9]
>   Subquery t0
>Project [name#5,age#6]
> Filter (age#6 > 0)
>  Subquery people
>   Project [_1#3 AS name#5,_2#4 AS age#6]
>LocalRelation [_1#3,_2#4], [[michael,30]]
> == Optimized Logical Plan ==
> Filter (count(1) > 0)
>  Aggregate [min(age#6) AS _c0#9]
>   Project [_2#4 AS age#6]
>Filter (_2#4 > 0)
> LocalRelation [_1#3,_2#4], [[michael,30]]
> == Physical Plan ==
> Filter (count(1) > 0)
>  TungstenAggregate(key=[], 
> functions=[(min(age#6),mode=Final,isDistinct=false)], output=[_c0#9])
>   TungstenExchange SinglePartition
>TungstenAggregate(key=[], 
> functions=[(min(age#6),mode=Partial,isDistinct=false)], output=[min#12])
> TungstenProject [_2#4 AS age#6]
>  Filter (_2#4 > 0)
>   LocalTableScan [_1#3,_2#4], [[michael,30]]
> Code Generation: true
> {code}
> {code}
> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: count(1)
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:188)
>   at 
> org.apache.spark.sql.catalyst.expressions.Count.eval(aggregates.scala:156)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:327)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7099) Floating point literals cannot be specified using exponent

2015-10-13 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955790#comment-14955790
 ] 

kevin yu commented on SPARK-7099:
-

Hello Ryan: Can you close this JIRA? Thanks.
Kevin

> Floating point literals cannot be specified using exponent
> --
>
> Key: SPARK-7099
> URL: https://issues.apache.org/jira/browse/SPARK-7099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
> Environment: Windows, Linux, Mac OS X
>Reporter: Peter Hagelund
>Priority: Minor
>
> Floating point literals cannot be expressed in scientific notation using an 
> exponent, like e.g. 1.23E4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11088) Optimize DataSourceStrategy.mergeWithPartitionValues

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11088:


Assignee: Cheng Lian  (was: Apache Spark)

> Optimize DataSourceStrategy.mergeWithPartitionValues
> 
>
> Key: SPARK-11088
> URL: https://issues.apache.org/jira/browse/SPARK-11088
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> This method is essentially a projection, but it's implemented in a pretty 
> inefficient way and causes significant boxing cost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11088) Optimize DataSourceStrategy.mergeWithPartitionValues

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11088:


Assignee: Apache Spark  (was: Cheng Lian)

> Optimize DataSourceStrategy.mergeWithPartitionValues
> 
>
> Key: SPARK-11088
> URL: https://issues.apache.org/jira/browse/SPARK-11088
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.1
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> This method is essentially a projection, but it's implemented in a pretty 
> inefficient way and causes significant boxing cost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11088) Optimize DataSourceStrategy.mergeWithPartitionValues

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955765#comment-14955765
 ] 

Apache Spark commented on SPARK-11088:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/9104

> Optimize DataSourceStrategy.mergeWithPartitionValues
> 
>
> Key: SPARK-11088
> URL: https://issues.apache.org/jira/browse/SPARK-11088
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> This method is essentially a projection, but it's implemented in a pretty 
> inefficient way and causes significant boxing cost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

2015-10-13 Thread nirav patel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955489#comment-14955489
 ] 

nirav patel edited comment on SPARK-2365 at 10/13/15 9:46 PM:
--

Is it possible with the current design to do a "scan" over an ordered RDD? The "scan" 
functionality could be similar to an HBase scan.


was (Author: tenstriker):
Can it also have a capability to do a "scan" over ordered rdd? "scan" 
functionality can be similar to hbase scan.

> Add IndexedRDD, an efficient updatable key-value store
> --
>
> Key: SPARK-2365
> URL: https://issues.apache.org/jira/browse/SPARK-2365
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, Spark Core
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This 
> imposes minimal requirements on the storage layer, which only needs to 
> support sequential access, enabling on-disk and serialized storage.
> However, many applications would benefit from a richer interface. Efficient 
> support for point lookups would enable serving data out of RDDs, but it 
> currently requires iterating over an entire partition to find the desired 
> element. Point updates similarly require copying an entire iterator. Joins 
> are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value 
> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key 
> uniqueness and pre-indexing the entries for efficient joins and point 
> lookups, updates, and deletions.
> It would be implemented by (1) hash-partitioning the entries by key, (2) 
> maintaining a hash index within each partition, and (3) using purely 
> functional (immutable and efficiently updatable) data structures to enable 
> efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements a 
> limited form of this functionality in VertexRDD. We envision a variety of 
> other uses for IndexedRDD, including streaming updates to RDDs, direct 
> serving from RDDs, and as an execution strategy for Spark SQL.
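For context on the point-lookup cost mentioned above, a small sketch of the status quo
with the existing RDD API: lookup() on a hash-partitioned pair RDD is routed to a
single partition via the partitioner, but that partition is still scanned linearly;
IndexedRDD would replace the scan with a per-partition index probe.

{code}
import org.apache.spark.{HashPartitioner, SparkContext}

object PointLookupToday {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "point-lookup-sketch")
    val pairs = sc.parallelize(1L to 1000000L)
      .map(k => (k, k * 2))
      .partitionBy(new HashPartitioner(8)) // lets lookup() target one partition
      .cache()

    // One partition is selected by the partitioner, then iterated to find the key.
    println(pairs.lookup(424242L)) // the single value 848484
    sc.stop()
  }
}
{code}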



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11091) Change the flag of spark.sql.canonicalizeView to spark.sql.nativeView

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11091:


Assignee: Yin Huai  (was: Apache Spark)

> Change the flag of spark.sql.canonicalizeView to spark.sql.nativeView
> -
>
> Key: SPARK-11091
> URL: https://issues.apache.org/jira/browse/SPARK-11091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> The meaning of this flag is exactly the opposite. Let's change it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11091) Change the flag of spark.sql.canonicalizeView to spark.sql.nativeView

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955737#comment-14955737
 ] 

Apache Spark commented on SPARK-11091:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/9103

> Change the flag of spark.sql.canonicalizeView to spark.sql.nativeView
> -
>
> Key: SPARK-11091
> URL: https://issues.apache.org/jira/browse/SPARK-11091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> The meaning of this flag is exactly the opposite. Let's change it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11091) Change the flag of spark.sql.canonicalizeView to spark.sql.nativeView

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11091:


Assignee: Apache Spark  (was: Yin Huai)

> Change the flag of spark.sql.canonicalizeView to spark.sql.nativeView
> -
>
> Key: SPARK-11091
> URL: https://issues.apache.org/jira/browse/SPARK-11091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> The meaning of this flag is exactly the opposite. Let's change it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11091) Change the flag of spark.sql.canonicalizeView to spark.sql.nativeView

2015-10-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11091:
-
Description: 
The meaning of this flag is exactly the opposite. Let's change it.


  was:This flag's name is not right. The meaning is exactly the opposite. 


> Change the flag of spark.sql.canonicalizeView to spark.sql.nativeView
> -
>
> Key: SPARK-11091
> URL: https://issues.apache.org/jira/browse/SPARK-11091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> The meaning of this flag is exactly the opposite. Let's change it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11091) Change the flag of spark.sql.canonicalizeView to spark.sql.nativeView

2015-10-13 Thread Yin Huai (JIRA)
Yin Huai created SPARK-11091:


 Summary: Change the flag of spark.sql.canonicalizeView to 
spark.sql.nativeView
 Key: SPARK-11091
 URL: https://issues.apache.org/jira/browse/SPARK-11091
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai


This flag's name is not right. The meaning is exactly the opposite. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10217) Spark SQL cannot handle ordering directive in ORDER BY clauses with expressions

2015-10-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955723#comment-14955723
 ] 

Xiao Li commented on SPARK-10217:
-

Hi, Simeon, 

I tried to reproduce your problem on Spark 1.5.1, but I could not get the 
exception you saw. Here is the plan I got:

== Parsed Logical Plan ==
'Sort [('c1 + 'c1) ASC], true
 'Project [unresolvedalias('c1)]
  'UnresolvedRelation [inMemoryDF], None

== Analyzed Logical Plan ==
c1: string
Sort [(cast(c1#0 as double) + cast(c1#0 as double)) ASC], true
 Project [c1#0]
  Subquery inMemoryDF
   LogicalRDD [c1#0,c2#1,c3#2,c4#3], MapPartitionsRDD[1] at createDataFrame at 
SimpleApp.scala:42

== Optimized Logical Plan ==
Sort [(cast(c1#0 as double) + cast(c1#0 as double)) ASC], true
 Project [c1#0]
  LogicalRDD [c1#0,c2#1,c3#2,c4#3], MapPartitionsRDD[1] at createDataFrame at 
SimpleApp.scala:42

== Physical Plan ==
TungstenSort [(cast(c1#0 as double) + cast(c1#0 as double)) ASC], true, 0
 ConvertToUnsafe
  Exchange rangepartitioning((cast(c1#0 as double) + cast(c1#0 as double)) ASC)
   ConvertToSafe
TungstenProject [c1#0]
 Scan PhysicalRDD[c1#0,c2#1,c3#2,c4#3]

Thanks, 

Xiao Li

> Spark SQL cannot handle ordering directive in ORDER BY clauses with 
> expressions
> ---
>
> Key: SPARK-10217
> URL: https://issues.apache.org/jira/browse/SPARK-10217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>  Labels: SQL, analyzers
>
> Spark SQL supports expressions in ORDER BY clauses, e.g.,
> {code}
> scala> sqlContext.sql("select cnt from test order by (cnt + cnt)")
> res2: org.apache.spark.sql.DataFrame = [cnt: bigint]
> {code}
> However, the analyzer gets confused when there is an explicit ordering 
> directive (ASC/DESC):
> {code}
> scala> sqlContext.sql("select cnt from test order by (cnt + cnt) asc")
> 15/08/25 04:08:02 INFO ParseDriver: Parsing command: select cnt from test 
> order by (cnt + cnt) asc
> org.apache.spark.sql.AnalysisException: extraneous input 'asc' expecting EOF 
> near ''; line 1 pos 40
>   at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:289)
>   at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
>   at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10337) Views are broken

2015-10-13 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949251#comment-14949251
 ] 

Yin Huai edited comment on SPARK-10337 at 10/13/15 9:30 PM:


*Update: The flag will be changed to spark.sql.nativeView.*

https://github.com/apache/spark/pull/8990 introduces initial view support in 
Spark SQL. Right now, we have a way to natively handle view definitions.

The caveat of this implementation is that users have to manually canonicalize 
the SQL string. Otherwise, the semantics of the view can be different. For 
example, for a SQL string {{SELECT a, b FROM table}}, we will save this text to 
the Hive metastore as is instead of saving {{SELECT `table`.`a`, `table`.`b` FROM 
`currentDB`.`table`}}. When the current database is changed, table `table` can 
actually point to a totally different table from the one that the user meant to 
use in the view definition.

This feature is not enabled by default. To enable it, please set 
{{spark.sql.canonicalizeView}} to true.
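An illustration of the caveat above, with made-up database and table names (a sketch of
the failure mode, not a recommended setup):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object ViewCaveatSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "view-caveat-sketch")
    val sqlContext = new HiveContext(sc) // views live in the Hive metastore

    sqlContext.sql("USE db_a")
    sqlContext.sql("CREATE VIEW v AS SELECT a, b FROM events") // text stored as written
    // A manually canonicalized definition would instead be stored as:
    //   SELECT `events`.`a`, `events`.`b` FROM `db_a`.`events`
    sqlContext.sql("USE db_b")
    // With the non-canonicalized text, the unqualified `events` now resolves against
    // db_b (the current database), which may be a completely different table.
    sqlContext.sql("SELECT * FROM db_a.v").show()
    sc.stop()
  }
}
{code}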


was (Author: yhuai):
https://github.com/apache/spark/pull/8990 introduces the initial view supports 
in Spark SQL. Right now, we have a way to natively handle view definition.

The caveat of this implementation is that users have to manually canonicalize 
the SQL string. Otherwise, the semantic of the view can be different. For 
example, for a SQL string {{SELECT a, b FROM table}}, we will save this text to 
Hive metastore as is instead of saving {{SELECT `table`.`a`, `table`.`b` FROM 
`currentDB`.`table`}} in the metastore. When the current database is changed, 
table `table` can actually point to a totally different table than the one that 
the user meant to use in the view definition.

This feature is not enabled by default. To enable it, please set 
{{spark.sql.canonicalizeView}} to true.

> Views are broken
> 
>
> Key: SPARK-10337
> URL: https://issues.apache.org/jira/browse/SPARK-10337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michael Armbrust
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 1.6.0
>
>
> I haven't dug into this yet... but it seems like this should work:
> This works:
> {code}
> SELECT * FROM 100milints
> {code}
> This seems to work:
> {code}
> CREATE VIEW testView AS SELECT * FROM 100milints
> {code}
> This fails:
> {code}
> SELECT * FROM testView
> org.apache.spark.sql.AnalysisException: cannot resolve '100milints.col' given 
> input columns id; line 1 pos 7
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catal

[jira] [Comment Edited] (SPARK-10617) Leap year miscalculated in sql query

2015-10-13 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955691#comment-14955691
 ] 

Davies Liu edited comment on SPARK-10617 at 10/13/15 9:15 PM:
--

If you really want to get 2016-02-28 back, you could do  
date_add(cast('2015-02-28' as date), 365)

We definitely need different way to get different answer.


was (Author: davies):
If you really want to get 2016-02-28 back, you could do  
date_add(cast('2015-02-28' as date), 365)

> Leap year miscalculated in sql query
> 
>
> Key: SPARK-10617
> URL: https://issues.apache.org/jira/browse/SPARK-10617
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: shao lo
>
> -- This is wrong...returns 2016-03-01
> select date_add(add_months(cast('2015-02-28' as date), 1 * 12), 1)
> -- This is right...returns 2016-02-29
> select date_add(cast('2016-02-28' as date), 1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10617) Leap year miscalculated in sql query

2015-10-13 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10617.

Resolution: Not A Problem
  Assignee: Davies Liu

> Leap year miscalculated in sql query
> 
>
> Key: SPARK-10617
> URL: https://issues.apache.org/jira/browse/SPARK-10617
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: shao lo
>Assignee: Davies Liu
>
> -- This is wrong...returns 2016-03-01
> select date_add(add_months(cast('2015-02-28' as date), 1 * 12), 1)
> -- This is right...returns 2016-02-29
> select date_add(cast('2016-02-28' as date), 1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10617) Leap year miscalculated in sql query

2015-10-13 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955691#comment-14955691
 ] 

Davies Liu commented on SPARK-10617:


If you really want to get 2016-02-28 back, you could do  
date_add(cast('2015-02-28' as date), 365)

> Leap year miscalculated in sql query
> 
>
> Key: SPARK-10617
> URL: https://issues.apache.org/jira/browse/SPARK-10617
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: shao lo
>
> -- This is wrong...returns 2016-03-01
> select date_add(add_months(cast('2015-02-28' as date), 1 * 12), 1)
> -- This is right...returns 2016-02-29
> select date_add(cast('2016-02-28' as date), 1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11052) Spaces in the build dir cause failures in the build/mvn script

2015-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11052.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9065
[https://github.com/apache/spark/pull/9065]

> Spaces in the build dir cause failures in the build/mvn script
> ---
>
> Key: SPARK-11052
> URL: https://issues.apache.org/jira/browse/SPARK-11052
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Trystan Leftwich
>Priority: Minor
> Fix For: 1.6.0
>
>
> If you run make-distribution.sh from a path that contains a space, the 
> build/mvn script will fail:
> {code}
> mkdir /tmp/test\ spaces
> cd /tmp/test\ spaces
> git clone https://github.com/apache/spark.git
> cd spark
> ./make-distribution.sh --name spark-1.5-test4 --tgz -Pyarn 
> -Phive-thriftserver -Phive
> {code}
> You will get the following errors
> {code}
> /tmp/test spaces/spark/build/mvn: line 107: cd: /../lib: No such file or 
> directory
> usage: dirname path
> /tmp/test spaces/spark/build/mvn: line 108: cd: /../lib: No such file or 
> directory
> /tmp/test spaces/spark/build/mvn: line 138: /tmp/test: No such file or 
> directory
> /tmp/test spaces/spark/build/mvn: line 140: /tmp/test: No such file or 
> directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11052) Spaces in the build dir cause failures in the build/mvn script

2015-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11052:
--
Assignee: Trystan Leftwich

> Spaces in the build dir cause failures in the build/mvn script
> ---
>
> Key: SPARK-11052
> URL: https://issues.apache.org/jira/browse/SPARK-11052
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Trystan Leftwich
>Assignee: Trystan Leftwich
>Priority: Minor
> Fix For: 1.6.0
>
>
> If you run make-distribution.sh from a path that contains a space, the 
> build/mvn script will fail:
> {code}
> mkdir /tmp/test\ spaces
> cd /tmp/test\ spaces
> git clone https://github.com/apache/spark.git
> cd spark
> ./make-distribution.sh --name spark-1.5-test4 --tgz -Pyarn 
> -Phive-thriftserver -Phive
> {code}
> You will get the following errors
> {code}
> /tmp/test spaces/spark/build/mvn: line 107: cd: /../lib: No such file or 
> directory
> usage: dirname path
> /tmp/test spaces/spark/build/mvn: line 108: cd: /../lib: No such file or 
> directory
> /tmp/test spaces/spark/build/mvn: line 138: /tmp/test: No such file or 
> directory
> /tmp/test spaces/spark/build/mvn: line 140: /tmp/test: No such file or 
> directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11060) Fix some potential NPEs in DStream transformation

2015-10-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11060:
--
Priority: Minor  (was: Major)

> Fix some potential NPEs in DStream transformation
> -
>
> Key: SPARK-11060
> URL: https://issues.apache.org/jira/browse/SPARK-11060
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Saisai Shao
>Priority: Minor
>
> Guard against some potential NPEs when an input stream returns None instead of an 
> empty RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11070) Remove older releases on dist.apache.org

2015-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955675#comment-14955675
 ] 

Sean Owen commented on SPARK-11070:
---

[~pwendell] I updated the site download javascript, but I don't think I can 
actually delete from dist.apache.org (I think it requires PMC). This is 
basically a respin of https://issues.apache.org/jira/browse/SPARK-1449 and I 
think all but 1.3.1, 1.4.1, 1.5.1 can be removed from the active mirrors.

> Remove older releases on dist.apache.org
> 
>
> Key: SPARK-11070
> URL: https://issues.apache.org/jira/browse/SPARK-11070
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Sean Owen
>Priority: Trivial
> Attachments: SPARK-11070.patch
>
>
> dist.apache.org should be periodically cleaned up such that it only includes 
> the latest releases in each active minor release branch. This is to reduce 
> load on mirrors. It can probably lose the 1.2.x releases at this point. In 
> total this would clean out 6 of the 9 releases currently mirrored at 
> https://dist.apache.org/repos/dist/release/spark/ 
> All releases are always archived at archive.apache.org and continue to be 
> available. The JS behind spark.apache.org/downloads.html needs to be updated 
> to point at archive.apache.org for older releases, then.
> There won't be a pull request for this as it's strictly an update to the site 
> hosted in SVN, and the files hosted by Apache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10617) Leap year miscalculated in sql query

2015-10-13 Thread shao lo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955668#comment-14955668
 ] 

shao lo commented on SPARK-10617:
-

Yik!!  I was assuming

   add_months(cast('2015-02-28' as date), 12)

would return 2016-02-28.

The following fiddle indeed shows 29.
http://sqlfiddle.com/#!4/4e915/3

...So the million dollar question is...how would you add a year to 2015-02-28 
and get 2016-02-28?
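For reference, a spark-shell sketch of the two behaviours discussed in this thread
(outputs as reported above): add_months keeps a month-end input at month end, while
date_add counts exact days.

{code}
scala> sqlContext.sql("SELECT add_months(cast('2015-02-28' as date), 12)").show()
// 2016-02-29  -- month end is preserved, hence the extra day in a leap year
scala> sqlContext.sql("SELECT date_add(cast('2015-02-28' as date), 365)").show()
// 2016-02-28  -- no Feb 29 falls between 2015-03-01 and 2016-02-28
{code}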


> Leap year miscalculated in sql query
> 
>
> Key: SPARK-10617
> URL: https://issues.apache.org/jira/browse/SPARK-10617
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: shao lo
>
> -- This is wrong...returns 2016-03-01
> select date_add(add_months(cast('2015-02-28' as date), 1 * 12), 1)
> -- This is right...returns 2016-02-29
> select date_add(cast('2016-02-28' as date), 1)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10983) Implement unified memory manager

2015-10-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10983.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9084
[https://github.com/apache/spark/pull/9084]

> Implement unified memory manager
> 
>
> Key: SPARK-10983
> URL: https://issues.apache.org/jira/browse/SPARK-10983
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.6.0
>
>
> This builds on top of the MemoryManager interface introduced in SPARK-10956. 
> That issue implemented a StaticMemoryManager which implemented legacy 
> behavior. This issue is concerned with implementing a UnifiedMemoryManager 
> (or whatever we call it) according to the design doc posted in SPARK-1.
> Note: the scope of this issue is limited to implementing this new mode 
> without significant refactoring. If necessary, any such refactoring should 
> come later (or earlier) in a separate issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9182) filter and groupBy on DataFrames are not passed through to jdbc source

2015-10-13 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955646#comment-14955646
 ] 

Davies Liu commented on SPARK-9182:
---

For JDBC, I think we could push more predicates (for example, a + b > 3) into the 
remote database, including casts. This is more useful for JDBC than for other 
file-based data sources, so we could spend more effort on it.
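For illustration, the kind of predicate being discussed, reusing the emp DataFrame
setup from the issue below (sal and comm are columns of that table); per the
PostgreSQL log, a compound expression like this is currently evaluated in Spark rather
than shipped as a WHERE clause:

{code}
val url  = "jdbc:postgresql:grahn"
val prop = new java.util.Properties
val emp  = sqlContext.read.jdbc(url, "emp", prop)

// Today only simple equality reaches the remote database; pushing this whole
// expression (casts included) would let PostgreSQL filter before shipping rows.
emp.filter(emp("sal") + emp("comm") > 3000).show()
{code}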

> filter and groupBy on DataFrames are not passed through to jdbc source
> --
>
> Key: SPARK-9182
> URL: https://issues.apache.org/jira/browse/SPARK-9182
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Greg Rahn
>Assignee: Yijie Shen
>Priority: Critical
>
> When running all of these API calls, the only one that passes the filter 
> through to the backend jdbc source is equality.  All filters in these 
> commands should be able to be passed through to the jdbc database source.
> {code}
> val url="jdbc:postgresql:grahn"
> val prop = new java.util.Properties
> val emp = sqlContext.read.jdbc(url, "emp", prop)
> emp.filter(emp("sal") === 5000).show()
> emp.filter(emp("sal") < 5000).show()
> emp.filter("sal = 3000").show()
> emp.filter("sal > 2500").show()
> emp.filter("sal >= 2500").show()
> emp.filter("sal < 2500").show()
> emp.filter("sal <= 2500").show()
> emp.filter("sal != 3000").show()
> emp.filter("sal between 3000 and 5000").show()
> emp.filter("ename in ('SCOTT','BLAKE')").show()
> {code}
> We see from the PostgreSQL query log the following is run, and see that only 
> equality predicates are passed through.
> {code}
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp WHERE 
> sal = 5000
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp WHERE 
> sal = 3000
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10389) support order by non-attribute grouping expression on Aggregate

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955634#comment-14955634
 ] 

Apache Spark commented on SPARK-10389:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9102

> support order by non-attribute grouping expression on Aggregate
> ---
>
> Key: SPARK-10389
> URL: https://issues.apache.org/jira/browse/SPARK-10389
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>
> For example, we should support "SELECT MAX(value) FROM src GROUP BY key + 1 
> ORDER BY key + 1".
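
For readers hitting this before the fix, a common workaround is to alias the grouping 
expression in a subquery so the ORDER BY refers to a plain attribute. A hedged sketch, 
assuming a registered table src(key INT, value INT) as in the example above:

{code}
// Workaround sketch (pre-fix): alias `key + 1` so it becomes an attribute
// that both GROUP BY and ORDER BY can reference.
val maxByKeyPlusOne = sqlContext.sql(
  """SELECT MAX(value) AS max_value
    |FROM (SELECT key + 1 AS k, value FROM src) t
    |GROUP BY k
    |ORDER BY k""".stripMargin)
maxByKeyPlusOne.show()
{code}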



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11008) Spark window function returns inconsistent/wrong results

2015-10-13 Thread Prasad Chalasani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955622#comment-14955622
 ] 

Prasad Chalasani commented on SPARK-11008:
--

great, so it wasn't a "minor" bug after all :)


> Spark window function returns inconsistent/wrong results
> 
>
> Key: SPARK-11008
> URL: https://issues.apache.org/jira/browse/SPARK-11008
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.0, 1.5.0
> Environment: Amazon Linux AMI (Amazon Linux version 2015.09)
>Reporter: Prasad Chalasani
>Priority: Minor
>
> Summary: applying a windowing function on a DataFrame, followed by count(), 
> gives widely varying results in repeated runs: none exceed the correct value, 
> but of course all but one are wrong. On large data sets I sometimes get a 
> result as small as HALF of the correct value.
> A minimal reproducible example is here: 
> (1) start spark-shell
> (2) run these:
> val data = 1.to(100).map(x => (x,1))
> import sqlContext.implicits._
> val tbl = sc.parallelize(data).toDF("id", "time")
> tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> (3) exit the shell (this is important)
> (4) start spark-shell again
> (5) run these:
> import org.apache.spark.sql.expressions.Window
> val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> val win = Window.partitionBy("id").orderBy("time")
> df.select($"id", 
> (rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count()
> I get 98, but the correct result is 100. 
> If I re-run the code in step 5 in the same shell, then the result gets 
> "fixed" and I always get 100.
> Note this is only a minimal example to reproduce the error. In my real 
> application the size of the data is much larger, the window function is not 
> as trivial as the one above (i.e. there are multiple "time" values per "id", 
> etc.), and I sometimes see results as small as HALF of the correct value 
> (e.g. 120,000 while the correct value is 250,000). So this is a serious 
> problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11008) Spark window function returns inconsistent/wrong results

2015-10-13 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955593#comment-14955593
 ] 

Herman van Hovell commented on SPARK-11008:
---

A fix for a bug concerning the use of window functions in cluster mode was pushed to 
master today. This will hopefully fix the window-related problems described here.

Can anyone verify using the latest master?

> Spark window function returns inconsistent/wrong results
> 
>
> Key: SPARK-11008
> URL: https://issues.apache.org/jira/browse/SPARK-11008
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.0, 1.5.0
> Environment: Amazon Linux AMI (Amazon Linux version 2015.09)
>Reporter: Prasad Chalasani
>Priority: Minor
>
> Summary: applying a windowing function on a DataFrame, followed by count(), 
> gives widely varying results in repeated runs: none exceed the correct value, 
> but of course all but one are wrong. On large data sets I sometimes get a 
> result as small as HALF of the correct value.
> A minimal reproducible example is here: 
> (1) start spark-shell
> (2) run these:
> val data = 1.to(100).map(x => (x,1))
> import sqlContext.implicits._
> val tbl = sc.parallelize(data).toDF("id", "time")
> tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> (3) exit the shell (this is important)
> (4) start spark-shell again
> (5) run these:
> import org.apache.spark.sql.expressions.Window
> val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> val win = Window.partitionBy("id").orderBy("time")
> df.select($"id", 
> (rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count()
> I get 98, but the correct result is 100. 
> If I re-run the code in step 5 in the same shell, then the result gets 
> "fixed" and I always get 100.
> Note this is only a minimal example to reproduce the error. In my real 
> application the size of the data is much larger, the window function is not 
> as trivial as the one above (i.e. there are multiple "time" values per "id", 
> etc.), and I sometimes see results as small as HALF of the correct value 
> (e.g. 120,000 while the correct value is 250,000). So this is a serious 
> problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10619) Can't sort columns on Executor Page

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10619:


Assignee: Apache Spark  (was: Thomas Graves)

> Can't sort columns on Executor Page
> ---
>
> Key: SPARK-10619
> URL: https://issues.apache.org/jira/browse/SPARK-10619
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>
> I am using Spark 1.5 running on YARN. When I go to the Executors page, it 
> won't allow sorting of the columns. This used to work in Spark 1.4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10619) Can't sort columns on Executor Page

2015-10-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955577#comment-14955577
 ] 

Apache Spark commented on SPARK-10619:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/9101

> Can't sort columns on Executor Page
> ---
>
> Key: SPARK-10619
> URL: https://issues.apache.org/jira/browse/SPARK-10619
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> I am using Spark 1.5 running on YARN. When I go to the Executors page, it 
> won't allow sorting of the columns. This used to work in Spark 1.4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10619) Can't sort columns on Executor Page

2015-10-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10619:


Assignee: Thomas Graves  (was: Apache Spark)

> Can't sort columns on Executor Page
> ---
>
> Key: SPARK-10619
> URL: https://issues.apache.org/jira/browse/SPARK-10619
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> I am using Spark 1.5 running on YARN. When I go to the Executors page, it 
> won't allow sorting of the columns. This used to work in Spark 1.4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7402) JSON serialization of standard params

2015-10-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7402:
-
Summary: JSON serialization of standard params  (was: JSON serialization of 
params)

> JSON serialization of standard params
> -
>
> Key: SPARK-7402
> URL: https://issues.apache.org/jira/browse/SPARK-7402
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.6.0
>
>
> Add JSON support to Param in order to persist parameters with transformers, 
> estimators, and models.
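
As a rough illustration of what "JSON serialization of standard params" could look 
like, here is a hedged sketch using json4s (which Spark already depends on); the param 
names are made up and this is not the actual ml.param implementation.

{code}
// Sketch only: render a few hypothetical standard params as a JSON string
// and read one of them back. Not the actual Spark ML Params code.
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

object ParamsJsonSketch extends App {
  implicit val formats = DefaultFormats

  // Hypothetical param name/value pairs for an estimator.
  val paramsJson = ("maxIter" -> 10) ~ ("regParam" -> 0.01) ~ ("fitIntercept" -> true)

  val serialized = compact(render(paramsJson))
  println(serialized)   // {"maxIter":10,"regParam":0.01,"fitIntercept":true}

  val maxIter = (parse(serialized) \ "maxIter").extract[Int]
  println(maxIter)      // 10
}
{code}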



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7402) JSON serialization of params

2015-10-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7402.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9090
[https://github.com/apache/spark/pull/9090]

> JSON serialization of params
> 
>
> Key: SPARK-7402
> URL: https://issues.apache.org/jira/browse/SPARK-7402
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.6.0
>
>
> Add JSON support to Param in order to persist parameters with transformers, 
> estimators, and models.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10388) Public dataset loader interface

2015-10-13 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955567#comment-14955567
 ] 

Xiangrui Meng commented on SPARK-10388:
---

Discussed with [~rams] offline and he is interested in working together on this 
feature.

> Public dataset loader interface
> ---
>
> Key: SPARK-10388
> URL: https://issues.apache.org/jira/browse/SPARK-10388
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> It is very useful to have a public dataset loader to fetch ML datasets from 
> popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, 
> requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> Users should be able to list (or preview) datasets, e.g.
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the 
> API and implementation are pending discussion. Note that this requires http 
> and https support.
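
Since the registration API is explicitly left open here, the following is only a 
hypothetical sketch of what a repository hook could look like; the names 
(DatasetRepository, DatasetLoader.register) are illustrative, not an actual Spark API.

{code}
// Hypothetical sketch only -- the API and implementation are pending discussion.
import org.apache.spark.sql.{DataFrame, SQLContext}

trait DatasetRepository {
  def name: String
  // List available datasets as a local DataFrame (name, size, format, ...).
  def ls(sqlContext: SQLContext): DataFrame
  // Fetch (or read from a local cache) one dataset and return it as a DataFrame.
  def get(sqlContext: SQLContext, dataset: String): DataFrame
}

class DatasetLoader(sqlContext: SQLContext) {
  private val repos = scala.collection.mutable.Map.empty[String, DatasetRepository]

  // A third-party package would call this once to register its repo, e.g. "libsvm".
  def register(repo: DatasetRepository): Unit = repos(repo.name) = repo

  def ls(repo: String): DataFrame = repos(repo).ls(sqlContext)

  def get(repo: String, dataset: String): DataFrame = repos(repo).get(sqlContext, dataset)
}
{code}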



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-13 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1499#comment-1499
 ] 

Zhan Zhang commented on SPARK-11087:


No matter whether the table is sorted or not, the predicate pushdown should happen. 
We need to first add some debug messages on the driver side to make sure it happens.
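
For reference, a spark-shell style sketch of a driver-side check (Spark 1.5-era APIs; 
the exact DEBUG messages emitted by the ORC data source are not guaranteed, so this 
only shows where to look):

{code}
// Turn on DEBUG logging on the driver and inspect the plan for the filtered query.
sc.setLogLevel("DEBUG")
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

val df = hiveContext.sql(
  "select u, v from 4D where zone = 2 and x = 320 and y = 117")

// The extended plan (and the driver log at DEBUG level) should reflect the
// predicate we expect to be pushed down to the ORC reader.
df.explain(true)
df.show()
{code}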

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external Hive table stored as a partitioned ORC file (see the table 
> schema below). I tried to query the table with a where clause:
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 117")
> But from the log file, with debug logging level on, the ORC pushdown predicate 
> was not generated.
> Unfortunately my table was not sorted when I inserted the data, but I expected 
> the ORC pushdown predicate to be generated anyway (because of the where clause).
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_name              data_type    comment
>
> date                    int
> hh                      int
> x                       int
> y                       int
> height                  float
> u                       float
> v                       float
> w                       float
> ph                      float
> phb                     float
> t                       float
> p                       float
> pb                      float
> qvapor                  float
> qgraup                  float
> qnice                   float
> qnrain                  float
> tke_pbl                 float
> el_pbl                  float
> qcloud                  float
>
> # Partition Information
> # col_name              data_type    comment
>
> zone                    int
> z                       int
> year                    int
> month                   int
>
> # Detailed Table Information
> Database:               default
> Owner:                  patcharee
> CreateTime:             Thu Jul 09 16:46:54 CEST 2015
> LastAccessTime:         UNKNOWN
> Protect Mode:           None
> Retention:              0
> Location:               hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D
> Table Type:             EXTERNAL_TABLE
> Table Parameters:
>   EXTERNAL                TRUE
>   comment                 this table is imported from rwf_data/*/wrf/*
>   last_modified_by        patcharee
>   last_modified_time      1439806692
>   orc.compress            ZLIB
>   transient_lastDdlTime   1439806692
>
> # Storage Information
> SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
> Compressed:             No
> Num Buckets:            -1
> Bucket Columns:         []
> Sort Columns:           []
> Storage Desc Params:
>   serialization.format    1
> Time taken: 0.388 seconds, Fetched: 58 row(s)
>

[jira] [Commented] (SPARK-10619) Can't sort columns on Executor Page

2015-10-13 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955548#comment-14955548
 ] 

Thomas Graves commented on SPARK-10619:
---

Looks like this was broken by this commit: 
https://github.com/apache/spark/commit/fb1d06fc242ec00320f1a3049673fbb03c4a6eb9#diff-b8adb646ef90f616c34eb5c98d1ebd16

It looks like some things were changed to use UIUtils.listingTable, but the 
executors page wasn't converted, so when "sortable" was removed from 
UIUtils.TABLE_CLASS_NOT_STRIPED it broke this page.
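
Purely as an illustration of the mechanism (not the actual patch): the Web UI's 
sorttable JavaScript only sorts tables that carry the "sortable" CSS class, so the 
likely shape of the fix is to give the executors table that class back. The constant 
value below is a stand-in, not the real definition:

{code}
// Illustrative guess at the shape of the fix; not the real ExecutorsPage code.
object ExecutorsTableSketch extends App {
  // Stand-in for UIUtils.TABLE_CLASS_NOT_STRIPED after "sortable" was dropped from it.
  val TABLE_CLASS_NOT_STRIPED = "table table-bordered table-condensed"

  // Re-adding "sortable" lets sorttable.js make the executors table sortable again.
  val executorsTableClass = TABLE_CLASS_NOT_STRIPED + " sortable"

  println(s"""<table class="$executorsTableClass">...</table>""")
}
{code}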

> Can't sort columns on Executor Page
> ---
>
> Key: SPARK-10619
> URL: https://issues.apache.org/jira/browse/SPARK-10619
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> I am using Spark 1.5 running on YARN. When I go to the Executors page, it 
> won't allow sorting of the columns. This used to work in Spark 1.4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10619) Can't sort columns on Executor Page

2015-10-13 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-10619:
-

Assignee: Thomas Graves

> Can't sort columns on Executor Page
> ---
>
> Key: SPARK-10619
> URL: https://issues.apache.org/jira/browse/SPARK-10619
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> I am using Spark 1.5 running on YARN. When I go to the Executors page, it 
> won't allow sorting of the columns. This used to work in Spark 1.4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

2015-10-13 Thread nirav patel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955489#comment-14955489
 ] 

nirav patel edited comment on SPARK-2365 at 10/13/15 7:12 PM:
--

Could it also have the capability to do a "scan" over an ordered RDD? The "scan" 
functionality could be similar to an HBase scan.


was (Author: tenstriker):
Can it also have a capability to do a "scan" over ordered rdd? 

> Add IndexedRDD, an efficient updatable key-value store
> --
>
> Key: SPARK-2365
> URL: https://issues.apache.org/jira/browse/SPARK-2365
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, Spark Core
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This 
> imposes minimal requirements on the storage layer, which only needs to 
> support sequential access, enabling on-disk and serialized storage.
> However, many applications would benefit from a richer interface. Efficient 
> support for point lookups would enable serving data out of RDDs, but it 
> currently requires iterating over an entire partition to find the desired 
> element. Point updates similarly require copying an entire iterator. Joins 
> are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value 
> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key 
> uniqueness and pre-indexing the entries for efficient joins and point 
> lookups, updates, and deletions.
> It would be implemented by (1) hash-partitioning the entries by key, (2) 
> maintaining a hash index within each partition, and (3) using purely 
> functional (immutable and efficiently updatable) data structures to enable 
> efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements a 
> limited form of this functionality in VertexRDD. We envision a variety of 
> other uses for IndexedRDD, including streaming updates to RDDs, direct 
> serving from RDDs, and as an execution strategy for Spark SQL.
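
To make the proposal concrete, here is a hypothetical sketch of the interface the 
description implies; the real IndexedRDD design (see the attached design doc) may 
differ in names and details.

{code}
// Hypothetical interface sketch only, not the actual IndexedRDD API.
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

abstract class IndexedRDDSketch[V: ClassTag] extends Serializable {
  // Point lookup served by the per-partition hash index.
  def get(key: Long): Option[V]

  // Batched lookups so each partition is touched at most once.
  def multiget(keys: Array[Long]): Map[Long, V]

  // Functional updates: return a new IndexedRDD that shares most structure
  // with the old one (purely functional, efficiently updatable indexes).
  def put(key: Long, value: V): IndexedRDDSketch[V]
  def delete(keys: Array[Long]): IndexedRDDSketch[V]

  // Pre-indexed join with another IndexedRDD sharing the same partitioner,
  // avoiding a shuffle.
  def innerJoin[V2: ClassTag, W: ClassTag](other: IndexedRDDSketch[V2])
      (f: (Long, V, V2) => W): IndexedRDDSketch[W]

  // Escape hatch back to a plain RDD of key-value pairs.
  def toRDD: RDD[(Long, V)]
}
{code}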



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


