[jira] [Updated] (SPARK-20939) Do not duplicate user-defined functions while optimizing logical query plans

2017-06-01 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-20939:
-
Issue Type: Improvement  (was: Bug)

> Do not duplicate user-defined functions while optimizing logical query plans
> 
>
> Key: SPARK-20939
> URL: https://issues.apache.org/jira/browse/SPARK-20939
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Lovasoa
>Priority: Minor
>  Labels: logical_plan, optimizer
>
> Currently, while optimizing a query plan, Spark pushes filters down the query 
> plan tree, so that 
> {code:title=LogicalPlan}
> Join Inner, (a = b)
> :- Filter UDF(a)
> :  +- Relation A
> +- Relation B
> {code}
> becomes 
> {code:title=Optimized LogicalPlan}
> Join Inner, (a = b)
> :- Filter UDF(a)
> :  +- Relation A
> +- Filter UDF(b)
>    +- Relation B
> {code}
> In general, pushing filters down is a good thing, as it reduces the number of 
> records that go through the join.
> However, when the filter is a user-defined function (UDF), we cannot know 
> whether the cost of executing the function twice will be higher than the cost 
> of joining more records.
> So I think the optimizer shouldn't duplicate the user-defined function across 
> the query plan tree. Users will still be able to duplicate the filter 
> themselves if they want to.
> See this question on stackoverflow: 
> https://stackoverflow.com/questions/44291078/how-to-tune-the-query-planner-and-turn-off-an-optimization-in-spark
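
For illustration, a minimal sketch of the scenario (the names and data are 
hypothetical, and it assumes a SparkSession named spark is in scope, e.g. in 
spark-shell): an expensive UDF used as a filter on one side of an equi-join, 
where the optimizer can end up applying the same predicate on the other side 
via the join condition.
{code:title=Sketch (hypothetical)}
import org.apache.spark.sql.functions.udf
import spark.implicits._

// An expensive user-defined predicate.
val expensive = udf { (x: Long) => Thread.sleep(1); x % 2 == 0 }

val left  = spark.range(100).toDF("a")
val right = spark.range(100).toDF("b")

// The filter is written only against column "a"; after optimization the same
// UDF may also show up as a filter on "b", inferred from the condition a = b.
left.filter(expensive($"a")).join(right, $"a" === $"b").explain(true)
{code}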






[jira] [Assigned] (SPARK-20962) Support subquery column aliases in FROM clause

2017-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20962:


Assignee: Apache Spark

> Support subquery column aliases in FROM clause
> --
>
> Key: SPARK-20962
> URL: https://issues.apache.org/jira/browse/SPARK-20962
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
> Fix For: 2.3.0
>
>
> Currently, we do not support subquery column aliases;
> {code}
> scala> sql("SELECT * FROM (SELECT 1 AS col1, 1 AS col2) t(a, b)").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '(' expecting {<EOF>, ',', 'WHERE', 'GROUP', 'ORDER', 
> 'HAVING', 'LIMIT', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 
> 'NATURAL', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 1, pos 45)
> == SQL ==
> SELECT * FROM (SELECT 1 AS col1, 1 AS col2) t(a, b)
> -^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
> {code}
> We could support this by referring;
> http://docs.aws.amazon.com/redshift/latest/dg/r_FROM_clause30.html
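
As a point of comparison, a sketch of a workaround that already parses today: 
put the aliases on the columns inside the subquery rather than after the 
subquery alias (illustrative only).
{code:title=Workaround sketch}
scala> sql("SELECT * FROM (SELECT 1 AS a, 1 AS b) t").show
{code}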






[jira] [Assigned] (SPARK-20962) Support subquery column aliases in FROM clause

2017-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20962:


Assignee: (was: Apache Spark)

> Support subquery column aliases in FROM clause
> --
>
> Key: SPARK-20962
> URL: https://issues.apache.org/jira/browse/SPARK-20962
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
> Fix For: 2.3.0
>
>
> Currently, we do not support subquery column aliases;
> {code}
> scala> sql("SELECT * FROM (SELECT 1 AS col1, 1 AS col2) t(a, b)").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '(' expecting {<EOF>, ',', 'WHERE', 'GROUP', 'ORDER', 
> 'HAVING', 'LIMIT', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 
> 'NATURAL', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 1, pos 45)
> == SQL ==
> SELECT * FROM (SELECT 1 AS col1, 1 AS col2) t(a, b)
> -^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
> {code}
> We could support this by referring;
> http://docs.aws.amazon.com/redshift/latest/dg/r_FROM_clause30.html






[jira] [Commented] (SPARK-20962) Support subquery column aliases in FROM clause

2017-06-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034228#comment-16034228
 ] 

Apache Spark commented on SPARK-20962:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/18185

> Support subquery column aliases in FROM clause
> --
>
> Key: SPARK-20962
> URL: https://issues.apache.org/jira/browse/SPARK-20962
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
> Fix For: 2.3.0
>
>
> Currently, we do not support subquery column aliases;
> {code}
> scala> sql("SELECT * FROM (SELECT 1 AS col1, 1 AS col2) t(a, b)").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '(' expecting {<EOF>, ',', 'WHERE', 'GROUP', 'ORDER', 
> 'HAVING', 'LIMIT', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 
> 'NATURAL', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 1, pos 45)
> == SQL ==
> SELECT * FROM (SELECT 1 AS col1, 1 AS col2) t(a, b)
> -^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
> {code}
> We could support this by referring;
> http://docs.aws.amazon.com/redshift/latest/dg/r_FROM_clause30.html






[jira] [Commented] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-06-01 Thread Nils Grabbert (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034223#comment-16034223
 ] 

Nils Grabbert commented on SPARK-19104:
---

[~marmbrus] Why are you moving this major bug to 2.3.0? As 
[~nathanwilliamgr...@gmail.com] has already mentioned, it is now almost 
impossible to work with case classes.

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 
> 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>  
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311

[jira] [Updated] (SPARK-20950) Improve Serializerbuffersize configurable

2017-06-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20950:
--
Priority: Trivial  (was: Major)

[~heary-cao] please take more care in filling these out. This isn't Major

> Improve Serializerbuffersize configurable
> -
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> 1. Make the SerializerBufferSize of UnsafeShuffleWriter configurable via 
> spark.shuffle.sort.initialSerBufferSize.
> 2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize 
> them in the mergeSpillsWithFileStream function.






[jira] [Updated] (SPARK-20959) Add a parameter to UnsafeExternalSorter to configure filebuffersize

2017-06-01 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20959:
--
Priority: Trivial  (was: Major)

Sounds closely related to SPARK-20950, and I'm not clear about the use case for 
this either

> Add a parameter to UnsafeExternalSorter to configure filebuffersize
> ---
>
> Key: SPARK-20959
> URL: https://issues.apache.org/jira/browse/SPARK-20959
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> Make fileBufferSizeBytes in UnsafeExternalSorter configurable via 
> spark.shuffle.file.buffer.
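
For context, a rough sketch of how a buffer size like this is typically read 
from SparkConf; the key and default below follow the existing 
spark.shuffle.file.buffer convention, and the exact wiring is an assumption, 
not the actual patch.
{code:title=Sketch (assumed wiring)}
import org.apache.spark.SparkConf

val conf = new SparkConf()
// The setting is a size string (e.g. "32k"); convert it to bytes before
// handing it to the sorter's spill writer.
val fileBufferSizeBytes = conf.getSizeAsKb("spark.shuffle.file.buffer", "32k").toInt * 1024
{code}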






[jira] [Updated] (SPARK-20962) Support subquery column aliases in FROM clause

2017-06-01 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-20962:
-
Description: 
Currently, we do not support subquery column aliases;
{code}

scala> sql("SELECT * FROM (SELECT 1 AS col1, 1 AS col2) t(a, b)").show
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '(' expecting {<EOF>, ',', 'WHERE', 'GROUP', 'ORDER', 
'HAVING', 'LIMIT', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 
'NATURAL', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 1, pos 45)


== SQL ==
SELECT * FROM (SELECT 1 AS col1, 1 AS col2) t(a, b)
-^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114)
  at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
{code}
We could support this by referring;
http://docs.aws.amazon.com/redshift/latest/dg/r_FROM_clause30.html

  was:
Currently, we do not support subquery aliases;
{code}

scala> sql("SELECT * FROM (SELECT 1 AS col1, 1 AS col2) t(a, b)").show
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '(' expecting {<EOF>, ',', 'WHERE', 'GROUP', 'ORDER', 
'HAVING', 'LIMIT', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 
'NATURAL', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 1, pos 45)


== SQL ==
SELECT * FROM (SELECT 1 AS col1, 1 AS col2) t(a, b)
-^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114)
  at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
{code}
We could support this by referring;
http://docs.aws.amazon.com/redshift/latest/dg/r_FROM_clause30.html


> Support subquery column aliases in FROM clause
> --
>
> Key: SPARK-20962
> URL: https://issues.apache.org/jira/browse/SPARK-20962
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
> Fix For: 2.3.0
>
>
> Currently, we do not support subquery column aliases;
> {code}
> scala> sql("SELECT * FROM (SELECT 1 AS col1, 1 AS col2) t(a, b)").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '(' expecting {<EOF>, ',', 'WHERE', 'GROUP', 'ORDER', 
> 'HAVING', 'LIMIT', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 
> 'NATURAL', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 1, pos 45)
> == SQL ==
> SELECT * FROM (SELECT 1 AS col1, 1 AS col2) t(a, b)
> -^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
> {code}
> We could support this by referring;
> http://docs.aws.amazon.com/redshift/latest/dg/r_FROM_clause30.html






[jira] [Created] (SPARK-20963) Support column aliases for aliased relation in FROM clause

2017-06-01 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-20963:


 Summary: Support column aliases for aliased relation in FROM clause
 Key: SPARK-20963
 URL: https://issues.apache.org/jira/browse/SPARK-20963
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.1.1
Reporter: Takeshi Yamamuro


Currently, we do not support column aliases for an aliased relation;
{code}
scala> Seq((1, 2), (2, 0)).toDF("id", "value").createOrReplaceTempView("t1")
scala> Seq((1, 2), (2, 0)).toDF("id", "value").createOrReplaceTempView("t2")
scala> sql("SELECT * FROM (t1 JOIN t2)")
scala> sql("SELECT * FROM (t1 INNER JOIN t2 ON t1.id = t2.id) AS t(a, b, c, 
d)").show
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '(' expecting {<EOF>, ',', 'WHERE', 'GROUP', 'ORDER', 
'HAVING', 'LIMIT', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 
'NATURAL', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 1, pos 54)

== SQL ==
SELECT * FROM (t1 INNER JOIN t2 ON t1.id = t2.id) AS t(a, b, c, d)
--^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(Spa
{code}
We could support this by referring;
http://docs.aws.amazon.com/redshift/latest/dg/r_FROM_clause30.html
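
Until this is supported, a sketch of a workaround that parses today is to 
rename the columns in the SELECT list instead of after the relation alias 
(the column names a, b, c, d below are illustrative):
{code:title=Workaround sketch}
sql("""
  SELECT t1.id AS a, t1.value AS b, t2.id AS c, t2.value AS d
  FROM t1 INNER JOIN t2 ON t1.id = t2.id
""").show
{code}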






[jira] [Updated] (SPARK-20962) Support subquery column aliases in FROM clause

2017-06-01 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-20962:
-
Summary: Support subquery column aliases in FROM clause  (was: Support 
subquery aliases in FROM clause)

> Support subquery column aliases in FROM clause
> --
>
> Key: SPARK-20962
> URL: https://issues.apache.org/jira/browse/SPARK-20962
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
> Fix For: 2.3.0
>
>
> Currently, we do not support subquery aliases;
> {code}
> scala> sql("SELECT * FROM (SELECT 1 AS col1, 1 AS col2) t(a, b)").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '(' expecting {<EOF>, ',', 'WHERE', 'GROUP', 'ORDER', 
> 'HAVING', 'LIMIT', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 
> 'NATURAL', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 1, pos 45)
> == SQL ==
> SELECT * FROM (SELECT 1 AS col1, 1 AS col2) t(a, b)
> -^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
> {code}
> We could support this by referring;
> http://docs.aws.amazon.com/redshift/latest/dg/r_FROM_clause30.html






[jira] [Commented] (SPARK-20841) Support table column aliases in FROM clause

2017-06-01 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034209#comment-16034209
 ] 

Takeshi Yamamuro commented on SPARK-20841:
--

I made the two sub-tasks based on [~smilegator]'s suggestion: 
https://github.com/apache/spark/pull/18079#issuecomment-304493057

> Support table column aliases in FROM clause
> ---
>
> Key: SPARK-20841
> URL: https://issues.apache.org/jira/browse/SPARK-20841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.3.0
>
>
> Some SQL dialects support a relatively obscure "table column aliases" feature 
> where you can rename columns when aliasing a relation in a {{FROM}} clause. 
> For example:
> {code}
> SELECT * FROM onecolumn AS a(x) JOIN onecolumn AS b(y) ON a.x = b.y
> {code}
> Spark does not currently support this. I would like to add support for this 
> in order to allow me to run a corpus of existing queries which depend on this 
> syntax.
> There's a good writeup on this at 
> http://modern-sql.com/feature/table-column-aliases, which has additional 
> examples and describes other databases' degrees of support for this feature.
> One tricky thing to figure out will be whether FROM clause column aliases 
> take precedence over aliases in the SELECT clause. When adding support for 
> this, we should make sure to add sufficient testing of several corner-cases, 
> including:
> * Aliasing in both the SELECT and FROM clause
> * Aliasing columns in the FROM clause both with and without an explicit AS.
> * Aliasing the wrong number of columns in the FROM clause, both greater and 
> fewer columns than were selected in the SELECT clause.
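
For the first corner case above, a small hypothetical query that makes the 
precedence question concrete once the feature exists:
{code:title=Corner-case sketch (proposed syntax)}
// If both clauses rename the column, which name does the output carry?
sql("SELECT x AS y FROM onecolumn AS a(x)").show
{code}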






[jira] [Created] (SPARK-20962) Support subquery aliases in FROM clause

2017-06-01 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-20962:


 Summary: Support subquery aliases in FROM clause
 Key: SPARK-20962
 URL: https://issues.apache.org/jira/browse/SPARK-20962
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.1.1
Reporter: Takeshi Yamamuro


Currently, we do not support subquery aliases;
{code}

scala> sql("SELECT * FROM (SELECT 1 AS col1, 1 AS col2) t(a, b)").show
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '(' expecting {<EOF>, ',', 'WHERE', 'GROUP', 'ORDER', 
'HAVING', 'LIMIT', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 
'NATURAL', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 1, pos 45)


== SQL ==
SELECT * FROM (SELECT 1 AS col1, 1 AS col2) t(a, b)
-^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114)
  at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
{code}
We could support this by referring;
http://docs.aws.amazon.com/redshift/latest/dg/r_FROM_clause30.html






[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks

2017-06-01 Thread Patrick Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034206#comment-16034206
 ] 

Patrick Brown commented on SPARK-20760:
---

We have been having the same issue in a spark 2.0.2 cluster on yarn for a while 
now.

> Memory Leak of RDD blocks 
> --
>
> Key: SPARK-20760
> URL: https://issues.apache.org/jira/browse/SPARK-20760
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0
>Reporter: Binzi Cao
> Attachments: RDD Blocks .png
>
>
> RDD blocks leak memory in a long-running RDD process.
> We have a long-running application that performs computations on RDDs, and we 
> found that the number of RDD blocks keeps increasing on the Spark UI page. The 
> RDD blocks and memory usage do not match the cached RDDs and memory. It looks 
> like Spark keeps old RDDs in memory and never releases them, or never gets a 
> chance to release them. The job will eventually die of an out-of-memory error.
> In addition, I'm not seeing this issue in Spark 1.6. We are seeing the same 
> issue in YARN cluster mode in both Kafka streaming and batch applications. The 
> issue in streaming is similar, although the RDD blocks seem to grow a bit more 
> slowly than in batch jobs.
> The sample code below reproduces the issue when run in local mode.
> Scala file:
> {code}
> import scala.concurrent.duration.Duration
> import scala.util.{Try, Failure, Success}
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import scala.concurrent._
> import ExecutionContext.Implicits.global
> case class Person(id: String, name: String)
> object RDDApp {
>   def run(sc: SparkContext) = {
> while (true) {
>   val r = scala.util.Random
>   val data = (1 to r.nextInt(100)).toList.map { a =>
> Person(a.toString, a.toString)
>   }
>   val rdd = sc.parallelize(data)
>   rdd.cache
>   println("running")
>   val a = (1 to 100).toList.map { x =>
> Future(rdd.filter(_.id == x.toString).collect)
>   }
>   a.foreach { f =>
> println(Await.ready(f, Duration.Inf).value.get)
>   }
>   rdd.unpersist()
> }
>   }
>   def main(args: Array[String]): Unit = {
>val conf = new SparkConf().setAppName("test")
> val sc   = new SparkContext(conf)
> run(sc)
>   }
> }
> {code}
> build sbt file:
> {code}
> name := "RDDTest"
> version := "0.1.1"
> scalaVersion := "2.11.5"
> libraryDependencies ++= Seq (
> "org.scalaz" %% "scalaz-core" % "7.2.0",
> "org.scalaz" %% "scalaz-concurrent" % "7.2.0",
> "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
> "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
>   )
> addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
> mainClass in assembly := Some("RDDApp")
> test in assembly := {}
> {code}
> To reproduce it: 
> Just 
> {code}
> spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
> --executor-memory 4G \
> --executor-cores 1 \
> --num-executors 1 \
> --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar
> {code}






[jira] [Resolved] (SPARK-20956) External shuffle server timeout

2017-06-01 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20956.
--
Resolution: Duplicate

Sounds like a duplicate of SPARK-20640. Please reopen this if I misunderstood.

> External shuffle server timeout
> ---
>
> Key: SPARK-20956
> URL: https://issues.apache.org/jira/browse/SPARK-20956
> Project: Spark
>  Issue Type: Request
>  Components: Shuffle
>Affects Versions: 2.0.1, 2.0.2, 2.1.1
>Reporter: satheessh chinnusamy
>  Labels: easyfix
>







[jira] [Commented] (SPARK-20365) Not so accurate classpath format for AM and Containers

2017-06-01 Thread lyc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034176#comment-16034176
 ] 

lyc commented on SPARK-20365:
-

Thanks for reviewing.

> Not so accurate classpath format for AM and Containers
> --
>
> Key: SPARK-20365
> URL: https://issues.apache.org/jira/browse/SPARK-20365
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> In Spark on YARN, when configuring "spark.yarn.jars" with local jars (jars 
> using the "local" scheme), we get an inaccurate classpath for the AM and 
> containers. This is because we don't remove the "local" scheme when 
> concatenating the classpath. It still runs, because the classpath is separated 
> with ":" and Java treats "local" as a separate jar, but we could improve this 
> by removing the scheme.
> {code}
> java.class.path = 
> /tmp/hadoop-sshao/nm-local-dir/usercache/sshao/appcache/application_1492057593145_0009/container_1492057593145_0009_01_03:/tmp/hadoop-sshao/nm-local-dir/usercache/sshao/appcache/application_1492057593145_0009/container_1492057593145_0009_01_03/__spark_conf__:/tmp/hadoop-sshao/nm-local-dir/usercache/sshao/appcache/application_1492057593145_0009/container_1492057593145_0009_01_03/__spark_libs__/*:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/activation-1.1.1.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/antlr-2.7.7.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/antlr-runtime-3.4.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/antlr4-runtime-4.5.3.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/aopalliance-1.0.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/aopalliance-repackaged-2.4.0-b34.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/apache-log4j-extras-1.2.17.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/apacheds-i18n-2.0.0-M15.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/apacheds-kerberos-codec-2.0.0-M15.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/api-asn1-api-1.0.0-M20.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/api-util-1.0.0-M20.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/arpack_combined_all-0.1.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/avro-1.7.7.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/avro-ipc-1.7.7-tests.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/avro-ipc-1.7.7.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/avro-mapred-1.7.7-hadoop2.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/base64-2.3.8.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/bcprov-jdk15on-1.51.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/bonecp-0.8.0.RELEASE.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/breeze-macros_2.11-0.12.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/breeze_2.11-0.12.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/calcite-avatica-1.2.0-incubating.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/calcite-core-1.2.0-incubating.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/calcite-linq4j-1.2.0-incubating.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/cglib-2.2.1-v20090111.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/chill-java-0.8.0.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/chill_2.11-0.8.0.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/commons-beanutils-1.7.0.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/commons-beanutils-core-1.8.0.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/comm
ons-cli-1.2.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/commons-codec-1.10.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/commons-collections-3.2.2.jar:local:///Users/sshao/projects/apache-spark/assembly/target/scala-2.11/jars/commons-com

[jira] [Assigned] (SPARK-20961) generalize the dictionary in ColumnVector

2017-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20961:


Assignee: Apache Spark  (was: Wenchen Fan)

> generalize the dictionary in ColumnVector
> -
>
> Key: SPARK-20961
> URL: https://issues.apache.org/jira/browse/SPARK-20961
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-20961) generalize the dictionary in ColumnVector

2017-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20961:


Assignee: Wenchen Fan  (was: Apache Spark)

> generalize the dictionary in ColumnVector
> -
>
> Key: SPARK-20961
> URL: https://issues.apache.org/jira/browse/SPARK-20961
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-20961) generalize the dictionary in ColumnVector

2017-06-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034122#comment-16034122
 ] 

Apache Spark commented on SPARK-20961:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18183

> generalize the dictionary in ColumnVector
> -
>
> Key: SPARK-20961
> URL: https://issues.apache.org/jira/browse/SPARK-20961
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Assigned] (SPARK-20960) make ColumnVector public

2017-06-01 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20960:
---

Assignee: (was: Wenchen Fan)

> make ColumnVector public
> 
>
> Key: SPARK-20960
> URL: https://issues.apache.org/jira/browse/SPARK-20960
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>
> ColumnVector is an internal interface in Spark SQL, which is only used for 
> vectorized parquet reader to represent the in-memory columnar format.
> In Spark 2.3 we want to make ColumnVector public, so that we can provide a 
> more efficient way for data exchanges between Spark and external systems. For 
> example, we can use ColumnVector to build the columnar read API in data 
> source framework, we can use ColumnVector to build a more efficient UDF API, 
> etc.
> We also want to introduce a new ColumnVector implementation based on Apache 
> Arrow(basically just a wrapper over Arrow), so that external systems(like 
> Python Pandas DataFrame) can build ColumnVector very easily.






[jira] [Created] (SPARK-20961) generalize the dictionary in ColumnVector

2017-06-01 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-20961:
---

 Summary: generalize the dictionary in ColumnVector
 Key: SPARK-20961
 URL: https://issues.apache.org/jira/browse/SPARK-20961
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Created] (SPARK-20960) make ColumnVector public

2017-06-01 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-20960:
---

 Summary: make ColumnVector public
 Key: SPARK-20960
 URL: https://issues.apache.org/jira/browse/SPARK-20960
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan


ColumnVector is an internal interface in Spark SQL, which is only used by the 
vectorized parquet reader to represent the in-memory columnar format.

In Spark 2.3 we want to make ColumnVector public, so that we can provide a more 
efficient way to exchange data between Spark and external systems. For example, 
we can use ColumnVector to build the columnar read API in the data source 
framework, to build a more efficient UDF API, etc.

We also want to introduce a new ColumnVector implementation based on Apache 
Arrow (basically just a wrapper over Arrow), so that external systems (like 
Python Pandas DataFrames) can build ColumnVectors very easily.
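
To make the idea concrete, a very rough sketch of what a minimal columnar 
accessor could look like; this is purely illustrative and is not the actual 
ColumnVector API, which is richer and tied to Spark's internal formats.
{code:title=Illustrative sketch (not the actual API)}
// A toy columnar accessor: one column of values addressed by row id.
trait SimpleColumnVector {
  def numRows: Int
  def isNullAt(rowId: Int): Boolean
  def getInt(rowId: Int): Int
  def getLong(rowId: Int): Long
  def getDouble(rowId: Int): Double
}
{code}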






[jira] [Commented] (SPARK-20960) make ColumnVector public

2017-06-01 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034118#comment-16034118
 ] 

Wenchen Fan commented on SPARK-20960:
-

cc [~wesmckinn]

> make ColumnVector public
> 
>
> Key: SPARK-20960
> URL: https://issues.apache.org/jira/browse/SPARK-20960
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>
> ColumnVector is an internal interface in Spark SQL, which is only used for 
> vectorized parquet reader to represent the in-memory columnar format.
> In Spark 2.3 we want to make ColumnVector public, so that we can provide a 
> more efficient way for data exchanges between Spark and external systems. For 
> example, we can use ColumnVector to build the columnar read API in data 
> source framework, we can use ColumnVector to build a more efficient UDF API, 
> etc.
> We also want to introduce a new ColumnVector implementation based on Apache 
> Arrow(basically just a wrapper over Arrow), so that external systems(like 
> Python Pandas DataFrame) can build ColumnVector very easily.






[jira] [Commented] (SPARK-20854) extend hint syntax to support any expression, not just identifiers or strings

2017-06-01 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034115#comment-16034115
 ] 

Felix Cheung commented on SPARK-20854:
--

Seems like it would be good to add support for the same in Python/R as well.

> extend hint syntax to support any expression, not just identifiers or strings
> -
>
> Key: SPARK-20854
> URL: https://issues.apache.org/jira/browse/SPARK-20854
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
>Priority: Blocker
> Fix For: 2.2.0
>
>
> Currently the SQL hint syntax supports as parameters only identifiers while 
> the Dataset hint syntax supports only strings.
> They should support any expression as parameters, for example numbers. This 
> is useful for implementing other hints in the future.
> Examples:
> {code}
> df.hint("hint1", Seq(1, 2, 3))
> df.hint("hint2", "A", 1)
> sql("select /*+ hint1((1,2,3)) */")
> sql("select /*+ hint2('A', 1) */")
> {code}






[jira] [Comment Edited] (SPARK-20149) Audit PySpark code base for 2.6 specific work arounds

2017-06-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034090#comment-16034090
 ] 

Hyukjin Kwon edited comment on SPARK-20149 at 6/2/17 4:32 AM:
--

[~holdenk], I quickly looked through the Python 2.7 changes in 
http://svn.python.org/projects/python/tags/r27/Misc/NEWS. Mostly about the 
words, "backport" and "deprecate". IMHO, I think it is fine to resolve this 
issue for now. If there are 2.6 specific workarounds, I guess there'd be not so 
many instances even if I missed some of them.

I could identify notable changes:

- Issue #2335: Backport set literals syntax from Python 3.x. (for example, {1, 
2} for a set)

- Issue #2333: Backport set and dict comprehensions syntax from Python 3.x. 
(for example {x : 1 for x in [1, 2, 3]} for a dict)

However, I guess these are not required to be changed.


was (Author: hyukjin.kwon):
[~holdenk], I quickly look the Python 2.7 changes in 
http://svn.python.org/projects/python/tags/r27/Misc/NEWS. Mostly about the 
words, "backport" and "deprecate". IMHO, I think it is fine to resolve this 
issue for now. If there are 2.6 specific workarounds, I guess there'd be not so 
many instances even if I missed some of them.

I could identify notable changes:

- Issue #2335: Backport set literals syntax from Python 3.x. (for example, {1, 
2} for a set)

- Issue #2333: Backport set and dict comprehensions syntax from Python 3.x. 
(for example {x : 1 for x in [1, 2, 3]} for a dict)

However, I guess these are not required to be changed.

> Audit PySpark code base for 2.6 specific work arounds
> -
>
> Key: SPARK-20149
> URL: https://issues.apache.org/jira/browse/SPARK-20149
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: holdenk
>
> We should determine what the areas in PySpark are that have specific 2.6 work 
> arounds and create issues for them. The audit can be started during 2.2.0, 
> but cleaning up all the 2.6 specific code is likely too much to try and get 
> in so the actual fixing should probably be considered for 2.2.1 or 2.3 
> (unless 2.2.0 is delayed).






[jira] [Comment Edited] (SPARK-20149) Audit PySpark code base for 2.6 specific work arounds

2017-06-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034090#comment-16034090
 ] 

Hyukjin Kwon edited comment on SPARK-20149 at 6/2/17 3:48 AM:
--

[~holdenk], I quickly look the Python 2.7 changes in 
http://svn.python.org/projects/python/tags/r27/Misc/NEWS. Mostly about the 
words, "backport" and "deprecate". IMHO, I think it is fine to resolve this 
issue for now. If there are 2.6 specific workarounds, I guess there'd be not so 
many instances even if I missed some of them.

I could identify notable changes:

- Issue #2335: Backport set literals syntax from Python 3.x. (for example, {1, 
2} for a set)

- Issue #2333: Backport set and dict comprehensions syntax from Python 3.x. 
(for example {x : 1 for x in [1, 2, 3]} for a dict)

However, I guess these are not required to be changed.


was (Author: hyukjin.kwon):
[~holdenk], I quickly look the Python 2.7 changes in 
http://svn.python.org/projects/python/tags/r27/Misc/NEWS. Mostly about the 
words, "backport" and "deprecate". IMHO, I think it is fine to resolve this 
issue for now. If there are 2.6 specific workarounds, I guess there'd be not so 
many instances. 

I could identify notable changes:

- Issue #2335: Backport set literals syntax from Python 3.x. (for example, {1, 
2} for a set)

- Issue #2333: Backport set and dict comprehensions syntax from Python 3.x. 
(for example {x : 1 for x in [1, 2, 3]} for a dict)

However, I guess these are not required to be changed.

> Audit PySpark code base for 2.6 specific work arounds
> -
>
> Key: SPARK-20149
> URL: https://issues.apache.org/jira/browse/SPARK-20149
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: holdenk
>
> We should determine what the areas in PySpark are that have specific 2.6 work 
> arounds and create issues for them. The audit can be started during 2.2.0, 
> but cleaning up all the 2.6 specific code is likely too much to try and get 
> in so the actual fixing should probably be considered for 2.2.1 or 2.3 
> (unless 2.2.0 is delayed).






[jira] [Commented] (SPARK-20149) Audit PySpark code base for 2.6 specific work arounds

2017-06-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034090#comment-16034090
 ] 

Hyukjin Kwon commented on SPARK-20149:
--

[~holdenk], I quickly look the Python 2.7 changes in 
http://svn.python.org/projects/python/tags/r27/Misc/NEWS. Mostly about the 
words, "backport" and "deprecate". IMHO, I think it is fine to resolve this 
issue for now. If there are 2.6 specific workarounds, I guess there'd be not so 
many instances. 

I could identify notable changes:

- Issue #2335: Backport set literals syntax from Python 3.x. (for example, {1, 
2} for a set)

- Issue #2333: Backport set and dict comprehensions syntax from Python 3.x. 
(for example {x : 1 for x in [1, 2, 3]} for a dict)

However, I guess these are not required to be changed.

> Audit PySpark code base for 2.6 specific work arounds
> -
>
> Key: SPARK-20149
> URL: https://issues.apache.org/jira/browse/SPARK-20149
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: holdenk
>
> We should determine what the areas in PySpark are that have specific 2.6 work 
> arounds and create issues for them. The audit can be started during 2.2.0, 
> but cleaning up all the 2.6 specific code is likely too much to try and get 
> in so the actual fixing should probably be considered for 2.2.1 or 2.3 
> (unless 2.2.0 is delayed).






[jira] [Commented] (SPARK-15682) Hive ORC partition write looks for root hdfs folder for existence

2017-06-01 Thread lyc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034076#comment-16034076
 ] 

lyc commented on SPARK-15682:
-

Hi, I tried this for both `orc` and `parquet`, and they both throw `path 
already exists`. The reason is that Spark checks whether the path in 
`save(path)` exists; if it exists and the mode is `ErrorIfExists` (the 
default), it throws. You can overwrite the whole table by specifying 
`mode("overwrite")`, but there seems to be no way to overwrite a specific 
partition. By the way, if you try 
`save("test.sms_outbound_view_orc/proc_date=2016-05-30")`, the path will be 
treated as a table path, so if you succeed, the final partition path for 
`2016-05-30` will be 
`test.sms_outbound_view_orc/proc_date=2016-05-30/proc_date=2016-05-30`.

What do you mean by `have handle to the hive table`?
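
A sketch of the workaround described above (the names follow the issue's own 
snippet; note that mode("overwrite") replaces the whole table path, not just 
one partition):
{code:title=Workaround sketch}
result_partition.write
  .format("orc")
  .mode("overwrite")           // replaces the entire output path
  .partitionBy("proc_date")
  .save("test.sms_outbound_view_orc")
{code}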

> Hive ORC partition write looks for root hdfs folder for existence
> -
>
> Key: SPARK-15682
> URL: https://issues.apache.org/jira/browse/SPARK-15682
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Dipankar
>
> Scenario:
> I am using the program below to create a new partition based on the current 
> date, which signifies the run date.
> However, it fails, citing that the HDFS folder already exists. It checks the 
> root folder, not the new partition value.
> Is the partitionBy clause actually not checking the Hive metastore or the 
> folder up to proc_date=some value? Is it just a way to create folders based on 
> the partition key, not related to Hive partitions in any way?
> Alternatively, should I use
> result.write.format("orc").save("test.sms_outbound_view_orc/proc_date=2016-05-30")
> to achieve the result?
> But this will not update the Hive metastore with new partition details.
> Is Spark's ORC support not equivalent to the HCatStorer API?
> My Hive table is built with proc_date as the partition column. 
> Source code :
> result.registerTempTable("result_tab")
> val result_partition = sqlContext.sql("FROM result_tab select 
> *,'"+curr_date+"' as proc_date")
> result_partition.write.format("orc").partitionBy("proc_date").save("test.sms_outbound_view_orc")
> Exception
> 16/05/31 15:57:34 INFO ParseDriver: Parsing command: FROM result_tab select 
> *,'2016-05-31' as proc_date
> 16/05/31 15:57:34 INFO ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: path 
> hdfs://hdpprod/user/dipankar.ghosal/test.sms_outbound_view_orc already 
> exists.;
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   at SampleApp$.main(SampleApp.scala:31)






[jira] [Updated] (SPARK-20950) Improve Serializerbuffersize configurable

2017-06-01 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20950:
--
Component/s: (was: SQL)

> Improve Serializerbuffersize configurable
> -
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>
> 1. Make the SerializerBufferSize of UnsafeShuffleWriter configurable via 
> spark.shuffle.sort.initialSerBufferSize.
> 2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize 
> them in the mergeSpillsWithFileStream function.






[jira] [Assigned] (SPARK-20959) Add a parameter to UnsafeExternalSorter to configure filebuffersize

2017-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20959:


Assignee: (was: Apache Spark)

> Add a parameter to UnsafeExternalSorter to configure filebuffersize
> ---
>
> Key: SPARK-20959
> URL: https://issues.apache.org/jira/browse/SPARK-20959
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>
> Make fileBufferSizeBytes in UnsafeExternalSorter configurable via 
> spark.shuffle.file.buffer.






[jira] [Assigned] (SPARK-20959) Add a parameter to UnsafeExternalSorter to configure filebuffersize

2017-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20959:


Assignee: Apache Spark

> Add a parameter to UnsafeExternalSorter to configure filebuffersize
> ---
>
> Key: SPARK-20959
> URL: https://issues.apache.org/jira/browse/SPARK-20959
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Assignee: Apache Spark
>
> Make fileBufferSizeBytes in UnsafeExternalSorter configurable via 
> spark.shuffle.file.buffer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20959) Add a parameter to UnsafeExternalSorter to configure filebuffersize

2017-06-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034060#comment-16034060
 ] 

Apache Spark commented on SPARK-20959:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/18182

> Add a parameter to UnsafeExternalSorter to configure filebuffersize
> ---
>
> Key: SPARK-20959
> URL: https://issues.apache.org/jira/browse/SPARK-20959
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>
> Make fileBufferSizeBytes in UnsafeExternalSorter configurable via 
> spark.shuffle.file.buffer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20959) Add a parameter to UnsafeExternalSorter to configure filebuffersize

2017-06-01 Thread caoxuewen (JIRA)
caoxuewen created SPARK-20959:
-

 Summary: Add a parameter to UnsafeExternalSorter to configure 
filebuffersize
 Key: SPARK-20959
 URL: https://issues.apache.org/jira/browse/SPARK-20959
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.2.0
Reporter: caoxuewen


Make fileBufferSizeBytes in UnsafeExternalSorter configurable via 
spark.shuffle.file.buffer.
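
As a rough sketch of the idea (not the actual patch; how it gets wired into 
UnsafeExternalSorter is assumed here), the spill-file buffer size would come from 
the existing spark.shuffle.file.buffer setting rather than a hard-coded constant:

{code}
import org.apache.spark.SparkConf

// Illustrative sketch only: read the spill-file buffer size from the existing
// spark.shuffle.file.buffer setting (value in KiB, "32k" by default) instead of
// hard-coding it inside UnsafeExternalSorter.
val conf = new SparkConf()
val fileBufferSizeBytes: Int =
  (conf.getSizeAsKb("spark.shuffle.file.buffer", "32k") * 1024).toInt

println(s"spill file buffer: $fileBufferSizeBytes bytes")
{code}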



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20950) Improve Serializerbuffersize configurable

2017-06-01 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20950:
--
Description: 
1. Make SerializerBufferSize in UnsafeShuffleWriter configurable via 
spark.shuffle.sort.initialSerBufferSize.
2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them 
in the mergeSpillsWithFileStream function instead.

  was:
1.With spark.shuffle.file.buffer configure fileBufferSizeBytes of 
UnsafeExternalSorter .  
2.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize 
of UnsafeShuffleWriter.
3.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function.


> Improve Serializerbuffersize configurable
> -
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>
> 1. Make SerializerBufferSize in UnsafeShuffleWriter configurable via 
> spark.shuffle.sort.initialSerBufferSize.
> 2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them 
> in the mergeSpillsWithFileStream function instead.
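
A minimal sketch of how the proposed setting might be read. Note that 
spark.shuffle.sort.initialSerBufferSize is the key proposed in this ticket, not an 
existing Spark configuration, and the "1m" fallback used below is an assumption:

{code}
import org.apache.spark.SparkConf

// Illustrative sketch only: UnsafeShuffleWriter would read its initial
// serializer buffer size from the conf instead of a hard-coded constant.
// The configuration key is the one proposed in this ticket.
val conf = new SparkConf()
val initialSerBufferSize: Int =
  conf.getSizeAsBytes("spark.shuffle.sort.initialSerBufferSize", "1m").toInt

println(s"initial serializer buffer: $initialSerBufferSize bytes")
{code}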



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20950) Improve Serializerbuffersize configurable

2017-06-01 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20950:
--
Summary: Improve Serializerbuffersize configurable  (was: Improve 
Serializerbuffersize and filebuffersize configurable)

> Improve Serializerbuffersize configurable
> -
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>
> 1. Make fileBufferSizeBytes in UnsafeExternalSorter configurable via 
> spark.shuffle.file.buffer.
> 2. Make SerializerBufferSize in UnsafeShuffleWriter configurable via 
> spark.shuffle.sort.initialSerBufferSize.
> 3. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them 
> in the mergeSpillsWithFileStream function instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20950) Improve Serializerbuffersize and filebuffersize configurable

2017-06-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034053#comment-16034053
 ] 

Apache Spark commented on SPARK-20950:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/18182

> Improve Serializerbuffersize and filebuffersize configurable
> 
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>
> 1. Make fileBufferSizeBytes in UnsafeExternalSorter configurable via 
> spark.shuffle.file.buffer.
> 2. Make SerializerBufferSize in UnsafeShuffleWriter configurable via 
> spark.shuffle.sort.initialSerBufferSize.
> 3. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them 
> in the mergeSpillsWithFileStream function instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20935) A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating StreamingContext.

2017-06-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034036#comment-16034036
 ] 

Hyukjin Kwon commented on SPARK-20935:
--

Thanks for pinging me. Could we just always stop() {{ReceivedBlockTracker}}, 
require {{WriteAheadLog.close()}} to be idempotent, and make its implementations 
behave accordingly? I could propose a PR for that instead, if it sounds okay.
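
A minimal sketch of what such an idempotent close() could look like, with assumed 
class and thread names rather than the real streaming code:

{code}
import java.util.concurrent.atomic.AtomicBoolean

// Minimal sketch with assumed names (not the real streaming classes): a
// write-ahead log that owns a daemon writer thread and whose close() is safe
// to call any number of times.
class SketchBatchedLog {
  private val closed = new AtomicBoolean(false)

  private val writerThread = new Thread(new Runnable {
    override def run(): Unit = { /* drain queued records and write batches */ }
  })
  writerThread.setDaemon(true)
  writerThread.setName("BatchedWriteAheadLog Writer (sketch)")
  writerThread.start()

  def close(): Unit = {
    // compareAndSet turns repeated close() calls into no-ops, so callers such
    // as ReceivedBlockTracker.stop() can invoke it unconditionally.
    if (closed.compareAndSet(false, true)) {
      writerThread.interrupt()
      writerThread.join()
    }
  }
}
{code}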

> A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating 
> StreamingContext.
> ---
>
> Key: SPARK-20935
> URL: https://issues.apache.org/jira/browse/SPARK-20935
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.3, 2.1.1
>Reporter: Terence Yim
>
> With the batched write ahead log on by default in the driver (SPARK-11731), if 
> there is no receiver-based {{InputDStream}}, the "BatchedWriteAheadLog Writer" 
> thread created by {{BatchedWriteAheadLog}} never gets shut down. 
> The root cause is in 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L168
> which never calls {{ReceivedBlockTracker.stop()}} (which in turn calls 
> {{BatchedWriteAheadLog.close()}}) if there is no receiver-based input.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20935) A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating StreamingContext.

2017-06-01 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034035#comment-16034035
 ] 

Hyukjin Kwon commented on SPARK-20935:
--

Thanks for pinging me. Could we just always stop() {{ReceivedBlockTracker}}, 
require {{WriteAheadLog.close()}} to be idempotent, and make its implementations 
behave accordingly? I could propose a PR for that instead, if it sounds okay.

> A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating 
> StreamingContext.
> ---
>
> Key: SPARK-20935
> URL: https://issues.apache.org/jira/browse/SPARK-20935
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.3, 2.1.1
>Reporter: Terence Yim
>
> With the batched write ahead log on by default in the driver (SPARK-11731), if 
> there is no receiver-based {{InputDStream}}, the "BatchedWriteAheadLog Writer" 
> thread created by {{BatchedWriteAheadLog}} never gets shut down. 
> The root cause is in 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L168
> which never calls {{ReceivedBlockTracker.stop()}} (which in turn calls 
> {{BatchedWriteAheadLog.close()}}) if there is no receiver-based input.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-20935) A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating StreamingContext.

2017-06-01 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-20935:
-
Comment: was deleted

(was: Thanks for pinging me. Could we just always stop() 
{{ReceivedBlockTracker}} and require {{WriteAheadLog.close()}} to be idempotent 
and make its implementations as so? I could propose a PR if it sounds okay 
instead.)

> A daemon thread, "BatchedWriteAheadLog Writer", left behind after terminating 
> StreamingContext.
> ---
>
> Key: SPARK-20935
> URL: https://issues.apache.org/jira/browse/SPARK-20935
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.3, 2.1.1
>Reporter: Terence Yim
>
> With the batched write ahead log on by default in the driver (SPARK-11731), if 
> there is no receiver-based {{InputDStream}}, the "BatchedWriteAheadLog Writer" 
> thread created by {{BatchedWriteAheadLog}} never gets shut down. 
> The root cause is in 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L168
> which never calls {{ReceivedBlockTracker.stop()}} (which in turn calls 
> {{BatchedWriteAheadLog.close()}}) if there is no receiver-based input.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20943) Correct BypassMergeSortShuffleWriter's comment

2017-06-01 Thread CanBin Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033988#comment-16033988
 ] 

CanBin Zheng edited comment on SPARK-20943 at 6/2/17 1:11 AM:
--

Look at these two cases.
{code}
  // Has Aggregator defined
  @Test
  def testGroupByKeyUsingBypassMergeSort(): Unit = {
    val data = List("Hello", "World", "Hello", "One", "Two")
    val rdd = sc.parallelize(data).map((_, 1)).groupByKey(2)
    rdd.collect()
  }

  // Has Ordering defined
  @Test
  def testShuffleWithKeyOrderingUsingBypassMergeSort(): Unit = {
    val data = List("Hello", "World", "Hello", "One", "Two")
    val rdd = sc.parallelize(data).map((_, 1))
    val ord = implicitly[Ordering[String]]
    val shuffledRDD = new ShuffledRDD[String, Int, Int](rdd, new HashPartitioner(2)).setKeyOrdering(ord)
    shuffledRDD.collect()
  }
{code}


was (Author: canbinzheng):
Look at there two cases.
 
 `//Has Aggregator defined
  @Test
  def testGroupByKeyUsingBypassMergeSort(): Unit = {
val data = List("Hello", "World", "Hello", "One", "Two")
val rdd = sc.parallelize(data).map((_, 1)).groupByKey(2)
rdd.collect()
  }

  //Has Ordering defined
  @Test
  def testShuffleWithKeyOrderingUsingBypassMergeSort(): Unit = {
val data = List("Hello", "World", "Hello", "One", "Two")
val rdd = sc.parallelize(data).map((_, 1))
val ord = implicitly[Ordering[String]]
val shuffledRDD = new ShuffledRDD[String, Int, Int](rdd, new 
HashPartitioner(2)).setKeyOrdering(ord)
shuffledRDD.collect()
  }`

> Correct BypassMergeSortShuffleWriter's comment
> --
>
> Key: SPARK-20943
> URL: https://issues.apache.org/jira/browse/SPARK-20943
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Shuffle
>Affects Versions: 2.1.1
>Reporter: CanBin Zheng
>Priority: Trivial
>  Labels: starter
>
> There are comments in BypassMergeSortShuffleWriter.java about when this write 
> path is selected; the three required conditions are described as follows:
> 1. no Ordering is specified, and
> 2. no Aggregator is specified, and
> 3. the number of partitions is less than 
>  spark.shuffle.sort.bypassMergeThreshold
> The conditions as written are partially wrong and misleading; the right 
> conditions should be:
> 1. map-side combine is false, and
> 2. the number of partitions is less than 
>  spark.shuffle.sort.bypassMergeThreshold



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20943) Correct BypassMergeSortShuffleWriter's comment

2017-06-01 Thread CanBin Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033988#comment-16033988
 ] 

CanBin Zheng edited comment on SPARK-20943 at 6/2/17 1:09 AM:
--

Look at these two cases.
 
 `//Has Aggregator defined
  @Test
  def testGroupByKeyUsingBypassMergeSort(): Unit = {
val data = List("Hello", "World", "Hello", "One", "Two")
val rdd = sc.parallelize(data).map((_, 1)).groupByKey(2)
rdd.collect()
  }

  //Has Ordering defined
  @Test
  def testShuffleWithKeyOrderingUsingBypassMergeSort(): Unit = {
val data = List("Hello", "World", "Hello", "One", "Two")
val rdd = sc.parallelize(data).map((_, 1))
val ord = implicitly[Ordering[String]]
val shuffledRDD = new ShuffledRDD[String, Int, Int](rdd, new 
HashPartitioner(2)).setKeyOrdering(ord)
shuffledRDD.collect()
  }`


was (Author: canbinzheng):
Look at there two cases.
 
 //Has Aggregator defined
  @Test
  def testGroupByKeyUsingBypassMergeSort(): Unit = {
val data = List("Hello", "World", "Hello", "One", "Two")
val rdd = sc.parallelize(data).map((_, 1)).groupByKey(2)
rdd.collect()
  }

  //Has Ordering defined
  @Test
  def testShuffleWithKeyOrderingUsingBypassMergeSort(): Unit = {
val data = List("Hello", "World", "Hello", "One", "Two")
val rdd = sc.parallelize(data).map((_, 1))
val ord = implicitly[Ordering[String]]
val shuffledRDD = new ShuffledRDD[String, Int, Int](rdd, new 
HashPartitioner(2)).setKeyOrdering(ord)
shuffledRDD.collect()
  }

> Correct BypassMergeSortShuffleWriter's comment
> --
>
> Key: SPARK-20943
> URL: https://issues.apache.org/jira/browse/SPARK-20943
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Shuffle
>Affects Versions: 2.1.1
>Reporter: CanBin Zheng
>Priority: Trivial
>  Labels: starter
>
> There are comments in BypassMergeSortShuffleWriter.java about when this write 
> path is selected; the three required conditions are described as follows:
> 1. no Ordering is specified, and
> 2. no Aggregator is specified, and
> 3. the number of partitions is less than 
>  spark.shuffle.sort.bypassMergeThreshold
> The conditions as written are partially wrong and misleading; the right 
> conditions should be:
> 1. map-side combine is false, and
> 2. the number of partitions is less than 
>  spark.shuffle.sort.bypassMergeThreshold



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20943) Correct BypassMergeSortShuffleWriter's comment

2017-06-01 Thread CanBin Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033988#comment-16033988
 ] 

CanBin Zheng commented on SPARK-20943:
--

Look at these two cases.
 
 //Has Aggregator defined
  @Test
  def testGroupByKeyUsingBypassMergeSort(): Unit = {
val data = List("Hello", "World", "Hello", "One", "Two")
val rdd = sc.parallelize(data).map((_, 1)).groupByKey(2)
rdd.collect()
  }

  //Has Ordering defined
  @Test
  def testShuffleWithKeyOrderingUsingBypassMergeSort(): Unit = {
val data = List("Hello", "World", "Hello", "One", "Two")
val rdd = sc.parallelize(data).map((_, 1))
val ord = implicitly[Ordering[String]]
val shuffledRDD = new ShuffledRDD[String, Int, Int](rdd, new 
HashPartitioner(2)).setKeyOrdering(ord)
shuffledRDD.collect()
  }

> Correct BypassMergeSortShuffleWriter's comment
> --
>
> Key: SPARK-20943
> URL: https://issues.apache.org/jira/browse/SPARK-20943
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Shuffle
>Affects Versions: 2.1.1
>Reporter: CanBin Zheng
>Priority: Trivial
>  Labels: starter
>
> There are comments in BypassMergeSortShuffleWriter.java about when this write 
> path is selected; the three required conditions are described as follows:
> 1. no Ordering is specified, and
> 2. no Aggregator is specified, and
> 3. the number of partitions is less than 
>  spark.shuffle.sort.bypassMergeThreshold
> The conditions as written are partially wrong and misleading; the right 
> conditions should be:
> 1. map-side combine is false, and
> 2. the number of partitions is less than 
>  spark.shuffle.sort.bypassMergeThreshold
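
For reference, a simplified sketch of the corrected selection rule described 
above (Spark's actual check lives in SortShuffleWriter.shouldBypassMergeSort; 
this illustration is not the production code):

{code}
import org.apache.spark.{ShuffleDependency, SparkConf}

// Simplified illustration of the corrected rule: bypass-merge-sort is chosen
// only when there is no map-side combine and the partition count is at most
// spark.shuffle.sort.bypassMergeThreshold (200 by default).
def bypassMergeSortChosen(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
  if (dep.mapSideCombine) {
    false
  } else {
    val threshold = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
    dep.partitioner.numPartitions <= threshold
  }
}
{code}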



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20958:


Assignee: Apache Spark  (was: Cheng Lian)

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> As a result, Spark 2.2.0-rc2 introduced two incompatible versions of avro (1.7.7 and 
> 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the reasons 
> mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20958:


Assignee: Cheng Lian  (was: Apache Spark)

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> As a result, Spark 2.2.0-rc2 introduced two incompatible versions of avro (1.7.7 and 
> 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the reasons 
> mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033987#comment-16033987
 ] 

Apache Spark commented on SPARK-20958:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/18181

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> As a result, Spark 2.2.0-rc2 introduced two incompatible versions of avro (1.7.7 and 
> 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the reasons 
> mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-01 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-20958:
--

 Summary: Roll back parquet-mr 1.8.2 to parquet-1.8.1
 Key: SPARK-20958
 URL: https://issues.apache.org/jira/browse/SPARK-20958
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Cheng Lian
Assignee: Cheng Lian


We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 and 
avro 1.7.7 used by spark-core 2.2.0-rc2.

As a result, Spark 2.2.0-rc2 introduced two incompatible versions of avro (1.7.7 and 
1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the reasons 
mentioned in [PR 
#17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
Therefore, we don't really have many choices here and have to roll back 
parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.
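
For downstream builds that hit the same conflict before the rollback is 
released, one possible mitigation (illustrative only, not the fix adopted in 
this ticket) is to pin parquet back to 1.8.1 in their own build, for example 
with sbt 0.13:

{code}
// build.sbt fragment (sbt 0.13 syntax) -- an illustrative downstream workaround
// only; the actual fix here rolls parquet-mr back inside Spark's own build.
dependencyOverrides ++= Set(
  "org.apache.parquet" % "parquet-hadoop" % "1.8.1",
  "org.apache.parquet" % "parquet-column" % "1.8.1"
)
{code}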



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20957) Flaky Test: o.a.s.sql.streaming.StreamingQueryManagerSuite listing

2017-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20957:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Flaky Test: o.a.s.sql.streaming.StreamingQueryManagerSuite listing
> --
>
> Key: SPARK-20957
> URL: https://issues.apache.org/jira/browse/SPARK-20957
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@74d70cd4 did 
> not equal null
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(StreamingQueryManagerSuite.scala:82)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(StreamingQueryManagerSuite.scala:61)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$org$apache$spark$sql$streaming$StreamingQueryManagerSuite$$withQueriesOn$1.apply$mcV$sp(StreamingQueryManagerSuite.scala:268)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$org$apache$spark$sql$streaming$StreamingQueryManagerSuite$$withQueriesOn$1.apply(StreamingQueryManagerSuite.scala:244)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$org$apache$spark$sql$streaming$StreamingQueryManagerSuite$$withQueriesOn$1.apply(StreamingQueryManagerSuite.scala:244)
>   at 
> org.scalatest.concurrent.Timeouts$class.timeoutAfter(Timeouts.scala:326)
>   at org.scalatest.concurrent.Timeouts$class.failAfter(Timeouts.scala:245)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfter(StreamingQueryManagerSuite.scala:39)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.org$apache$spark$sql$streaming$StreamingQueryManagerSuite$$withQueriesOn(StreamingQueryManagerSuite.scala:244)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3.apply$mcV$sp(StreamingQueryManagerSuite.scala:61)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3.apply(StreamingQueryManagerSuite.scala:56)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3.apply(StreamingQueryManagerSuite.scala:56)
>   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:310)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:310)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:310)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StreamingQueryManagerSuite.scala:39)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingQueryManagerSuite.scala:39)
>   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.runTest(StreamingQueryManagerSuite.scala:39)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1

[jira] [Assigned] (SPARK-20957) Flaky Test: o.a.s.sql.streaming.StreamingQueryManagerSuite listing

2017-06-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20957:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Flaky Test: o.a.s.sql.streaming.StreamingQueryManagerSuite listing
> --
>
> Key: SPARK-20957
> URL: https://issues.apache.org/jira/browse/SPARK-20957
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@74d70cd4 did 
> not equal null
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(StreamingQueryManagerSuite.scala:82)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(StreamingQueryManagerSuite.scala:61)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$org$apache$spark$sql$streaming$StreamingQueryManagerSuite$$withQueriesOn$1.apply$mcV$sp(StreamingQueryManagerSuite.scala:268)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$org$apache$spark$sql$streaming$StreamingQueryManagerSuite$$withQueriesOn$1.apply(StreamingQueryManagerSuite.scala:244)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$org$apache$spark$sql$streaming$StreamingQueryManagerSuite$$withQueriesOn$1.apply(StreamingQueryManagerSuite.scala:244)
>   at 
> org.scalatest.concurrent.Timeouts$class.timeoutAfter(Timeouts.scala:326)
>   at org.scalatest.concurrent.Timeouts$class.failAfter(Timeouts.scala:245)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfter(StreamingQueryManagerSuite.scala:39)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.org$apache$spark$sql$streaming$StreamingQueryManagerSuite$$withQueriesOn(StreamingQueryManagerSuite.scala:244)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3.apply$mcV$sp(StreamingQueryManagerSuite.scala:61)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3.apply(StreamingQueryManagerSuite.scala:56)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3.apply(StreamingQueryManagerSuite.scala:56)
>   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:310)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:310)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:310)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StreamingQueryManagerSuite.scala:39)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingQueryManagerSuite.scala:39)
>   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.runTest(StreamingQueryManagerSuite.scala:39)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1

[jira] [Commented] (SPARK-20957) Flaky Test: o.a.s.sql.streaming.StreamingQueryManagerSuite listing

2017-06-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033955#comment-16033955
 ] 

Apache Spark commented on SPARK-20957:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/18180

> Flaky Test: o.a.s.sql.streaming.StreamingQueryManagerSuite listing
> --
>
> Key: SPARK-20957
> URL: https://issues.apache.org/jira/browse/SPARK-20957
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@74d70cd4 did 
> not equal null
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(StreamingQueryManagerSuite.scala:82)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(StreamingQueryManagerSuite.scala:61)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$org$apache$spark$sql$streaming$StreamingQueryManagerSuite$$withQueriesOn$1.apply$mcV$sp(StreamingQueryManagerSuite.scala:268)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$org$apache$spark$sql$streaming$StreamingQueryManagerSuite$$withQueriesOn$1.apply(StreamingQueryManagerSuite.scala:244)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$org$apache$spark$sql$streaming$StreamingQueryManagerSuite$$withQueriesOn$1.apply(StreamingQueryManagerSuite.scala:244)
>   at 
> org.scalatest.concurrent.Timeouts$class.timeoutAfter(Timeouts.scala:326)
>   at org.scalatest.concurrent.Timeouts$class.failAfter(Timeouts.scala:245)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfter(StreamingQueryManagerSuite.scala:39)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.org$apache$spark$sql$streaming$StreamingQueryManagerSuite$$withQueriesOn(StreamingQueryManagerSuite.scala:244)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3.apply$mcV$sp(StreamingQueryManagerSuite.scala:61)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3.apply(StreamingQueryManagerSuite.scala:56)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3.apply(StreamingQueryManagerSuite.scala:56)
>   at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:310)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:310)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:310)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StreamingQueryManagerSuite.scala:39)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingQueryManagerSuite.scala:39)
>   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.runTest(StreamingQueryManagerSuite.scala:39)
>   at 
> org.scalatest.FunSuiteLike$$anonf

[jira] [Created] (SPARK-20957) Flaky Test: o.a.s.sql.streaming.StreamingQueryManagerSuite listing

2017-06-01 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-20957:


 Summary: Flaky Test: 
o.a.s.sql.streaming.StreamingQueryManagerSuite listing
 Key: SPARK-20957
 URL: https://issues.apache.org/jira/browse/SPARK-20957
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


{code}
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@74d70cd4 did not 
equal null
at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
at 
org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(StreamingQueryManagerSuite.scala:82)
at 
org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(StreamingQueryManagerSuite.scala:61)
at 
org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$org$apache$spark$sql$streaming$StreamingQueryManagerSuite$$withQueriesOn$1.apply$mcV$sp(StreamingQueryManagerSuite.scala:268)
at 
org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$org$apache$spark$sql$streaming$StreamingQueryManagerSuite$$withQueriesOn$1.apply(StreamingQueryManagerSuite.scala:244)
at 
org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$org$apache$spark$sql$streaming$StreamingQueryManagerSuite$$withQueriesOn$1.apply(StreamingQueryManagerSuite.scala:244)
at 
org.scalatest.concurrent.Timeouts$class.timeoutAfter(Timeouts.scala:326)
at org.scalatest.concurrent.Timeouts$class.failAfter(Timeouts.scala:245)
at 
org.apache.spark.sql.streaming.StreamingQueryManagerSuite.failAfter(StreamingQueryManagerSuite.scala:39)
at 
org.apache.spark.sql.streaming.StreamingQueryManagerSuite.org$apache$spark$sql$streaming$StreamingQueryManagerSuite$$withQueriesOn(StreamingQueryManagerSuite.scala:244)
at 
org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3.apply$mcV$sp(StreamingQueryManagerSuite.scala:61)
at 
org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3.apply(StreamingQueryManagerSuite.scala:56)
at 
org.apache.spark.sql.streaming.StreamingQueryManagerSuite$$anonfun$3.apply(StreamingQueryManagerSuite.scala:56)
at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
at 
org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply$mcV$sp(SQLTestUtils.scala:310)
at 
org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:310)
at 
org.apache.spark.sql.test.SQLTestUtils$$anonfun$testQuietly$1.apply(SQLTestUtils.scala:310)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
at 
org.apache.spark.sql.streaming.StreamingQueryManagerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StreamingQueryManagerSuite.scala:39)
at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
at 
org.apache.spark.sql.streaming.StreamingQueryManagerSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingQueryManagerSuite.scala:39)
at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
at 
org.apache.spark.sql.streaming.StreamingQueryManagerSuite.runTest(StreamingQueryManagerSuite.scala:39)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.scal

[jira] [Resolved] (SPARK-19150) completely support using hive as data source to create tables

2017-06-01 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19150.
-
  Resolution: Fixed
   Fix Version/s: 2.2.0
Target Version/s:   (was: 2.3.0)

all sub-tasks are done

> completely support using hive as data source to create tables
> -
>
> Key: SPARK-19150
> URL: https://issues.apache.org/jira/browse/SPARK-19150
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 2.2.0
>
>
> After SPARK-19107, we can now treat Hive as a data source and create Hive 
> tables with DataFrameWriter and Catalog. However, the support is not 
> complete; there are still some cases we do not support.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17203) data source options should always be case insensitive

2017-06-01 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17203.
-
  Resolution: Fixed
   Fix Version/s: 2.2.0
Target Version/s:   (was: 2.3.0)

this is already fixed by other PRs

> data source options should always be case insensitive
> -
>
> Key: SPARK-17203
> URL: https://issues.apache.org/jira/browse/SPARK-17203
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20520) R streaming tests failed on Windows

2017-06-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033918#comment-16033918
 ] 

Dongjoon Hyun edited comment on SPARK-20520 at 6/1/17 11:46 PM:


Hi, [~felixcheung].
-Is this still targeting 2.2.0?-
Oh, I see. It should be tested on RC.


was (Author: dongjoon):
Hi, [~felixcheung].
Is this still targeting 2.2.0?

> R streaming tests failed on Windows
> ---
>
> Key: SPARK-20520
> URL: https://issues.apache.org/jira/browse/SPARK-20520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Critical
>
> Running R CMD check on SparkR 2.2 RC1 packages 
> {code}
> Failed 
> -
> 1. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#56) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 2. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#60) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 6.
> 1/1 mismatches
> [1] 3 - 6 == -3
> 3. Failure: print from explain, lastProgress, status, isActive 
> (@test_streaming.R#75) 
> any(grepl("\"description\" : \"MemorySink\"", 
> capture.output(lastProgress(q isn't true.
> 4. Failure: Stream other format (@test_streaming.R#95) 
> -
> head(sql("SELECT count(*) FROM people3"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 5. Failure: Stream other format (@test_streaming.R#98) 
> -
> any(...) isn't true.
> {code}
> Need to investigate



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20520) R streaming tests failed on Windows

2017-06-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033918#comment-16033918
 ] 

Dongjoon Hyun commented on SPARK-20520:
---

Hi, [~felixcheung].
Is this still targeting 2.2.0?

> R streaming tests failed on Windows
> ---
>
> Key: SPARK-20520
> URL: https://issues.apache.org/jira/browse/SPARK-20520
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Critical
>
> Running R CMD check on SparkR 2.2 RC1 packages 
> {code}
> Failed 
> -
> 1. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#56) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 2. Failure: read.stream, write.stream, awaitTermination, stopQuery 
> (@test_streaming.R#60) 
> head(sql("SELECT count(*) FROM people"))[[1]] not equal to 6.
> 1/1 mismatches
> [1] 3 - 6 == -3
> 3. Failure: print from explain, lastProgress, status, isActive 
> (@test_streaming.R#75) 
> any(grepl("\"description\" : \"MemorySink\"", 
> capture.output(lastProgress(q isn't true.
> 4. Failure: Stream other format (@test_streaming.R#95) 
> -
> head(sql("SELECT count(*) FROM people3"))[[1]] not equal to 3.
> 1/1 mismatches
> [1] 0 - 3 == -3
> 5. Failure: Stream other format (@test_streaming.R#98) 
> -
> any(...) isn't true.
> {code}
> Need to investigate



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-01 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033915#comment-16033915
 ] 

Shixiong Zhu edited comment on SPARK-20952 at 6/1/17 11:43 PM:
---

InheritableThreadLocal only works when a new thread is created. Here you are 
talking about thread pools: a reused thread may see the wrong TaskContext if 
it is an InheritableThreadLocal.


was (Author: zsxwing):
InheritableThreadLocal only works when creating a new thread. Here you were 
talking about thread pools.

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal; as a result, when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to an InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for an example of code that uses 
> thread pools inside tasks.
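
The point in the comment above about thread pools can be shown with plain JVM 
code; this small, self-contained sketch (no Spark classes involved) demonstrates 
that an InheritableThreadLocal is only copied when a thread is created, so a 
pooled worker created earlier never sees the value:

{code}
import java.util.concurrent.Executors

// Self-contained JVM-only sketch: an InheritableThreadLocal is copied into a
// thread at creation time, so a pooled worker that was created earlier never
// sees a value set afterwards.
object InheritableThreadLocalDemo {
  private val ctx = new InheritableThreadLocal[String]()

  def main(args: Array[String]): Unit = {
    val pool = Executors.newSingleThreadExecutor()
    // Force the pool to create its single worker before any value is set.
    pool.submit(new Runnable { override def run(): Unit = () }).get()

    ctx.set("task-context-1")

    val child = new Thread(new Runnable {
      override def run(): Unit = println(s"fresh thread sees: ${ctx.get}")   // task-context-1
    })
    child.start()
    child.join()

    pool.submit(new Runnable {
      override def run(): Unit = println(s"pooled worker sees: ${ctx.get}")  // null
    }).get()

    pool.shutdown()
  }
}
{code}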



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-01 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033915#comment-16033915
 ] 

Shixiong Zhu commented on SPARK-20952:
--

InheritableThreadLocal only works when a new thread is created. Here you are 
talking about thread pools.

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal; as a result, when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to an InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for an example of code that uses 
> thread pools inside tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20025) Driver fail over will not work, if SPARK_LOCAL* env is set.

2017-06-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033912#comment-16033912
 ] 

Dongjoon Hyun commented on SPARK-20025:
---

Hi, [~scrapco...@gmail.com].
Could you adjust the target version here?

> Driver fail over will not work, if SPARK_LOCAL* env is set.
> ---
>
> Key: SPARK-20025
> URL: https://issues.apache.org/jira/browse/SPARK-20025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Prashant Sharma
>
> In a bare-metal system with no DNS setup, Spark may be configured with 
> SPARK_LOCAL* for IP and host properties.
> During a driver failover in cluster deployment mode, SPARK_LOCAL* should be 
> ignored while auto-deploying and should be picked up from the target system's 
> local environment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20129) JavaSparkContext should use SparkContext.getOrCreate

2017-06-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033911#comment-16033911
 ] 

Dongjoon Hyun commented on SPARK-20129:
---

Hi, [~mengxr].
Is this resolved at 2.2.0?

> JavaSparkContext should use SparkContext.getOrCreate
> 
>
> Key: SPARK-20129
> URL: https://issues.apache.org/jira/browse/SPARK-20129
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> It should re-use an existing SparkContext if there is a live one.
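
Until that lands, a caller can get the same effect by hand; a minimal sketch 
(the app name and master below are placeholders):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.api.java.JavaSparkContext

// Sketch of the behaviour this ticket asks for, done on the caller side:
// reuse a live SparkContext instead of constructing a second one.
val conf = new SparkConf().setAppName("example").setMaster("local[*]")
val jsc = new JavaSparkContext(SparkContext.getOrCreate(conf))
{code}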



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19035) rand() function in case when cause failed

2017-06-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033908#comment-16033908
 ] 

Dongjoon Hyun commented on SPARK-19035:
---

If this is not resolved at 2.2.0, shall we remove the target version 2.2.0 here?

> rand() function in case when cause failed
> -
>
> Key: SPARK-19035
> URL: https://issues.apache.org/jira/browse/SPARK-19035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Feng Yuan
>
> *In this case:*
>select 
>case when a=1 then 1 else concat(a,cast(rand() as 
> string)) end b,count(1) 
>from 
>yuanfeng1_a 
>group by 
>case when a=1 then 1 else concat(a,cast(rand() as 
> string)) end;
> *Error thrown:*
> Error in query: expression 'yuanfeng1_a.`a`' is neither present in the group 
> by, nor is it an aggregate function. Add to group by or wrap in first() (or 
> first_value) if you don't care which value you get.;;
> Aggregate [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE 
> concat(cast(a#2075 as string), cast(rand(519367429988179997) as string)) 
> END], [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE concat(cast(a#2075 
> as string), cast(rand(8090243936131101651) as string)) END AS b#2074]
> +- MetastoreRelation default, yuanfeng1_a
> select case when a=1 then 1 else rand() end b,count(1) from yuanfeng1_a group 
> by case when a=1 then rand() end also outputs this error.
> *Notice*:
> If rand() is replaced with 1, it works.
> A simpler way to reproduce this bug: `SELECT a + rand() FROM t GROUP BY a + 
> rand()`.
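
Until the analyzer handles this, one possible workaround (illustrative only; the 
table and column names come from the report above, and an existing SparkSession 
named spark is assumed) is to evaluate the non-deterministic expression once in a 
subquery and group by the resulting column:

{code}
// Illustrative workaround sketch: evaluating the CASE/rand() expression once in
// a subquery makes the SELECT list and the GROUP BY refer to the same attribute
// `b`, sidestepping the re-instantiated rand() seeds shown in the plan above.
val grouped = spark.sql("""
  SELECT b, count(1) AS cnt
  FROM (
    SELECT CASE WHEN a = 1 THEN '1'
                ELSE concat(a, cast(rand() AS string)) END AS b
    FROM yuanfeng1_a
  ) t
  GROUP BY b
""")
grouped.show()
{code}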



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18451) Always set -XX:+HeapDumpOnOutOfMemoryError for Spark tests

2017-06-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033904#comment-16033904
 ] 

Dongjoon Hyun commented on SPARK-18451:
---

Hi, [~lian cheng].
Shall we remove the target version here if Jenkins doesn't have these options?

> Always set -XX:+HeapDumpOnOutOfMemoryError for Spark tests
> --
>
> Key: SPARK-18451
> URL: https://issues.apache.org/jira/browse/SPARK-18451
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Reporter: Cheng Lian
>
> It would be nice if we always set {{-XX:+HeapDumpOnOutOfMemoryError}} and 
> {{-XX:HeapDumpPath}} for open source Spark tests, so that it would be easier 
> to investigate issues like SC-5041.
> Note:
> - We need to ensure that the heap dumps are stored in a location on Jenkins 
> that won't be automatically cleaned up.
> It would be nice to be able to customize the heap dump output 
> paths on a per-build basis so that it's easier to find the heap dump file of 
> any given build.
> The 2nd point is optional since we can probably identify wanted heap dump 
> files by looking at the creation timestamp.
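
A minimal sketch of the flags for an sbt-based test run (the flag spellings are 
standard HotSpot options; the dump path and the exact place the build injects 
these options are assumptions, per the notes above):

{code}
// build.sbt fragment (sbt 0.13 syntax) -- illustrative only. javaOptions apply
// to forked test JVMs, hence the fork setting; the dump path is a placeholder
// that would need to survive Jenkins workspace cleanup.
fork in Test := true
javaOptions in Test ++= Seq(
  "-XX:+HeapDumpOnOutOfMemoryError",
  "-XX:HeapDumpPath=/tmp/spark-test-heap-dumps"
)
{code}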



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors

2017-06-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033900#comment-16033900
 ] 

Dongjoon Hyun commented on SPARK-17637:
---

Shall we remove the target version 2.2.0 here?

> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Zhan Zhang
>Assignee: Zhan Zhang
>Priority: Minor
>
> Currently the Spark scheduler assigns tasks to executors in a round-robin 
> fashion, which is great as it distributes the load evenly across the 
> cluster, but it leads to significant resource waste in some cases, 
> especially when dynamic allocation is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20894) Error while checkpointing to HDFS

2017-06-01 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-20894:


Assignee: Shixiong Zhu

> Error while checkpointing to HDFS
> -
>
> Key: SPARK-20894
> URL: https://issues.apache.org/jira/browse/SPARK-20894
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>Reporter: kant kodali
>Assignee: Shixiong Zhu
> Fix For: 2.3.0
>
> Attachments: driver_info_log, executor1_log, executor2_log
>
>
> Dataset<Row> df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 
> hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new 
> KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
> For some reason this fails with the error: 
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = 
> (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
> I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/  and all 
> consumer offsets in Kafka from all brokers prior to running and yet this 
> error still persists. 
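Not the fix tracked by this ticket, but a sketch of a common user-side mitigation: pass a fully qualified HDFS URI as the checkpoint location so the state store files are not resolved against the local file system. The namenode address is a placeholder, and df2/KafkaSink are the reporter's objects rendered in Scala.
{code}
// Sketch only (placeholder namenode address; df2 and KafkaSink come from the report above).
val query = df2.writeStream
  .foreach(new KafkaSink())
  .option("checkpointLocation", "hdfs://namenode:8020/usr/local/hadoop/checkpoint")
  .outputMode("update")
  .start()
query.awaitTermination()
{code}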



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20894) Error while checkpointing to HDFS

2017-06-01 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20894:
-
Fix Version/s: 2.3.0

> Error while checkpointing to HDFS
> -
>
> Key: SPARK-20894
> URL: https://issues.apache.org/jira/browse/SPARK-20894
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>Reporter: kant kodali
>Assignee: Shixiong Zhu
> Fix For: 2.3.0
>
> Attachments: driver_info_log, executor1_log, executor2_log
>
>
> Dataset<Row> df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 
> hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new 
> KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
> For some reason this fails with the error: 
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = 
> (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
> I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/  and all 
> consumer offsets in Kafka from all brokers prior to running and yet this 
> error still persists. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15352) Topology aware block replication

2017-06-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033899#comment-16033899
 ] 

Dongjoon Hyun commented on SPARK-15352:
---

Hi, [~shubhamc].
Can we resolve this issue for 2.2.0 right now?
The ongoing PR seems to be a documentation-only change.

> Topology aware block replication
> 
>
> Key: SPARK-15352
> URL: https://issues.apache.org/jira/browse/SPARK-15352
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Mesos, Spark Core, YARN
>Reporter: Shubham Chopra
>Assignee: Shubham Chopra
>
> With cached RDDs, Spark can be used for online analytics where it is used to 
> respond to online queries. But loss of RDD partitions due to node/executor 
> failures can cause huge delays in such use cases as the data would have to be 
> regenerated.
> Cached RDDs, even when using multiple replicas per block, are not currently 
> resilient to node failures when multiple executors are started on the same 
> node. Block replication currently chooses a peer at random, and this peer 
> could also exist on the same host. 
> This effort would add topology aware replication to Spark that can be enabled 
> with pluggable strategies. For ease of development/review, this is being 
> broken down to three major work-efforts:
> 1.Making peer selection for replication pluggable
> 2.Providing pluggable implementations for providing topology and topology 
> aware replication
> 3.Pro-active replenishment of lost blocks
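As an illustration of what "pluggable" peer selection could look like (the interface and names below are invented for this sketch and are not Spark's actual block replication API):
{code}
// Hypothetical sketch only; not Spark's real API.
case class Peer(host: String, rack: Option[String], executorId: String)

trait ReplicationPeerSelector {
  /** Rank candidate peers; the block manager would replicate to the first ones returned. */
  def prioritize(self: Peer, candidates: Seq[Peer]): Seq[Peer]
}

object RackAwareSelector extends ReplicationPeerSelector {
  // Prefer peers on a different host, then on a different rack, to survive node failures.
  def prioritize(self: Peer, candidates: Seq[Peer]): Seq[Peer] =
    candidates.sortBy(p => (p.host == self.host, p.rack == self.rack))
}
{code}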



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20894) Error while checkpointing to HDFS

2017-06-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033898#comment-16033898
 ] 

Apache Spark commented on SPARK-20894:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/18179

> Error while checkpointing to HDFS
> -
>
> Key: SPARK-20894
> URL: https://issues.apache.org/jira/browse/SPARK-20894
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>Reporter: kant kodali
> Attachments: driver_info_log, executor1_log, executor2_log
>
>
> Dataset<Row> df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 
> hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new 
> KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
> For some reason this fails with the error: 
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = 
> (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
> I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/  and all 
> consumer offsets in Kafka from all brokers prior to running and yet this 
> error still persists. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2017-06-01 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15044:
--
Target Version/s:   (was: 2.2.0)

> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a 
> partition which exists in the Hive table, but whose path has been removed 
> manually. The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p 
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop 
> fs -rmr /warehouse//test/p=1"
> 4) Run Spark SQL: spark-sql -e "select n from test where p='1';"
> Then it throws exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is in Spark 1.6.1; if I use Spark 1.4.0, it is OK.
> I think spark-sql should ignore the path, just like Hive does (and as Spark did in 
> earlier versions), rather than throw an exception.
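One possible workaround, sketched here (assuming a SparkSession spark and the table from the reproduction above): drop the stale partition so the metastore and the file system agree again.
{code}
// Workaround sketch (assumptions: SparkSession `spark`, table `test` from the steps above).
spark.sql("ALTER TABLE test DROP IF EXISTS PARTITION (p = '1')")
spark.sql("SELECT n FROM test WHERE p = '1'").show()  // returns no rows instead of failing
{code}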



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15044) spark-sql will throw "input path does not exist" exception if it handles a partition which exists in hive table, but the path is removed manually

2017-06-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033895#comment-16033895
 ] 

Dongjoon Hyun commented on SPARK-15044:
---

Hi, All.
According to SPARK-10198, that option seems to have been deprecated by [~marmbrus] due 
to correctness issues.
IMO, let's remove the target version, 2.2.0, here; this issue cannot be 
bypassed by that option.


> spark-sql will throw "input path does not exist" exception if it handles a 
> partition which exists in hive table, but the path is removed manually
> -
>
> Key: SPARK-15044
> URL: https://issues.apache.org/jira/browse/SPARK-15044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: huangyu
>
> spark-sql will throw an "input path does not exist" exception if it handles a 
> partition which exists in the Hive table, but whose path has been removed 
> manually. The situation is as follows:
> 1) Create a table "test": "create table test (n string) partitioned by (p 
> string)"
> 2) Load some data into partition(p='1')
> 3) Remove the path related to partition(p='1') of table test manually: "hadoop 
> fs -rmr /warehouse//test/p=1"
> 4) Run Spark SQL: spark-sql -e "select n from test where p='1';"
> Then it throws exception:
> {code}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> ./test/p=1
> at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> {code}
> The bug is in Spark 1.6.1; if I use Spark 1.4.0, it is OK.
> I think spark-sql should ignore the path, just like Hive does (and as Spark did in 
> earlier versions), rather than throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20953) Add hash map metrics to aggregate and join

2017-06-01 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033891#comment-16033891
 ] 

Liang-Chi Hsieh commented on SPARK-20953:
-

[~rxin] Yeah, thanks for pinging me. I'll look into this.

> Add hash map metrics to aggregate and join
> --
>
> Key: SPARK-20953
> URL: https://issues.apache.org/jira/browse/SPARK-20953
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> It would be useful if we could identify hash map collision issues early on.
> We should add an average hash map probe metric to the aggregate and hash join 
> operators and report it. If the average probe count is greater than a specific 
> (configurable) threshold, we should log an error at runtime.
> The primary classes to look at are UnsafeFixedWidthAggregationMap, 
> HashAggregateExec, HashedRelation, HashJoin.
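A rough sketch of the idea only; the names and structure below are invented here, and a real implementation would hook into the classes listed above and report through SQL metrics and logging.
{code}
// Illustrative sketch, not the actual implementation.
class AvgProbeTracker(threshold: Double) {
  private var lookups = 0L
  private var probes = 0L

  def record(probesForThisLookup: Int): Unit = {
    lookups += 1
    probes += probesForThisLookup
    if (avgProbe > threshold) {
      // Real code would report via SQLMetrics / logError instead of println.
      println(s"avg hash map probe $avgProbe exceeded threshold $threshold")
    }
  }

  def avgProbe: Double = if (lookups == 0) 0.0 else probes.toDouble / lookups
}
{code}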



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark

2017-06-01 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033886#comment-16033886
 ] 

Dongjoon Hyun commented on SPARK-12661:
---

Hi, All.
Is it enough to resolve this issue? Or, do we need to change the target version 
from 2.2.0 to 2.3.0?

> Drop Python 2.6 support in PySpark
> --
>
> Key: SPARK-12661
> URL: https://issues.apache.org/jira/browse/SPARK-12661
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Reporter: Davies Liu
>  Labels: releasenotes
>
> 1. stop testing with 2.6
> 2. remove the code for python 2.6
> see discussion : 
> https://www.mail-archive.com/user@spark.apache.org/msg43423.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20922) Unsafe deserialization in Spark LauncherConnection

2017-06-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033877#comment-16033877
 ] 

Apache Spark commented on SPARK-20922:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/18178

> Unsafe deserialization in Spark LauncherConnection
> --
>
> Key: SPARK-20922
> URL: https://issues.apache.org/jira/browse/SPARK-20922
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Aditya Sharad
>Assignee: Marcelo Vanzin
>  Labels: security
> Fix For: 2.0.3, 2.1.2, 2.2.1, 2.3.0
>
> Attachments: spark-deserialize-master.zip
>
>
> The {{run()}} method of the class 
> {{org.apache.spark.launcher.LauncherConnection}} performs unsafe 
> deserialization of data received by its socket. This makes Spark applications 
> launched programmatically using the {{SparkLauncher}} framework potentially 
> vulnerable to remote code execution by an attacker with access to any user 
> account on the local machine. Such an attacker could send a malicious 
> serialized Java object to multiple ports on the local machine, and if this 
> port matches the one (randomly) chosen by the Spark launcher, the malicious 
> object will be deserialized. By making use of gadget chains in code present 
> on the Spark application classpath, the deserialization process can lead to 
> RCE or privilege escalation.
> This vulnerability is identified by the “Unsafe deserialization” rule on 
> lgtm.com:
> https://lgtm.com/projects/g/apache/spark/snapshot/80fdc2c9d1693f5b3402a79ca4ec76f6e422ff13/files/launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java#V58
>  
> Attached is a proof-of-concept exploit involving a simple 
> {{SparkLauncher}}-based application and a known gadget chain in the Apache 
> Commons Beanutils library referenced by Spark.
> See the readme file for demonstration instructions.
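For context, a generic mitigation sketch (this is not the patch in the pull request above): restrict which classes an ObjectInputStream may deserialize, so that attacker-controlled bytes cannot trigger arbitrary gadget chains. The allow-list is illustrative only.
{code}
// Generic look-ahead deserialization sketch; not Spark's actual fix.
import java.io.{InputStream, InvalidClassException, ObjectInputStream, ObjectStreamClass}

class RestrictedObjectInputStream(in: InputStream, allowed: Set[String])
  extends ObjectInputStream(in) {

  override protected def resolveClass(desc: ObjectStreamClass): Class[_] = {
    if (!allowed.contains(desc.getName)) {
      throw new InvalidClassException(desc.getName, "class not allowed for deserialization")
    }
    super.resolveClass(desc)
  }
}
{code}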



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15693) Write schema definition out for file-based data sources to avoid schema inference

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15693:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Write schema definition out for file-based data sources to avoid schema 
> inference
> -
>
> Key: SPARK-15693
> URL: https://issues.apache.org/jira/browse/SPARK-15693
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Spark supports reading a variety of data formats, many of which don't have a 
> self-describing schema. For these file formats, Spark can often infer the 
> schema by going through all the data. However, schema inference is expensive 
> and does not always infer the intended schema (for example, with JSON data 
> Spark always infers integer types as long, rather than int).
> It would be great if Spark could write the schema definition out for file-based 
> formats, and when reading the data in, the schema could be "inferred" directly by 
> reading the schema definition file without going through full schema 
> inference. If the file does not exist, then the good old schema inference 
> should be performed.
> This ticket certainly merits a design doc that should discuss the spec for 
> schema definition, as well as all the corner cases that this feature needs to 
> handle (e.g. schema merging, schema evolution, partitioning). It would be 
> great if the schema definition is using a human readable format (e.g. JSON).
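A user-level sketch of the same idea (not the built-in mechanism this ticket proposes): persist the inferred schema as JSON next to the data and reuse it on later reads to skip inference. Paths are placeholders, and actually writing the JSON file out is left to any file API.
{code}
// Sketch: persist the inferred schema as JSON and reuse it to skip inference later.
import org.apache.spark.sql.types.{DataType, StructType}

val df = spark.read.json("/data/events")        // one-time pass with full schema inference
val schemaJson = df.schema.json                 // serialize the schema (store it next to the data)

val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
val fast = spark.read.schema(schema).json("/data/events")   // no inference pass on this read
{code}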



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15380) Generate code that stores a float/double value in each column from ColumnarBatch when DataFrame.cache() is used

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15380:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Generate code that stores a float/double value in each column from 
> ColumnarBatch when DataFrame.cache() is used
> ---
>
> Key: SPARK-15380
> URL: https://issues.apache.org/jira/browse/SPARK-15380
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> When DataFrame.cache() is called, data will be stored as column-oriented 
> storage in CachedBatch. The current Catalyst generates a Java program that stores 
> a computed value into an InternalRow, and then the value is stored into a 
> CachedBatch, even if the data is read from a ColumnarBatch by the Parquet reader. 
> This JIRA generates Java code to store a value into a ColumnarBatch, and to 
> store data from the ColumnarBatch into the CachedBatch. This JIRA handles only 
> float and double types for a value.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19084) conditional function: field

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19084:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> conditional function: field
> ---
>
> Key: SPARK-19084
> URL: https://issues.apache.org/jira/browse/SPARK-19084
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Chenzhao Guo
>
> field(str, str1, str2, ...) is a variable-arity (>= 2 arguments) function which returns 
> the index of str in the list (str1, str2, ...), or 0 if it is not found.
> Every parameter is required to be a subtype of AtomicType.
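A behavioural sketch of the proposed semantics as a plain Scala helper (not the built-in expression this sub-task would add):
{code}
// 1-based index of the first match, 0 if str is not in the list.
def field(str: String, candidates: String*): Int =
  candidates.indexWhere(_ == str) + 1   // indexWhere returns -1 when absent, so this yields 0

field("b", "a", "b", "c")   // 2
field("z", "a", "b", "c")   // 0
{code}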



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15691) Refactor and improve Hive support

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15691:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Refactor and improve Hive support
> -
>
> Key: SPARK-15691
> URL: https://issues.apache.org/jira/browse/SPARK-15691
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Hive support is important to Spark SQL, as many Spark users use it to read 
> from Hive. The current architecture is very difficult to maintain, and this 
> ticket tracks progress towards getting us to a sane state.
> A number of things we want to accomplish are:
> - Move the Hive specific catalog logic into HiveExternalCatalog.
>   -- Remove HiveSessionCatalog. All Hive-related stuff should go into 
> HiveExternalCatalog. This would require moving caching either into 
> HiveExternalCatalog, or just into SessionCatalog.
>   -- Move using properties to store data source options into 
> HiveExternalCatalog (So, for a CatalogTable returned by HiveExternalCatalog, 
> we do not need to distinguish tables stored in hive formats and data source 
> tables).
>   -- Potentially more.
> - Remove Hive's specific ScriptTransform implementation and make it more 
> general so we can put it in sql/core.
> - Implement HiveTableScan (and write path) as a data source, so we don't need 
> a special planner rule for HiveTableScan.
> - Remove HiveSharedState and HiveSessionState.
> One thing that is still unclear to me is how to work with Hive UDF support. 
> We might still need a special planner rule there.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14878) Support Trim characters in the string trim function

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14878:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Support Trim characters in the string trim function
> ---
>
> Key: SPARK-14878
> URL: https://issues.apache.org/jira/browse/SPARK-14878
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: kevin yu
>
> The current Spark SQL does not support the trim characters in the string trim 
> function, which is part of ANSI SQL2003’s standard. For example, IBM DB2 
> fully supports it as shown in the 
> https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0023198.html.
>  We propose to implement it in this JIRA.
> The ANSI SQL2003's trim Syntax:
> {noformat}
> SQL
> <trim function> ::= TRIM <left paren> <trim operands> <right paren>
> <trim operands> ::= [ [ <trim specification> ] [ <trim character> ] FROM ] <trim source>
> <trim source> ::= <character value expression>
> <trim specification> ::=
>   LEADING
> | TRAILING
> | BOTH
> <trim character> ::= <character value expression>
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16496) Add wholetext as option for reading text in SQL.

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-16496:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Add wholetext as option for reading text in SQL.
> 
>
> Key: SPARK-16496
> URL: https://issues.apache.org/jira/browse/SPARK-16496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prashant Sharma
>
> In many text analysis problems, it is often not desirable for the rows to 
> be split by "\n". There exists a wholeText reader in the RDD API, and this JIRA 
> just adds the same support to the Dataset API.
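For comparison, a sketch of the existing RDD-level reader versus the reader option this ticket proposes (the "wholetext" option is the proposal here, not something guaranteed to exist in the version discussed; the path is a placeholder):
{code}
// Existing RDD API: one record per file as (path, fullContent).
val perFileRdd = spark.sparkContext.wholeTextFiles("/data/docs")

// Proposed Dataset-level equivalent (option name as proposed in this ticket).
val perFileDs = spark.read.option("wholetext", true).text("/data/docs")
{code}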



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19027) estimate size of object buffer for object hash aggregate

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19027:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> estimate size of object buffer for object hash aggregate
> 
>
> Key: SPARK-19027
> URL: https://issues.apache.org/jira/browse/SPARK-19027
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19104:
-
Target Version/s: 2.3.0  (was: 2.2.0)

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>  
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
>  
>   at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) 
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:196) 
>   at org.codehaus.c

[jira] [Updated] (SPARK-19241) remove hive generated table properties if they are not useful in Spark

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19241:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> remove hive generated table properties if they are not useful in Spark
> --
>
> Key: SPARK-19241
> URL: https://issues.apache.org/jira/browse/SPARK-19241
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> When we save a table into the Hive metastore, Hive will generate some table 
> properties automatically, e.g. transient_lastDdlTime, last_modified_by, 
> rawDataSize, etc. Some of them are useless in Spark SQL, so we should remove 
> them.
> It would be good if we could get the list of Hive-generated table properties via 
> a Hive API, so that we don't need to hardcode them.
> We can take a look at the Hive code to see how it excludes these auto-generated 
> table properties when describing a table.
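Purely as an illustration of the idea (the ticket explicitly prefers obtaining the list from a Hive API rather than hardcoding it; the set below only contains the example keys mentioned above):
{code}
// Illustrative sketch: drop known Hive-generated properties from a table's property map.
val hiveGenerated = Set("transient_lastDdlTime", "last_modified_by", "rawDataSize")

def stripHiveGeneratedProps(props: Map[String, String]): Map[String, String] =
  props.filterNot { case (key, _) => hiveGenerated.contains(key) }
{code}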



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16317) Add file filtering interface for FileFormat

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-16317:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Add file filtering interface for FileFormat
> ---
>
> Key: SPARK-16317
> URL: https://issues.apache.org/jira/browse/SPARK-16317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Priority: Minor
>
> {{FileFormat}} data sources like Parquet and Avro (provided by spark-avro) 
> have customized file filtering logics. For example, Parquet needs to filter 
> out summary files, while Avro provides a Hadoop configuration option to 
> filter out all files whose names don't end with ".avro".
> It would be nice to have a general file filtering interface in {{FileFormat}} 
> to handle similar requirements.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16011) SQL metrics include duplicated attempts

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-16011:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> SQL metrics include duplicated attempts
> ---
>
> Key: SPARK-16011
> URL: https://issues.apache.org/jira/browse/SPARK-16011
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Davies Liu
>Assignee: Wenchen Fan
>
> When I ran a simple scan-and-aggregate query, the number of rows reported for 
> the scan could differ from run to run. The actual scanned result is correct, but 
> the SQL metrics are wrong (they should not include duplicated attempts); this is a 
> regression since 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18388) Running aggregation on many columns throws SOE

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18388:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Running aggregation on many columns throws SOE
> --
>
> Key: SPARK-18388
> URL: https://issues.apache.org/jira/browse/SPARK-18388
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
> Environment: PySpark 2.0.1, Jupyter
>Reporter: Raviteja Lokineni
> Attachments: spark-bug.csv, spark-bug-jupyter.py, 
> spark-bug-stacktrace.txt
>
>
> Usecase: I am generating weekly aggregates of every column of data
> {code}
> from pyspark.sql.window import Window
> from pyspark.sql.functions import *
> timeSeries = sqlContext.read.option("header", 
> "true").format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").load("file:///tmp/spark-bug.csv")
> # Hive timestamp is interpreted as UNIX timestamp in seconds*
> days = lambda i: i * 86400
> w = (Window()
>  .partitionBy("id")
>  .orderBy(col("dt").cast("timestamp").cast("long"))
>  .rangeBetween(-days(6), 0))
> cols = ["id", "dt"]
> skipCols = ["id", "dt"]
> for col in timeSeries.columns:
> if col in skipCols:
> continue
> cols.append(mean(col).over(w).alias("mean_7_"+col))
> cols.append(count(col).over(w).alias("count_7_"+col))
> cols.append(sum(col).over(w).alias("sum_7_"+col))
> cols.append(min(col).over(w).alias("min_7_"+col))
> cols.append(max(col).over(w).alias("max_7_"+col))
> df = timeSeries.select(cols)
> df.orderBy('id', 
> 'dt').write.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").save("file:///tmp/spark-bug-out.csv")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18245) Improving support for bucketed table

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18245:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Improving support for bucketed table
> 
>
> Key: SPARK-18245
> URL: https://issues.apache.org/jira/browse/SPARK-18245
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> This is an umbrella ticket for improving various execution planning for 
> bucketed tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14098) Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-14098:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Generate Java code to build CachedColumnarBatch and get values from 
> CachedColumnarBatch when DataFrame.cache() is called
> 
>
> Key: SPARK-14098
> URL: https://issues.apache.org/jira/browse/SPARK-14098
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> [Here|https://docs.google.com/document/d/1-2BnW5ibuHIeQzmHEGIGkEcuMUCTk87pmPis2DKRg-Q/edit?usp=sharing]
>  is a design document for this change (***TODO: Update the document***).
> This JIRA implements a new in-memory cache feature used by DataFrame.cache 
> and Dataset.cache. The followings are basic design based on discussions with 
> Sameer, Weichen, Xiao, Herman, and Nong.
> * Use ColumnarBatch with ColumnVector that are common data representations 
> for columnar storage
> * Use multiple compression schemes (such as RLE, intdelta, and so on) for each 
> ColumnVector in ColumnarBatch, depending on its data type
> * Generate code that is simple and specialized for each in-memory cache to 
> build an in-memory cache
> * Generate code that directly reads data from ColumnVector for the in-memory 
> cache by whole-stage codegen.
> * Enhance ColumnVector to keep UnsafeArrayData
> * Use primitive-type array for primitive uncompressed data type in 
> ColumnVector
> * Use byte[] for UnsafeArrayData and compressed data
> Based on this design, this JIRA generates two kinds of Java code for 
> DataFrame.cache()/Dataset.cache()
> * Generate Java code to build CachedColumnarBatch, which keeps data in 
> ColumnarBatch
> * Generate Java code to get a value of each column from ColumnarBatch
> ** a. Get a value directly from ColumnarBatch in code generated by whole-stage 
> codegen (primary path)
> ** b. Get a value through an iterator if whole-stage codegen is disabled (e.g. the 
> number of columns is more than 100), as a backup path



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19014) support complex aggregate buffer in HashAggregateExec

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19014:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> support complex aggregate buffer in HashAggregateExec
> -
>
> Key: SPARK-19014
> URL: https://issues.apache.org/jira/browse/SPARK-19014
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19989) Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19989:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite
> 
>
> Key: SPARK-19989
> URL: https://issues.apache.org/jira/browse/SPARK-19989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming, Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>Priority: Minor
>  Labels: flaky-test
>
> This test failed recently here: 
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/stress_test_with_multiple_topics_and_partitions/
> And based on Josh's dashboard 
> (https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressSuite&test_name=stress+test+with+multiple+topics+and+partitions),
>  seems to fail a few times every month.  Here's the full error from the most 
> recent failure:
> Error Message
> {code}
> org.scalatest.exceptions.TestFailedException:  Error adding data: replication 
> factor: 1 larger than available brokers: 0 
> kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)  
> kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)  
> org.apache.spark.sql.kafka010.KafkaTestUtils.createTopic(KafkaTestUtils.scala:173)
>   
> org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:903)
>   
> org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:901)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:93)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:92)
>   scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)  
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData.addData(KafkaSourceSuite.scala:92)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:494)
> {code}
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> Error adding data: replication factor: 1 larger than available brokers: 0
> kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)
>   kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)
>   
> org.apache.spark.sql.kafka010.KafkaTestUtils.createTopic(KafkaTestUtils.scala:173)
>   
> org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:903)
>   
> org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:901)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:93)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:92)
>   scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData.addData(KafkaSourceSuite.scala:92)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:494)
> == Progress ==
>AssertOnQuery(, )
>CheckAnswer: 
>StopStream
>
> StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@5d888be0,Map())
>AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), 
> data = Range(0, 1, 2, 3, 4, 5, 6, 7, 8), message = )
>CheckAnswer: [1],[2],[3],[4],[5],[6],[7],[8],[9]
>StopStream
>
> StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@1be724ee,Map())
>AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), 
> data = Range(9, 10, 11, 12, 13, 14), message = )
>CheckAnswer: 
> [1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15]
>StopStream
>AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), 
> data = Range(), message = )
> => AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, 
> stress3), data = Range(15), message = Add topic stress7)
>AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, 
> stress3), data = Range(16, 17, 18, 19, 20, 21, 22), message = Add partition)
>AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, 
> stress3), data = Range(23, 24), message = Add partition)
>AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, 
> stress5, stress3), data = Range(

[jira] [Updated] (SPARK-17915) Prepare ColumnVector implementation for UnsafeData

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17915:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Prepare ColumnVector implementation for UnsafeData
> --
>
> Key: SPARK-17915
> URL: https://issues.apache.org/jira/browse/SPARK-17915
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kazuaki Ishizaki
>
> Current implementations of {{ColumnarVector}} are {{OnHeapColumnarVector}} 
> and {{OffHeapColumnarVector}}, which are optimized for reading data from 
> Parquet. If they get an array, a map, or a struct from an {{Unsafe}}-related 
> data structure, it is inefficient.
> This JIRA prepares a new implementation, {{OnHeapUnsafeColumnarVector}}, that 
> is optimized for reading data from an {{Unsafe}}-related data structure.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18134) SQL: MapType in Group BY and Joins not working

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18134:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> SQL: MapType in Group BY and Joins not working
> --
>
> Key: SPARK-18134
> URL: https://issues.apache.org/jira/browse/SPARK-18134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1, 
> 2.1.0
>Reporter: Christian Zorneck
>
> Since version 1.5 and issue SPARK-9415, MapTypes can no longer be used in 
> GROUP BY and join clauses. This makes it incompatible with HiveQL: a Hive 
> feature was removed from Spark, making Spark incompatible with various 
> HiveQL statements.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18455) General support for correlated subquery processing

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18455:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> General support for correlated subquery processing
> --
>
> Key: SPARK-18455
> URL: https://issues.apache.org/jira/browse/SPARK-18455
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Nattavut Sutyanyong
> Attachments: SPARK-18455-scoping-doc.pdf
>
>
> Subquery support has been introduced in Spark 2.0. The initial implementation 
> covers the most common subquery use case: the ones used in TPC queries for 
> instance.
> Spark currently supports the following subqueries:
> * Uncorrelated Scalar Subqueries. All cases are supported.
> * Correlated Scalar Subqueries. We only allow subqueries that are aggregated 
> and use equality predicates.
> * Predicate Subqueries. IN or Exists type of queries. We allow most 
> predicates, except when they are pulled from under an Aggregate or Window 
> operator. In that case we only support equality predicates.
> However this does not cover the full range of possible subqueries. This, in 
> part, has to do with the fact that we currently rewrite all correlated 
> subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join.
> We currently lack support for the following use cases:
> * The use of predicate subqueries in a projection.
> * The use of non-equality predicates below Aggregate and/or Window operators.
> * The use of non-Aggregate subqueries for correlated scalar subqueries.
> This JIRA aims to lift these current limitations in subquery processing.
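For concreteness, two of the currently unsupported shapes written out as queries. These are illustrative only, with hypothetical tables t1(a, b) and t2(c, d).
{code}
// Predicate subquery in a projection (currently unsupported).
spark.sql("SELECT a, EXISTS (SELECT 1 FROM t2 WHERE t2.c = t1.a) AS has_match FROM t1")

// Correlated scalar subquery with a non-equality correlation predicate (currently unsupported).
spark.sql("SELECT a, (SELECT max(d) FROM t2 WHERE t2.c > t1.a) AS m FROM t1")
{code}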



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15690:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by their 
> partition id, and then writes the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tends to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single-node. When in a single node operating against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use in-memory radix sort 
> to do data shuffling on a single node, and still gracefully fallback to disk 
> if the data size does not fit in memory. Given the number of partitions is 
> usually small (say less than 256), it'd require only a single pass to do the 
> radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13184) Support minPartitions parameter for JSON and CSV datasources as options

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13184:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Support minPartitions parameter for JSON and CSV datasources as options
> ---
>
> Key: SPARK-13184
> URL: https://issues.apache.org/jira/browse/SPARK-13184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> After looking through the pull requests below at Spark CSV datasources,
> https://github.com/databricks/spark-csv/pull/256
> https://github.com/databricks/spark-csv/issues/141
> https://github.com/databricks/spark-csv/pull/186
> It looks like Spark might need to be able to set {{minPartitions}}.
> {{repartition()}} or {{coalesce()}} can be alternatives, but it looks like they need 
> to shuffle the data in most cases.
> Although I am still not sure if Spark needs this, I will open this ticket just 
> for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13682) Finalize the public API for FileFormat

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13682:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Finalize the public API for FileFormat
> --
>
> Key: SPARK-13682
> URL: https://issues.apache.org/jira/browse/SPARK-13682
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>
> The current file format interface needs to be cleaned up before it is 
> acceptable for public consumption:
>  - Have a version that takes Row and does a conversion, hide the internal API.
>  - Remove bucketing
>  - Remove RDD and the broadcastedConf
>  - Remove SQLContext (maybe include SparkSession?)
>  - Pass a better conf object
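> One possible shape of a cleaned-up, Row-based public interface (purely a sketch 
> for discussion; the trait, method names, and the simplified {{PartitionedFile}} 
> below are assumptions, not the actual API):
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.StructType
> 
> // Minimal stand-in for the file handle a reader would receive.
> case class PartitionedFile(path: String, start: Long, length: Long)
> 
> trait PublicFileFormat {
>   // Infer the schema of the files at the given paths.
>   def inferSchema(options: Map[String, String], paths: Seq[String]): Option[StructType]
> 
>   // Return external Rows; the InternalRow conversion stays hidden.
>   def buildReader(
>       dataSchema: StructType,
>       requiredSchema: StructType,
>       options: Map[String, String]): PartitionedFile => Iterator[Row]
> }
> {code}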



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9221) Support IntervalType in Range Frame

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9221:

Target Version/s: 2.3.0  (was: 2.2.0)

> Support IntervalType in Range Frame
> ---
>
> Key: SPARK-9221
> URL: https://issues.apache.org/jira/browse/SPARK-9221
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>
> Support the IntervalType in window range frames, as mentioned in the 
> conclusion of the Databricks blog 
> [post|https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html]
>  on window functions.
> This actually requires us to support Literals instead of Integer constants in 
> Range Frames. The following things will have to be modified:
> * org.apache.spark.sql.hive.HiveQl
> * org.apache.spark.sql.catalyst.expressions.SpecifiedWindowFrame
> * org.apache.spark.sql.execution.Window
> * org.apache.spark.sql.expressions.Window
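> A query this feature would enable looks like the following (table and column 
> names are made up, and the interval frame syntax is not accepted at the time of 
> writing):
> {code}
> spark.sql("""
>   SELECT order_ts,
>          sum(amount) OVER (ORDER BY order_ts
>                            RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW)
>            AS amount_last_7_days
>   FROM orders
> """)
> {code}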



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15689) Data source API v2

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15689:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency 
> on DataFrame/SQLContext, making data source API compatibility dependent on 
> the upper-level API. The current data source API is also row-oriented only 
> and has to go through an expensive conversion from external data types to 
> internal data types.
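> A deliberately tiny sketch of what such a contract could look like (the traits 
> and method names are assumptions for discussion, not the actual API): a narrow 
> surface, column-batch reads, and filter push-down.
> {code}
> import org.apache.spark.sql.sources.Filter
> import org.apache.spark.sql.types.StructType
> 
> // Placeholder for a real columnar vector abstraction.
> trait ColumnBatch {
>   def numRows(): Int
>   def column(ordinal: Int): Array[AnyRef]
> }
> 
> trait DataSourceV2ReaderSketch {
>   def schema(): StructType
>   // Accept what can be pushed down; return the filters that could not be handled.
>   def pushFilters(filters: Seq[Filter]): Seq[Filter]
>   def readBatches(): Iterator[ColumnBatch]
> }
> {code}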



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20319) Already quoted identifiers are getting wrapped with additional quotes

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20319:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Already quoted identifiers are getting wrapped with additional quotes
> -
>
> Key: SPARK-20319
> URL: https://issues.apache.org/jira/browse/SPARK-20319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Umesh Chaudhary
>
> The issue was caused by 
> [SPARK-16387|https://issues.apache.org/jira/browse/SPARK-16387], where 
> reserved SQL words are honored by wrapping column names in quotes. 
> In our tests we found that when column names are already explicitly quoted, 
> the Oracle JDBC driver throws: 
> java.sql.BatchUpdateException: ORA-01741: illegal zero-length identifier 
> at 
> oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:12296)
>  
> at 
> oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:246)
>  
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:597)
>  
> and the Cassandra JDBC driver throws: 
> 17/04/12 19:03:48 ERROR executor.Executor: Exception in task 0.0 in stage 5.0 
> (TID 6)
> java.sql.SQLSyntaxErrorException: [FMWGEN][Cassandra JDBC 
> Driver][Cassandra]syntax error or access rule violation: base table or view 
> not found: 
>   at weblogic.jdbc.cassandrabase.ddcl.b(Unknown Source)
>   at weblogic.jdbc.cassandrabase.ddt.a(Unknown Source)
>   at weblogic.jdbc.cassandrabase.BaseConnection.prepareStatement(Unknown 
> Source)
>   at weblogic.jdbc.cassandrabase.BaseConnection.prepareStatement(Unknown 
> Source)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.insertStatement(JdbcUtils.scala:118)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:571)
> CC: [~rxin] , [~joshrosen]
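> A minimal way to reproduce the reported behavior (connection details are 
> placeholders): a column name that already carries explicit quotes gets wrapped 
> in a second pair of quotes on write, producing an illegal identifier.
> {code}
> import org.apache.spark.sql.SaveMode
> 
> // Column name with explicit quotes, as described above.
> val df = spark.range(3).toDF("\"ID\"")
> 
> df.write
>   .mode(SaveMode.Append)
>   .jdbc("jdbc:oracle:thin:@//host:1521/service",  // placeholder URL
>         "TARGET_TABLE",
>         new java.util.Properties())               // credentials omitted
> {code}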



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18950) Report conflicting fields when merging two StructTypes.

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18950:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Report conflicting fields when merging two StructTypes.
> ---
>
> Key: SPARK-18950
> URL: https://issues.apache.org/jira/browse/SPARK-18950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Priority: Minor
>  Labels: starter
>
> Currently, {{StructType.merge()}} only reports data types of conflicting 
> fields when merging two incompatible schemas. It would be nice to also report 
> the field names for easier debugging.
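> A small example of the current behavior ({{merge()}} is {{private[sql]}}, so 
> treat this as a sketch of the internal call rather than user-facing code):
> {code}
> import org.apache.spark.sql.types._
> 
> val a = StructType(Seq(StructField("id", LongType), StructField("price", DoubleType)))
> val b = StructType(Seq(StructField("id", LongType), StructField("price", StringType)))
> 
> // Throws, reporting the conflicting data types (DoubleType vs StringType)
> // but not that the field in question is "price".
> a.merge(b)
> {code}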



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15117) Generate code that get a value in each compressed column from CachedBatch when DataFrame.cache() is called

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15117:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Generate code that get a value in each compressed column from CachedBatch 
> when DataFrame.cache() is called
> --
>
> Key: SPARK-15117
> URL: https://issues.apache.org/jira/browse/SPARK-15117
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> Once SPARK-14098 is merged, we will migrate a feature into this JIRA entry.
> When DataFrame.cache() is called, data is stored in column-oriented form 
> in CachedBatch. The current Catalyst generates a Java program that gets the 
> value of a column from an InternalRow translated from CachedBatch. This issue 
> instead generates Java code that gets a column value directly from CachedBatch. 
> This JIRA entry covers the other primitive types (boolean/byte/short/int/long), 
> whose columns may be compressed.
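> An illustrative query shape that would exercise this path (made-up data, for 
> illustration only): primitive columns cached as compressed column batches and 
> then scanned, so the proposed codegen could read values directly from each 
> CachedBatch instead of going through an InternalRow.
> {code}
> import org.apache.spark.sql.functions.sum
> 
> val df = spark.range(0, 1000000L)
>   .selectExpr("id", "cast(id % 2 = 0 AS boolean) AS flag", "cast(id AS int) AS i")
> df.cache()
> 
> // Scanning the cached, compressed primitive columns.
> df.filter("flag").agg(sum("i")).show()
> {code}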



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17626:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> TPC-DS performance improvements using star-schema heuristics
> 
>
> Key: SPARK-17626
> URL: https://issues.apache.org/jira/browse/SPARK-17626
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Priority: Critical
> Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schemas, which are star schemas 
> with dimensions linking to other dimensions. A star schema consists of a fact 
> table referencing a number of dimension tables. The fact table holds the main 
> data about a business. A dimension table, usually much smaller, describes data 
> reflecting a dimension/attribute of the business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of 
> sub-optimal execution plans of large fact tables joins. Manual rewrite of 
> some of the queries into selective fact-dimensions joins resulted in 
> significant performance improvement. This prompted us to develop a simple 
> join reordering algorithm based on star schema detection. Performance 
> testing using a *1TB TPC-DS workload* shows an overall improvement of *19%*. 
> \\
> \\
> *Summary of the results:*
> {code}
> Passed 99
> Failed  0
> Total q time (s)   14,962
> Max time1,467
> Min time3
> Mean time 145
> Geomean44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = Degradation):
> {code}
> End to end improved (%)  -19% 
> Mean time improved (%)   -19%
> Geomean improved (%) -24%
> End to end improved (seconds)  -3,603
> Number of queries improved (>10%)  45
> Number of queries degraded (>10%)   6
> Number of queries unchanged48
> Top 10 queries improved (%)  -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 
> v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS 
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star schema detection
> # A new selectivity hint to specify the selectivity of the predicates over base 
> tables. The selectivity hint is optional and was not used in the above TPC-DS 
> tests. 
> \\
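> An illustrative star-schema query shape (assumed schema, not an actual TPC-DS 
> query): the heuristic would join the fact table to its most selective 
> dimensions first instead of following the order written in the FROM clause.
> {code}
> spark.sql("""
>   SELECT d.d_year, s.s_state, sum(f.ss_net_paid) AS revenue
>   FROM   store_sales f
>          JOIN date_dim d ON f.ss_sold_date_sk = d.d_date_sk
>          JOIN store    s ON f.ss_store_sk     = s.s_store_sk
>          JOIN item     i ON f.ss_item_sk      = i.i_item_sk
>   WHERE  d.d_year = 2002 AND i.i_category = 'Books'
>   GROUP BY d.d_year, s.s_state
> """)
> {code}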



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15867) Use bucket files for TABLESAMPLE BUCKET

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15867:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Use bucket files for TABLESAMPLE BUCKET
> ---
>
> Key: SPARK-15867
> URL: https://issues.apache.org/jira/browse/SPARK-15867
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Andrew Or
>
> {code}
> SELECT * FROM boxes TABLESAMPLE (BUCKET 3 OUT OF 16)
> {code}
> In Hive, this would select the 3rd bucket out of every 16 buckets 
> in the table. E.g. if the table were clustered into 32 buckets, this would 
> sample the 3rd and the 19th bucket. (See 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling)
> In Spark, however, we simply sample 3/16 of the input rows.
> We should either not support this in Spark or do it in a way that is consistent 
> with Hive.
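> A small sketch of the Hive semantics described above ({{sampledBuckets}} is a 
> helper made up for this example): {{BUCKET x OUT OF y}} reads every y-th bucket 
> starting from bucket x (1-based).
> {code}
> // Which physical buckets Hive would read for TABLESAMPLE (BUCKET x OUT OF y).
> def sampledBuckets(x: Int, y: Int, numBuckets: Int): Seq[Int] =
>   x to numBuckets by y
> 
> sampledBuckets(3, 16, 32)   // Seq(3, 19)
> {code}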



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18394) Executing the same query twice in a row results in CodeGenerator cache misses

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18394:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Executing the same query twice in a row results in CodeGenerator cache misses
> -
>
> Key: SPARK-18394
> URL: https://issues.apache.org/jira/browse/SPARK-18394
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: HiveThriftServer2 running on branch-2.0 on Mac laptop
>Reporter: Jonny Serencsa
>
> Executing the query:
> {noformat}
> select
> l_returnflag,
> l_linestatus,
> sum(l_quantity) as sum_qty,
> sum(l_extendedprice) as sum_base_price,
> sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
> sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
> avg(l_quantity) as avg_qty,
> avg(l_extendedprice) as avg_price,
> avg(l_discount) as avg_disc,
> count(*) as count_order
> from
> lineitem_1_row
> where
> l_shipdate <= date_sub('1998-12-01', '90')
> group by
> l_returnflag,
> l_linestatus
> ;
> {noformat}
> twice (in succession) will result in CodeGenerator cache misses in BOTH 
> executions. Since the query is identical, I would expect the same code to be 
> generated. 
> It turns out the generated code is not exactly the same, resulting in cache 
> misses on lookup in the CodeGenerator cache, even though the code is 
> functionally equivalent. 
> Below is (a portion of) the generated code from two runs of the query:
> run-1
> {noformat}
> import java.nio.ByteBuffer;
> import java.nio.ByteOrder;
> import scala.collection.Iterator;
> import org.apache.spark.sql.types.DataType;
> import org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder;
> import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter;
> import org.apache.spark.sql.execution.columnar.MutableUnsafeRow;
> public SpecificColumnarIterator generate(Object[] references) {
> return new SpecificColumnarIterator();
> }
> class SpecificColumnarIterator extends 
> org.apache.spark.sql.execution.columnar.ColumnarIterator {
> private ByteOrder nativeOrder = null;
> private byte[][] buffers = null;
> private UnsafeRow unsafeRow = new UnsafeRow(7);
> private BufferHolder bufferHolder = new BufferHolder(unsafeRow);
> private UnsafeRowWriter rowWriter = new UnsafeRowWriter(bufferHolder, 7);
> private MutableUnsafeRow mutableRow = null;
> private int currentRow = 0;
> private int numRowsInBatch = 0;
> private scala.collection.Iterator input = null;
> private DataType[] columnTypes = null;
> private int[] columnIndexes = null;
> private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor accessor;
> private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor 
> accessor1;
> private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor 
> accessor2;
> private org.apache.spark.sql.execution.columnar.StringColumnAccessor 
> accessor3;
> private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor 
> accessor4;
> private org.apache.spark.sql.execution.columnar.StringColumnAccessor 
> accessor5;
> private org.apache.spark.sql.execution.columnar.StringColumnAccessor 
> accessor6;
> public SpecificColumnarIterator() {
> this.nativeOrder = ByteOrder.nativeOrder();
> this.buffers = new byte[7][];
> this.mutableRow = new MutableUnsafeRow(rowWriter);
> }
> public void initialize(Iterator input, DataType[] columnTypes, int[] 
> columnIndexes) {
> this.input = input;
> this.columnTypes = columnTypes;
> this.columnIndexes = columnIndexes;
> }
> public boolean hasNext() {
> if (currentRow < numRowsInBatch) {
> return true;
> }
> if (!input.hasNext()) {
> return false;
> }
> org.apache.spark.sql.execution.columnar.CachedBatch batch = 
> (org.apache.spark.sql.execution.columnar.CachedBatch) input.next();
> currentRow = 0;
> numRowsInBatch = batch.numRows();
> for (int i = 0; i < columnIndexes.length; i ++) {
> buffers[i] = batch.buffers()[columnIndexes[i]];
> }
> accessor = new 
> org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[0]).order(nativeOrder));
> accessor1 = new 
> org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[1]).order(nativeOrder));
> accessor2 = new 
> org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[2]).order(nativeOrder));
> accessor3 = new 
> org.apache.spark.sql.execution.columnar.StringColumnAccessor(ByteBuffer.wrap(buffers[3]).order(nativeOrder));
> accessor4 = new 
> org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[4]).order(nativeOrder));
> accessor5 = new 
> org.apache.spark.sql.execution.columnar.StringColumnAccessor(ByteBuffer.wrap(buffers[5]).order(nativeOrder));
> ac

[jira] [Updated] (SPARK-18891) Support for specific collection types

2017-06-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18891:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Support for specific collection types
> -
>
> Key: SPARK-18891
> URL: https://issues.apache.org/jira/browse/SPARK-18891
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Michael Armbrust
>Priority: Critical
>
> Encoders treat all collections the same (e.g. {{Seq}} vs {{List}}), which 
> forces users to define classes only with the most generic type.
> An [example 
> error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]:
> {code}
> case class SpecificCollection(aList: List[Int])
> Seq(SpecificCollection(1 :: Nil)).toDS().collect()
> {code}
> {code}
> java.lang.RuntimeException: Error while decoding: 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 98, Column 120: No applicable constructor/method found 
> for actual parameters "scala.collection.Seq"; candidates are: 
> "line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)"
> {code}
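> For reference, the workaround implied above (declaring the field with the 
> generic {{Seq}} type, which the existing encoders handle) looks like this:
> {code}
> import spark.implicits._
> 
> // Works today: the encoder only knows the generic Seq type.
> case class GenericCollection(aList: Seq[Int])
> Seq(GenericCollection(1 :: Nil)).toDS().collect()
> {code}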



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


