[jira] [Updated] (SPARK-29647) Use Python 3.7 in GitHub Action to recover lint-python
[ https://issues.apache.org/jira/browse/SPARK-29647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-29647:
----------------------------------
    Component/s: PySpark

> Use Python 3.7 in GitHub Action to recover lint-python
> ------------------------------------------------------
>
> Key: SPARK-29647
> URL: https://issues.apache.org/jira/browse/SPARK-29647
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Tests
> Affects Versions: 2.4.5
> Reporter: Dongjoon Hyun
> Priority: Minor
>
> `branch-2.4` seems incompatible with Python 3.7.
> Currently, GitHub Action on `branch-2.4` is broken.
> This issue aims to recover the GitHub Action `lint-python` first.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29647) Use Python 3.7 in GitHub Action to recover lint-python
Dongjoon Hyun created SPARK-29647:
-------------------------------------
Summary: Use Python 3.7 in GitHub Action to recover lint-python
Key: SPARK-29647
URL: https://issues.apache.org/jira/browse/SPARK-29647
Project: Spark
Issue Type: Bug
Components: Tests
Affects Versions: 2.4.5
Reporter: Dongjoon Hyun

`branch-2.4` seems incompatible with Python 3.7.
Currently, GitHub Action on `branch-2.4` is broken.
This issue aims to recover the GitHub Action `lint-python` first.
[jira] [Created] (SPARK-29646) Allow pyspark version name format `${versionNumber}-preview` in release script
Xingbo Jiang created SPARK-29646:
------------------------------------
Summary: Allow pyspark version name format `${versionNumber}-preview` in release script
Key: SPARK-29646
URL: https://issues.apache.org/jira/browse/SPARK-29646
Project: Spark
Issue Type: Improvement
Components: Build
Affects Versions: 3.0.0
Reporter: Xingbo Jiang

We shall allow the pyspark version name format `${versionNumber}-preview` in the release script, to support preview releases.
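The version-name format the ticket asks for can be sketched as a small regex check. This is only an illustration of the requested format; `is_valid_spark_version` and the exact pattern are hypothetical, not part of Spark's actual release script:

```python
import re

# Accept plain versions like "3.0.0" and preview versions like
# "3.0.0-preview" or "3.0.0-preview2" (a sketch of the format
# discussed in SPARK-29646, not the release script's real rule).
VERSION_RE = re.compile(r"^\d+\.\d+\.\d+(-preview\d*)?$")

def is_valid_spark_version(name: str) -> bool:
    return VERSION_RE.fullmatch(name) is not None

print(is_valid_spark_version("3.0.0"))           # True
print(is_valid_spark_version("3.0.0-preview"))   # True
print(is_valid_spark_version("3.0.0-preview2"))  # True
print(is_valid_spark_version("3.0.0_preview"))   # False
```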
[jira] [Created] (SPARK-29645) ML add param RelativeError
zhengruifeng created SPARK-29645:
------------------------------------
Summary: ML add param RelativeError
Key: SPARK-29645
URL: https://issues.apache.org/jira/browse/SPARK-29645
Project: Spark
Issue Type: Improvement
Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng

It makes sense to expose {{RelativeError}} to end users, since it controls both the precision and the memory overhead.
[QuantileDiscretizer|https://github.com/apache/spark/compare/master...zhengruifeng:add_relative_err?expand=1#diff-bf4cb764860f82d632ac0730e3d8c605] has already added this param, while the other algorithms have not yet.
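The precision side of that trade-off can be illustrated in plain Python: a relative error of ε means any element whose rank is within ε·n of the exact quantile's rank is an acceptable answer. This is only a sketch of the contract (not Spark's Greenwald-Khanna implementation); `rank_error` is a hypothetical helper:

```python
def rank_error(data, candidate, q):
    """Rank distance (as a fraction of n) between `candidate`
    and the exact q-quantile of `data`."""
    s = sorted(data)
    n = len(s)
    target_rank = q * (n - 1)
    cand_rank = s.index(candidate)
    return abs(cand_rank - target_rank) / n

data = list(range(100))
# With relativeError = 0.05, any element whose rank is within 5 ranks
# of the exact median rank is an acceptable answer for q = 0.5.
relative_error = 0.05
acceptable = [x for x in data if rank_error(data, x, 0.5) <= relative_error]
print(min(acceptable), max(acceptable))  # 45 54
```

A larger ε admits more candidates, which is what lets an approximate-quantile sketch keep fewer samples in memory.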
[jira] [Commented] (SPARK-29372) Codegen grows beyond 64 KB for more columns in case of SupportsScanColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-29372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962663#comment-16962663 ]

Razibul Hossain commented on SPARK-29372:
-----------------------------------------
I am facing the same kind of issue. Error details follow:

{code}
CodeGenerator:91 - failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0" grows beyond 64 KB
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0" grows beyond 64 KB
  at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
  at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
  at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
  at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
  at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
  at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1417)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
  at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
  at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
  at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
  at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
  at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
  at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
  at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
  at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:579)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:578)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.doExecute(DataSourceV2ScanExec.scala:90)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:578)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExec.doExecute(DataSourceV2ScanExec.scala:90)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at
{code}
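The failure above comes from the JVM's hard limit of 64 KB of bytecode per method: when whole-stage codegen emits one huge `processNext()` over many columns, Janino cannot compile it. Spark's mitigation is to split generated code into smaller helper methods (or fall back to non-codegen execution). Below is a toy Python sketch of that splitting idea only; the names (`generate_projection`, `project_N`) are hypothetical and this is not Spark's actual code generator:

```python
def generate_projection(num_columns, chunk_size=100):
    """Generate source for a projection over many columns, split into
    helper functions so no single function grows too large -- the same
    idea Spark uses to stay under the JVM's 64 KB per-method limit."""
    helpers, calls = [], []
    for start in range(0, num_columns, chunk_size):
        cols = range(start, min(start + chunk_size, num_columns))
        body = "\n".join(f"    out[{i}] = row[{i}] + 1" for i in cols)
        helpers.append(f"def project_{start}(row, out):\n{body}")
        calls.append(f"    project_{start}(row, out)")
    main = "def project(row):\n    out = [None] * %d\n%s\n    return out" % (
        num_columns, "\n".join(calls))
    return "\n\n".join(helpers + [main])

namespace = {}
exec(generate_projection(250), namespace)  # three helpers of <= 100 columns each
print(namespace["project"](list(range(250)))[:3])  # [1, 2, 3]
```

Each helper stays bounded in size no matter how many columns the projection covers, which is exactly what keeps any single compiled method under the limit.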
[jira] [Comment Edited] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962658#comment-16962658 ]

Suchintak Patnaik edited comment on SPARK-29621 at 10/30/19 3:38 AM:
---------------------------------------------------------------------
[~hyukjin.kwon] As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, it should not allow referencing only the internal corrupt record column, right? Then how come df.filter(df._corrupt_record.isNotNull()).count() raises an error while df.filter(df._corrupt_record.isNotNull()).show() doesn't?

was (Author: patnaik):
[~gurwls223] It should not allow referencing only the internal corrupt record column, right? Then how come filter.count() raises an error while filter.show() doesn't?

> Querying internal corrupt record column should not be allowed in filter operation
> ---------------------------------------------------------------------------------
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Suchintak Patnaik
> Priority: Major
> Labels: PySpark, SparkSQL
>
> As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_
> But it is allowed when querying only the internal corrupt record column in a *filter* operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("_corrupt_record", StringType(), False),
>     StructField("Name", StringType(), False),
>     StructField("Colour", StringType(), True),
>     StructField("Price", IntegerType(), True),
>     StructField("Quantity", IntegerType(), True)])
> df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
> {code}
[jira] [Commented] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962658#comment-16962658 ]

Suchintak Patnaik commented on SPARK-29621:
-------------------------------------------
[~gurwls223] It should not allow referencing only the internal corrupt record column, right? Then how come filter.count() raises an error while filter.show() doesn't?

> Querying internal corrupt record column should not be allowed in filter operation
> ---------------------------------------------------------------------------------
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Suchintak Patnaik
> Priority: Major
> Labels: PySpark, SparkSQL
>
> As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_
> But it is allowed when querying only the internal corrupt record column in a *filter* operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("_corrupt_record", StringType(), False),
>     StructField("Name", StringType(), False),
>     StructField("Colour", StringType(), True),
>     StructField("Price", IntegerType(), True),
>     StructField("Quantity", IntegerType(), True)])
> df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
> {code}
[jira] [Updated] (SPARK-29644) ShortType is wrongly set as Int in JDBCUtils.scala
[ https://issues.apache.org/jira/browse/SPARK-29644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shiv Prashant Sood updated SPARK-29644:
---------------------------------------
Description:
@maropu pointed out this issue during the [PR 25344|https://github.com/apache/spark/pull/25344] review discussion.

In [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala], line 547:

{code}
case ShortType =>
  (stmt: PreparedStatement, row: Row, pos: Int) =>
    stmt.setInt(pos + 1, row.getShort(pos))
{code}

I don't see any reproducible issue, but this is clearly a problem that must be fixed.

was:
@maropu pointed out this issue in [PR 25344|https://github.com/apache/spark/pull/25344].

In [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala], line 547:

{code}
case ShortType =>
  (stmt: PreparedStatement, row: Row, pos: Int) =>
    stmt.setInt(pos + 1, row.getShort(pos))
{code}

> ShortType is wrongly set as Int in JDBCUtils.scala
> --------------------------------------------------
>
> Key: SPARK-29644
> URL: https://issues.apache.org/jira/browse/SPARK-29644
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.4.4, 3.0.0
> Reporter: Shiv Prashant Sood
> Priority: Minor
>
> @maropu pointed out this issue during the [PR 25344|https://github.com/apache/spark/pull/25344] review discussion.
> In [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala], line 547:
> {code}
> case ShortType =>
>   (stmt: PreparedStatement, row: Row, pos: Int) =>
>     stmt.setInt(pos + 1, row.getShort(pos))
> {code}
> I don't see any reproducible issue, but this is clearly a problem that must be fixed.
[jira] [Created] (SPARK-29644) ShortType is wrongly set as Int in JDBCUtils.scala
Shiv Prashant Sood created SPARK-29644:
------------------------------------------
Summary: ShortType is wrongly set as Int in JDBCUtils.scala
Key: SPARK-29644
URL: https://issues.apache.org/jira/browse/SPARK-29644
Project: Spark
Issue Type: New Feature
Components: SQL
Affects Versions: 2.4.4, 3.0.0
Reporter: Shiv Prashant Sood

@maropu pointed out this issue in [PR 25344|https://github.com/apache/spark/pull/25344].

In [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala], line 547:

{code}
case ShortType =>
  (stmt: PreparedStatement, row: Row, pos: Int) =>
    stmt.setInt(pos + 1, row.getShort(pos))
{code}
[jira] [Commented] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962588#comment-16962588 ]

Hyukjin Kwon commented on SPARK-27763:
--------------------------------------
+1 !!

> Port test cases from PostgreSQL to Spark SQL
> --------------------------------------------
>
> Key: SPARK-27763
> URL: https://issues.apache.org/jira/browse/SPARK-27763
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Xiao Li
> Assignee: Yuming Wang
> Priority: Major
>
> To improve the test coverage, we can port the regression tests from other popular open source projects to Spark SQL. PostgreSQL is one of the best SQL systems. Below are the links to the test cases and expected results.
> * Regression test cases: [https://github.com/postgres/postgres/tree/master/src/test/regress/sql]
> * Expected results: [https://github.com/postgres/postgres/tree/master/src/test/regress/expected]
> Spark SQL does not support all the feature sets of PostgreSQL. In the current stage, we should first comment out these test cases and create the corresponding JIRAs in SPARK-27764. We can discuss and prioritize which features we should support. Also, these PostgreSQL regression tests could expose existing bugs in Spark SQL. We should create JIRAs for those as well and track them in SPARK-27764.
[jira] [Updated] (SPARK-29620) UnsafeKVExternalSorterSuite failure on bigendian system
[ https://issues.apache.org/jira/browse/SPARK-29620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-29620:
---------------------------------
Description:
{code}
spark/sql/core# ../../build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite test
{code}

{code}
UnsafeKVExternalSorterSuite:
12:24:24.305 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
- kv sorting key schema [] and value schema [] *** FAILED ***
  java.lang.AssertionError: sizeInBytes (4) should be a multiple of 8
  at org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:168)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorter$KVSorterIterator.next(UnsafeKVExternalSorter.java:297)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite.org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter(UnsafeKVExternalSorterSuite.scala:145)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply$mcV$sp(UnsafeKVExternalSorterSuite.scala:86)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  ...
- kv sorting key schema [int] and value schema [] *** FAILED ***
  java.lang.AssertionError: sizeInBytes (20) should be a multiple of 8
  at org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:168)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorter$KVSorterIterator.next(UnsafeKVExternalSorter.java:297)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite.org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter(UnsafeKVExternalSorterSuite.scala:145)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply$mcV$sp(UnsafeKVExternalSorterSuite.scala:86)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  ...
- kv sorting key schema [] and value schema [int] *** FAILED ***
  java.lang.AssertionError: sizeInBytes (20) should be a multiple of 8
  at org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:168)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorter$KVSorterIterator.next(UnsafeKVExternalSorter.java:297)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite.org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter(UnsafeKVExternalSorterSuite.scala:145)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply$mcV$sp(UnsafeKVExternalSorterSuite.scala:86)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorterSuite$$anonfun$org$apache$spark$sql$execution$UnsafeKVExternalSorterSuite$$testKVSorter$1.apply(UnsafeKVExternalSorterSuite.scala:86)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  ...
- kv sorting key schema [int] and value schema [float,float,double,string,float] *** FAILED ***
  java.lang.AssertionError: sizeInBytes (2732) should be a multiple of 8
  at org.apache.spark.sql.catalyst.expressions.UnsafeRow.pointTo(UnsafeRow.java:168)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorter$KVSorterIterator.next(UnsafeKVExternalSorter.java:297)
  at
{code}
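The assertion that fails in each test is UnsafeRow's requirement that its backing buffer size be 8-byte aligned; every reported size (4, 20, 2732) is not a multiple of 8, which suggests the big-endian build is computing unaligned sizes somewhere. The expected alignment arithmetic can be sketched in plain Python (illustrative only, not Spark code):

```python
def round_up_to_8(size_in_bytes: int) -> int:
    """Round a byte size up to the next multiple of 8 -- the alignment
    that UnsafeRow.pointTo asserts on."""
    return (size_in_bytes + 7) & ~7

# The sizes reported in the failures, and what an aligned size would be:
for size in (4, 20, 2732):
    print(size, "->", round_up_to_8(size))  # 4 -> 8, 20 -> 24, 2732 -> 2736
```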
[jira] [Commented] (SPARK-29643) ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962581#comment-16962581 ]

Huaxin Gao commented on SPARK-29643:
------------------------------------
I am working on this.

> ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands
> --------------------------------------------------------------------------
>
> Key: SPARK-29643
> URL: https://issues.apache.org/jira/browse/SPARK-29643
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Environment: ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands
> Reporter: Huaxin Gao
> Priority: Major
[jira] [Commented] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962580#comment-16962580 ]

Hyukjin Kwon commented on SPARK-29621:
--------------------------------------
Filter is fine; it doesn't have any problem.

> Querying internal corrupt record column should not be allowed in filter operation
> ---------------------------------------------------------------------------------
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Suchintak Patnaik
> Priority: Major
> Labels: PySpark, SparkSQL
>
> As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_
> But it is allowed when querying only the internal corrupt record column in a *filter* operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("_corrupt_record", StringType(), False),
>     StructField("Name", StringType(), False),
>     StructField("Colour", StringType(), True),
>     StructField("Price", IntegerType(), True),
>     StructField("Quantity", IntegerType(), True)])
> df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
> {code}
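The reporter's puzzlement (count() fails, show() doesn't) is a general property of lazily evaluated pipelines: an error only surfaces when a consumer actually forces the part of the computation that triggers it. A plain-Python analogy with generators (only an analogy; in Spark the check happens against the particular query plan each action builds, not at element-consumption time):

```python
def records():
    # The second record is "corrupt": it is missing the "price" field.
    yield {"name": "apple", "price": 10}
    yield {"name": "banana"}

def prices():
    # Lazily parse records; a KeyError surfaces only when the bad
    # record is actually pulled by some consumer.
    for r in records():
        yield r["price"]

# A show()-like consumer looks only at the first few elements and
# never reaches the corrupt record:
first = next(iter(prices()))
print(first)  # 10

# A count()-like consumer forces the whole pipeline and hits the error:
error = None
try:
    total = sum(1 for _ in prices())
except KeyError as exc:
    error = exc
print(type(error).__name__)  # KeyError
```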
[jira] [Resolved] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-29621.
----------------------------------
    Resolution: Not A Problem

> Querying internal corrupt record column should not be allowed in filter operation
> ---------------------------------------------------------------------------------
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Suchintak Patnaik
> Priority: Major
> Labels: PySpark, SparkSQL
>
> As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_
> But it is allowed when querying only the internal corrupt record column in a *filter* operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("_corrupt_record", StringType(), False),
>     StructField("Name", StringType(), False),
>     StructField("Colour", StringType(), True),
>     StructField("Price", IntegerType(), True),
>     StructField("Quantity", IntegerType(), True)])
> df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
> {code}
[jira] [Created] (SPARK-29643) ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands
Huaxin Gao created SPARK-29643:
----------------------------------
Summary: ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands
Key: SPARK-29643
URL: https://issues.apache.org/jira/browse/SPARK-29643
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.0.0
Environment: ALTER TABLE (DROP PARTITION) should look up catalog/table like v2 commands
Reporter: Huaxin Gao
[jira] [Updated] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-29621:
---------------------------------
Description:
As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_

But it is allowed when querying only the internal corrupt record column in a *filter* operation.

{code}
from pyspark.sql.types import *
schema = StructType([
    StructField("_corrupt_record", StringType(), False),
    StructField("Name", StringType(), False),
    StructField("Colour", StringType(), True),
    StructField("Price", IntegerType(), True),
    StructField("Quantity", IntegerType(), True)])
df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
{code}

was:
As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_

But it's allowing while querying only the internal corrupt record column in case of *filter* operation.

{code}
from pyspark.sql.types import *
schema = StructType([
    StructField("_corrupt_record",StringType(),False),
    StructField("Name",StringType(),False),
    StructField("Colour",StringType(),True),
    StructField("Price",IntegerType(),True),
    StructField("Quantity",IntegerType(),True)])
df = spark.read.csv("fruit.csv",schema=schema,mode="PERMISSIVE")
df.filter(df._corrupt_record.isNotNull()).show() // Allowed
{code}

> Querying internal corrupt record column should not be allowed in filter operation
> ---------------------------------------------------------------------------------
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Suchintak Patnaik
> Priority: Major
> Labels: PySpark, SparkSQL
>
> As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_
> But it is allowed when querying only the internal corrupt record column in a *filter* operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("_corrupt_record", StringType(), False),
>     StructField("Name", StringType(), False),
>     StructField("Colour", StringType(), True),
>     StructField("Price", IntegerType(), True),
>     StructField("Quantity", IntegerType(), True)])
> df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
> {code}
[jira] [Updated] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-29621:
---------------------------------
Description:
As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_

But it's allowing while querying only the internal corrupt record column in case of *filter* operation.

{code}
from pyspark.sql.types import *
schema = StructType([
    StructField("_corrupt_record",StringType(),False),
    StructField("Name",StringType(),False),
    StructField("Colour",StringType(),True),
    StructField("Price",IntegerType(),True),
    StructField("Quantity",IntegerType(),True)])
df = spark.read.csv("fruit.csv",schema=schema,mode="PERMISSIVE")
df.filter(df._corrupt_record.isNotNull()).show() // Allowed
{code}

was:
As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_

But it's allowing while querying only the internal corrupt record column in case of *filter* operation.

from pyspark.sql.types import *
schema = StructType([StructField("_corrupt_record",StringType(),False),StructField("Name",StringType(),False),StructField("Colour",StringType(),True),StructField("Price",IntegerType(),True),StructField("Quantity",IntegerType(),True)])
df = spark.read.csv("fruit.csv",schema=schema,mode="PERMISSIVE")
df.filter(df._corrupt_record.isNotNull()).show() // Allowed

> Querying internal corrupt record column should not be allowed in filter operation
> ---------------------------------------------------------------------------------
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Suchintak Patnaik
> Priority: Major
> Labels: PySpark, SparkSQL
>
> As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_
> But it's allowing while querying only the internal corrupt record column in case of *filter* operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("_corrupt_record",StringType(),False),
>     StructField("Name",StringType(),False),
>     StructField("Colour",StringType(),True),
>     StructField("Price",IntegerType(),True),
>     StructField("Quantity",IntegerType(),True)])
> df = spark.read.csv("fruit.csv",schema=schema,mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show() // Allowed
> {code}
[jira] [Commented] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962576#comment-16962576 ] Hyukjin Kwon commented on SPARK-29625: -- [~sanysand...@gmail.com] Please don't just copy and paste the log and exception; describe your analysis and provide a self-contained reproducer. Otherwise, no one can investigate. > Spark Structure Streaming Kafka Wrong Reset Offset twice > > > Key: SPARK-29625 > URL: https://issues.apache.org/jira/browse/SPARK-29625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Sandish Kumar HN >Priority: Major > > Spark Structured Streaming Kafka resets offsets twice, once with the right > offsets and a second time with very old offsets > {code} > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-151 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-118 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-85 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 122677634. 
> [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-19 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 120504922.* > [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO ContextCleaner: Cleaned accumulator 810 > {code} > which is causing a Data loss issue. > {code} > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, > runId = 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > java.lang.IllegalStateException: Partition topic-52's offset was changed from > 122677598 to 120504922, some data may have been missed. > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have > been lost because they are not available in Kafka any more; either the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - data was aged out > by Kafka or the topic may have been deleted before all the data in the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - topic was > processed. If you don't want your streaming query to fail on such cases, set > the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - source option > "failOnDataLoss" to "false". 
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at >
[jira] [Updated] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29625: - Description: Spark Structure Streaming Kafka Reset Offset twice, once with right offsets and second time with very old offsets {code} [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-151 to offset 0. [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-118 to offset 0. [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-85 to offset 0. [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-52 to offset 122677634. [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-19 to offset 0. 
[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] Resetting offset for partition topic-52 to offset 120504922.* [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 INFO ContextCleaner: Cleaned accumulator 810 {code} which is causing a Data loss issue. {code} [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, runId = 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - java.lang.IllegalStateException: Partition topic-52's offset was changed from 122677598 to 120504922, some data may have been missed. [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have been lost because they are not available in Kafka any more; either the [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - data was aged out by Kafka or the topic may have been deleted before all the data in the [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - topic was processed. If you don't want your streaming query to fail on such cases, set the [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - source option "failOnDataLoss" to "false". 
[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at scala.collection.AbstractTraversable.filter(Traversable.scala:104) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:281) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$7.apply(StreamExecution.scala:614) [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO -at
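The error log above already names the escape hatch. A sketch of where that option goes on the reader (topic and broker address are placeholders, and an active SparkSession bound to `spark` is assumed); note that this setting skips the missed range rather than fixing the underlying offset regression:

```python
# Hypothetical Structured Streaming reader configuration; the topic and
# broker values are placeholders.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "kafka-topic-name")
      # Warn instead of failing the query when offsets go backwards or data
      # ages out of Kafka; missed data is silently skipped.
      .option("failOnDataLoss", "false")
      .load())
```

Use with care: the query keeps running, but the records between the old and new offsets are never processed.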
[jira] [Resolved] (SPARK-29627) array_contains should allow column instances in PySpark
[ https://issues.apache.org/jira/browse/SPARK-29627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29627. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26288 [https://github.com/apache/spark/pull/26288] > array_contains should allow column instances in PySpark > --- > > Key: SPARK-29627 > URL: https://issues.apache.org/jira/browse/SPARK-29627 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > The Scala API works well with column instances: > {code} > import org.apache.spark.sql.functions._ > val df = Seq(Array("a", "b", "c"), Array.empty[String]).toDF("data") > df.select(array_contains($"data", lit("a"))).collect() > {code} > {code} > Array[org.apache.spark.sql.Row] = Array([true], [false]) > {code} > However, the PySpark one doesn't seem to: > {code} > from pyspark.sql.functions import array_contains, lit > df = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data']) > df.select(array_contains(df.data, lit("a"))).show() > {code} > {code} > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "/.../spark/python/pyspark/sql/functions.py", line 1950, in > array_contains > return Column(sc._jvm.functions.array_contains(_to_java_column(col), value)) > File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 1277, in __call__ > File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 1241, in _build_args > File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 1228, in _get_args > File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_collections.py", > line 500, in convert > File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__ > raise TypeError("Column is not iterable") > TypeError: Column is not iterable > {code} > We should allow it as well.
[jira] [Resolved] (SPARK-29629) Support typed integer literal expression
[ https://issues.apache.org/jira/browse/SPARK-29629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29629. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26291 [https://github.com/apache/spark/pull/26291] > Support typed integer literal expression > > > Key: SPARK-29629 > URL: https://issues.apache.org/jira/browse/SPARK-29629 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > {code:java} > postgres=# select date '2001-09-28' + integer '7'; > ?column? > > 2001-10-05 > (1 row)postgres=# select integer '7'; > int4 > -- > 7 > (1 row) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29629) Support typed integer literal expression
[ https://issues.apache.org/jira/browse/SPARK-29629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-29629: Assignee: Kent Yao > Support typed integer literal expression > > > Key: SPARK-29629 > URL: https://issues.apache.org/jira/browse/SPARK-29629 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > {code:java} > postgres=# select date '2001-09-28' + integer '7'; > ?column? > > 2001-10-05 > (1 row)postgres=# select integer '7'; > int4 > -- > 7 > (1 row) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29638) Spark handles 'NaN' as 0 in sums
[ https://issues.apache.org/jira/browse/SPARK-29638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dylan Guedes updated SPARK-29638: - Description: Currently, Spark handles 'NaN' as 0 in window functions, such that 3+'NaN'=3. PgSQL, on the other hand, handles the entire result as 'NaN', as in 3+'NaN' = 'NaN' I experienced this with the query below: {code:sql} SELECT a, b, SUM(b) OVER(ORDER BY A ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) FROM (VALUES(1,1),(2,2),(3,(cast('nan' as int))),(4,3),(5,4)) t(a,b); {code} was:Currently, Spark handles 'NaN' as 0 in window functions, such that 3+'NaN'=3. PgSQL, on the other hand, handles the entire result as 'NaN', as in 3+'NaN' = 'NaN' > Spark handles 'NaN' as 0 in sums > > > Key: SPARK-29638 > URL: https://issues.apache.org/jira/browse/SPARK-29638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > Currently, Spark handles 'NaN' as 0 in window functions, such that 3+'NaN'=3. > PgSQL, on the other hand, handles the entire result as 'NaN', as in 3+'NaN' = > 'NaN' > I experienced this with the query below: > {code:sql} > SELECT a, b, >SUM(b) OVER(ORDER BY A ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) > FROM (VALUES(1,1),(2,2),(3,(cast('nan' as int))),(4,3),(5,4)) t(a,b); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
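For comparison, IEEE-754 floating-point arithmetic (which PostgreSQL follows for float columns) propagates NaN through a sum rather than treating it as 0. A plain-Python sketch of the behavior the report expects; note the quoted query casts 'nan' to int, a type that has no NaN value, so the floating-point case below is the meaningful one:

```python
import math

values = [1.0, 2.0, float("nan"), 3.0, 4.0]

# IEEE-754: any sum involving NaN is NaN, i.e. 3 + NaN = NaN.
total = sum(values)
assert math.isnan(total)

# Treating NaN as 0 (what the report says Spark's window sum does)
# yields a finite result instead.
total_ignoring_nan = sum(0.0 if math.isnan(v) else v for v in values)
print(total_ignoring_nan)  # 10.0
```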
[jira] [Issue Comment Deleted] (SPARK-29120) Add create_view.sql
[ https://issues.apache.org/jira/browse/SPARK-29120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Omer updated SPARK-29120: -- Comment: was deleted (was: I will work on this.) > Add create_view.sql > --- > > Key: SPARK-29120 > URL: https://issues.apache.org/jira/browse/SPARK-29120 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29636) Can't parse '11:00 BST' or '2000-10-19 10:23:54+01' signatures to timestamp
[ https://issues.apache.org/jira/browse/SPARK-29636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962542#comment-16962542 ] Aman Omer commented on SPARK-29636: --- Thanks [~DylanGuedes] . I am checking this one. > Can't parse '11:00 BST' or '2000-10-19 10:23:54+01' signatures to timestamp > --- > > Key: SPARK-29636 > URL: https://issues.apache.org/jira/browse/SPARK-29636 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > Currently, Spark can't parse a string such as '11:00 BST' or '2000-10-19 > 10:23:54+01' to timestamp: > {code:sql} > spark-sql> select cast ('11:00 BST' as timestamp); > NULL > Time taken: 2.248 seconds, Fetched 1 row(s) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
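For context, '+01' is PostgreSQL's abbreviated UTC offset; most parsers only accept the full '+01:00' form, so supporting it is largely a normalization step ('11:00 BST' is harder, since zone abbreviations are ambiguous). A plain-Python sketch with a hypothetical `normalize` helper, not part of any Spark API:

```python
import re
from datetime import datetime

def normalize(ts: str) -> str:
    # Expand a bare "+HH"/"-HH" offset to "+HH:00" so ISO parsers accept it.
    return re.sub(r"([+-]\d{2})$", r"\1:00", ts)

ts = normalize("2000-10-19 10:23:54+01")  # -> "2000-10-19 10:23:54+01:00"
parsed = datetime.fromisoformat(ts)
print(parsed.utcoffset())  # 1:00:00
```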
[jira] [Resolved] (SPARK-28746) Add repartitionby hint to support RepartitionByExpression
[ https://issues.apache.org/jira/browse/SPARK-28746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-28746. -- Fix Version/s: 3.0.0 Assignee: ulysses you Resolution: Fixed Resolved by https://github.com/apache/spark/pull/25464 > Add repartitionby hint to support RepartitionByExpression > - > > Key: SPARK-28746 > URL: https://issues.apache.org/jira/browse/SPARK-28746 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.0.0 > > > Currently, `RepartitionByExpression` is available through the Dataset method > `Dataset.repartition()`, but Spark SQL has no equivalent functionality. > In Hive, we can use `distribute by`, so it's worth adding a hint to support > such a function. > Similar jira: https://issues.apache.org/jira/browse/SPARK-24940
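With the hint merged, the SQL-side equivalent of `Dataset.repartition()` can be sketched as below (table and column names are illustrative; the column-argument form targets Spark 3.0, where this ticket was fixed):

```sql
-- Roughly equivalent to df.repartition(3, $"dept") from SQL:
SELECT /*+ REPARTITION(3, dept) */ name, dept, salary
FROM employees;

-- Hive-style alternative mentioned in the description:
SELECT name, dept, salary FROM employees DISTRIBUTE BY dept;
```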
[jira] [Commented] (SPARK-29592) ALTER TABLE (set partition location) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962531#comment-16962531 ] Terry Kim commented on SPARK-29592: --- OK. Thanks. > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands > -- > > Key: SPARK-29592 > URL: https://issues.apache.org/jira/browse/SPARK-29592 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Priority: Major > > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29592) ALTER TABLE (set partition location) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962527#comment-16962527 ] Ryan Blue commented on SPARK-29592: --- There is not currently a way to alter the partition spec for a table, so I don't think we need to worry about this for now. > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands > -- > > Key: SPARK-29592 > URL: https://issues.apache.org/jira/browse/SPARK-29592 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Priority: Major > > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29592) ALTER TABLE (set partition location) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962522#comment-16962522 ] Terry Kim commented on SPARK-29592: --- + [~rdblue] as well (for setting partition spec) > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands > -- > > Key: SPARK-29592 > URL: https://issues.apache.org/jira/browse/SPARK-29592 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Priority: Major > > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962515#comment-16962515 ] Takeshi Yamamuro commented on SPARK-27763: -- Thanks for the check, [~dongjoon] and [~yumwang]! OK, I'll file sub-tickets for the remaining ones later and make PRs. > Port test cases from PostgreSQL to Spark SQL > > > Key: SPARK-27763 > URL: https://issues.apache.org/jira/browse/SPARK-27763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > To improve the test coverage, we can port the regression tests from other > popular open source projects to Spark SQL. PostgreSQL is one of the best SQL > systems. Below are the links to the test cases and results. > * Regression test cases: > [https://github.com/postgres/postgres/tree/master/src/test/regress/sql] > * Expected results: > [https://github.com/postgres/postgres/tree/master/src/test/regress/expected] > Spark SQL does not support all the feature sets of PostgreSQL. In the current > stage, we should first comment out these test cases and create the > corresponding JIRAs in SPARK-27764. We can discuss and prioritize which > features we should support. These PostgreSQL regression tests could > also expose existing bugs in Spark SQL. We should also create the JIRAs > and track them in SPARK-27764.
[jira] [Resolved] (SPARK-29639) Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch
[ https://issues.apache.org/jira/browse/SPARK-29639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhinav Choudhury resolved SPARK-29639. --- Resolution: Abandoned Resolved due to a lack of sufficient info and steps to reproduce > Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch > > > Key: SPARK-29639 > URL: https://issues.apache.org/jira/browse/SPARK-29639 > Project: Spark > Issue Type: Bug > Components: Input/Output, Structured Streaming >Affects Versions: 2.4.0 >Reporter: Abhinav Choudhury >Priority: Major > > We have been running a Spark Structured Streaming job in production for more > than a week now. Put simply, it reads data from source Kafka topics (with 4 > partitions) and writes to another Kafka topic. Everything had been running > fine until the job started failing with the following error: > > {noformat} > Driver stacktrace: > === Streaming Query === > Identifier: MetricComputer [id = af26bab1-0d89-4766-934e-ad5752d6bb08, runId > = 613a21ad-86e3-4781-891b-17d92c18954a] > Current Committed Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: > {"kafka-topic-name": > {"2":10458347,"1":10460151,"3":10475678,"0":9809564} > }} > Current Available Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: > {"kafka-topic-name": > {"2":10458347,"1":10460151,"3":10475678,"0":10509527} > }} > Current State: ACTIVE > Thread State: RUNNABLE > <-- Removed Logical plan --> > Some data may have been lost because they are not available in Kafka any > more; either the > data was aged out by Kafka or the topic may have been deleted before all the > data in the > topic was processed. 
If you don't want your streaming query to fail on such > cases, set the > source option "failOnDataLoss" to "false".{noformat} > Configuration: > {noformat} > Spark 2.4.0 > Spark-sql-kafka 0.10{noformat} > Looking at the Spark structured streaming query progress logs, it seems like > the endOffsets computed for the next batch was actually smaller than the > starting offset: > *Microbatch Trigger 1:* > {noformat} > 2019/10/26 23:53:51 INFO utils.Logging[26]: 2019-10-26 23:53:51.767 : ( : > Query { > "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", > "runId" : "2d20d633-2768-446c-845b-893243361422", > "name" : "StreamingProcessorName", > "timestamp" : "2019-10-26T23:53:51.741Z", > "batchId" : 2145898, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0, > "durationMs" : { > "getEndOffset" : 0, > "setOffsetRange" : 9, > "triggerExecution" : 9 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[kafka-topic-name]]", > "startOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "endOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0 > } ], > "sink" : { > "description" : "ForeachBatchSink" > } > } in progress{noformat} > *Next micro batch trigger:* > {noformat} > 2019/10/26 23:53:53 INFO utils.Logging[26]: 2019-10-26 23:53:53.951 : ( : > Query { > "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", > "runId" : "2d20d633-2768-446c-845b-893243361422", > "name" : "StreamingProcessorName", > "timestamp" : "2019-10-26T23:53:52.907Z", > "batchId" : 2145898, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0, > "durationMs" : { > "addBatch" : 350, > "getBatch" : 4, > "getEndOffset" : 0, > "queryPlanning" : 102, > "setOffsetRange" : 24, > "triggerExecution" : 1043, > 
"walCommit" : 349 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[kafka-topic-name]]", > "startOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "endOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 9773098, > "0" : 10503762 > } > }, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0 > } ], > "sink" : { > "description" : "ForeachBatchSink" > } > } in progress{noformat} > Notice that for partition 3 of the kafka topic, the endOffsets are actually > smaller than the starting offsets! > Checked the HDFS checkpoint dir and the checkpointed offsets look fine and > point to the last committed offsets > Why is the
[jira] [Created] (SPARK-29642) ContinuousMemoryStream throws error on String type
Jungtaek Lim created SPARK-29642: Summary: ContinuousMemoryStream throws error on String type Key: SPARK-29642 URL: https://issues.apache.org/jira/browse/SPARK-29642 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Jungtaek Lim While we can set String as a generic type of ContinuousMemoryStream, it doesn't really work, because it doesn't convert String to UTF8String, and accessing it through the Row interface throws an error. We should encode the input and convert it to Row properly.
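For background, Spark's internal row format stores strings as `UTF8String`, not `java.lang.String`, which is why the raw value blows up on access. A sketch of the conversion involved (`UTF8String` is an internal API and subject to change):

```scala
import org.apache.spark.unsafe.types.UTF8String

// What the memory stream would need to do before placing the value
// into an internal row:
val internal: UTF8String = UTF8String.fromString("hello")

// And the reverse when exposing it through the external Row interface:
val external: String = internal.toString
```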
[jira] [Created] (SPARK-29641) Stage Level Sched: Add python api's and test
Thomas Graves created SPARK-29641: - Summary: Stage Level Sched: Add python api's and test Key: SPARK-29641 URL: https://issues.apache.org/jira/browse/SPARK-29641 Project: Spark Issue Type: Story Components: PySpark Affects Versions: 3.0.0 Reporter: Thomas Graves For the Stage Level Scheduling feature, add any missing Python APIs and test them.
[jira] [Updated] (SPARK-29497) Cannot assign instance of java.lang.invoke.SerializedLambda to field
[ https://issues.apache.org/jira/browse/SPARK-29497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rob Russo updated SPARK-29497: -- Description: Note this is for scala 2.12: There seems to be an issue in spark with serializing a udf that is created from a function assigned to a class member that references another function assigned to a class member. This is similar to https://issues.apache.org/jira/browse/SPARK-25047 but it looks like the resolution has an issue with this case. After trimming it down to the base issue I came up with the following to reproduce: {code:java} object TestLambdaShell extends Serializable { val hello: String => String = s => s"hello $s!" val lambdaTest: String => String = hello( _ ) def functionTest: String => String = hello( _ ) } val hello = udf( TestLambdaShell.hello ) val functionTest = udf( TestLambdaShell.functionTest ) val lambdaTest = udf( TestLambdaShell.lambdaTest ) sc.parallelize(Seq("world"),1).toDF("test").select(hello($"test")).show(1) sc.parallelize(Seq("world"),1).toDF("test").select(functionTest($"test")).show(1) sc.parallelize(Seq("world"),1).toDF("test").select(lambdaTest($"test")).show(1) {code} All of which works except the last line which results in an exception on the executors: {code:java} Caused by: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field $$$82b5b23cea489b2712a1db46c77e458w$TestLambdaShell$.lambdaTest of type scala.Function1 in instance of $$$82b5b23cea489b2712a1db46c77e458w$TestLambdaShell$ at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133) at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2251) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422) at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:488) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2136) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) at
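Notably, in the reproduction above the {{def}}-based {{functionTest}} serializes fine while the {{val}}-based {{lambdaTest}} does not. That suggests a workaround sketch (an assumption drawn from the repro, not a confirmed fix): expose member lambdas that reference other members through a {{def}} rather than a {{val}}:

{code:scala}
// Workaround sketch, assuming the def/val difference in the repro above
// is the deciding factor: a def returns a fresh Function1 on each call
// instead of storing one in a field of the singleton, so deserialization
// never has to assign a SerializedLambda to TestLambdaShell$.lambdaTest.
object TestLambdaShellFixed extends Serializable {
  val hello: String => String = s => s"hello $s!"
  // was: val lambdaTest: String => String = hello( _ )
  def lambdaTest: String => String = hello( _ )
}

val lambdaTestUdf = udf( TestLambdaShellFixed.lambdaTest )
sc.parallelize(Seq("world"), 1).toDF("test").select(lambdaTestUdf($"test")).show(1)
{code}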
[jira] [Commented] (SPARK-29561) Large Case Statement Code Generation OOM
[ https://issues.apache.org/jira/browse/SPARK-29561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962361#comment-16962361 ] Michael Chen commented on SPARK-29561: -- If I increase the memory, it will run into the generated code grows beyond 64 KB exception and disable whole stage code generation for the plan. So that is ok. But if I increase the number of branches/complexity of the branches, it will just run into the OOM problem again. > Large Case Statement Code Generation OOM > > > Key: SPARK-29561 > URL: https://issues.apache.org/jira/browse/SPARK-29561 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michael Chen >Priority: Major > Attachments: apacheSparkCase.sql > > > Spark Configuration > spark.driver.memory = 1g > spark.master = "local" > spark.deploy.mode = "client" > Try to execute a case statement with 3000+ branches. Added SQL statement as > an attachment. > Spark runs for a while before it OOMs > {noformat} > java.lang.OutOfMemoryError: GC overhead limit exceeded > at > org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:182) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320) > at > org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178) > at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73) > 19/10/22 16:19:54 ERROR FileFormatWriter: Aborting job null. 
> java.lang.OutOfMemoryError: GC overhead limit exceeded > at java.util.HashMap.newNode(HashMap.java:1750) > at java.util.HashMap.putVal(HashMap.java:631) > at java.util.HashMap.putMapEntries(HashMap.java:515) > at java.util.HashMap.putAll(HashMap.java:785) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3345) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$8.visitLocalVariableDeclarationStatement(UnitCompiler.java:3230) > at > org.codehaus.janino.UnitCompiler$8.visitLocalVariableDeclarationStatement(UnitCompiler.java:3198) > at > org.codehaus.janino.Java$LocalVariableDeclarationStatement.accept(Java.java:3351) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3197) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3254) > at org.codehaus.janino.UnitCompiler.access$3900(UnitCompiler.java:212) > at org.codehaus.janino.UnitCompiler$8.visitBlock(UnitCompiler.java:3216) > at org.codehaus.janino.UnitCompiler$8.visitBlock(UnitCompiler.java:3198) > at org.codehaus.janino.Java$Block.accept(Java.java:2756) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3197) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3260) > at org.codehaus.janino.UnitCompiler.access$4000(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$8.visitDoStatement(UnitCompiler.java:3217) > at > org.codehaus.janino.UnitCompiler$8.visitDoStatement(UnitCompiler.java:3198) > at org.codehaus.janino.Java$DoStatement.accept(Java.java:3304) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3197) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3186) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3009) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336) > at > 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958) > at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:393) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:385) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1286) > 19/10/22 16:19:54 ERROR Utils: throw uncaught fatal error in thread Spark > Context Cleaner > java.lang.OutOfMemoryError: GC overhead limit exceeded > at > org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:182) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320) > at > org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178) > at >
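Since the OOM occurs while Janino compiles the generated method for the huge CASE expression, one mitigation to try (a sketch only; it trades codegen throughput for compiler memory and does not address the root cause) is disabling whole-stage code generation for the session:

{code:scala}
// Mitigation sketch: fall back to the interpreted execution path so the
// 3000+-branch CASE statement is never handed to Janino for compilation.
// "largeCaseSql" is a placeholder for the statement in apacheSparkCase.sql.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.sql(largeCaseSql).show()
{code}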
[jira] [Comment Edited] (SPARK-29561) Large Case Statement Code Generation OOM
[ https://issues.apache.org/jira/browse/SPARK-29561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962361#comment-16962361 ] Michael Chen edited comment on SPARK-29561 at 10/29/19 7:59 PM: If I increase the memory, it will run into the generated code grows beyond 64 KB exception and disable whole stage code generation for the plan. But if I increase the number of branches/complexity of the branches, it will just run into the OOM problem again. was (Author: mikechen): If I increase the memory, it will run into the generated code grows beyond 64 KB exception and disable whole stage code generation for the plan. So that is ok. But if I increase the number of branches/complexity of the branches, it will just run into the OOM problem again.
[jira] [Comment Edited] (SPARK-29561) Large Case Statement Code Generation OOM
[ https://issues.apache.org/jira/browse/SPARK-29561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962361#comment-16962361 ] Michael Chen edited comment on SPARK-29561 at 10/29/19 7:59 PM: Yes it works when I increase the driver memory. It just runs into the generated code grows beyond 64 KB exception and then disables whole stage code generation for the plan. But if I increase the number of branches/complexity of the branches, it will just run into the OOM problem again. was (Author: mikechen): If I increase the memory, it will run into the generated code grows beyond 64 KB exception and disable whole stage code generation for the plan. But if I increase the number of branches/complexity of the branches, it will just run into the OOM problem again.
[jira] [Commented] (SPARK-29592) ALTER TABLE (set partition location) should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962333#comment-16962333 ] Terry Kim commented on SPARK-29592: --- [~cloud_fan], is the spec for setting partitions finalized in the v2 catalog? For example, the location can be changed as follows: {code:scala} val changes = Seq(TableChange.setProperty("location", newLoc)) createAlterTable(nameParts, catalog, tableName, changes) {code} Can partitions also be changed via `TableChange`? > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands > -- > > Key: SPARK-29592 > URL: https://issues.apache.org/jira/browse/SPARK-29592 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Priority: Major > > ALTER TABLE (set partition location) should look up catalog/table like v2 > commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29637) SHS Endpoint /applications//jobs/ doesn't include description
[ https://issues.apache.org/jira/browse/SPARK-29637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29637. Fix Version/s: 3.0.0 2.4.5 Resolution: Fixed Issue resolved by pull request 26295 [https://github.com/apache/spark/pull/26295] > SHS Endpoint /applications//jobs/ doesn't include description > - > > Key: SPARK-29637 > URL: https://issues.apache.org/jira/browse/SPARK-29637 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Minor > Fix For: 2.4.5, 3.0.0 > > > Starting from Spark 2.3, the SHS REST API endpoint > /applications//jobs/ does not include the description in the returned > JobData. Up to Spark 2.2 the description was included. > Steps to reproduce: > * Open spark-shell > {code:scala} > scala> sc.setJobGroup("test", "job", false); > scala> val foo = sc.textFile("/user/foo.txt"); > foo: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at > textFile at :24 > scala> foo.foreach(println); > {code} > * Access the REST API > [http://SHS-host:port/api/v1/applications/|http://shs-host:port/api/v1/applications/]/jobs/ > * The REST API of Spark 2.3 and above will not return the description -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29637) SHS Endpoint /applications//jobs/ doesn't include description
[ https://issues.apache.org/jira/browse/SPARK-29637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-29637: -- Assignee: Gabor Somogyi > SHS Endpoint /applications//jobs/ doesn't include description > - > > Key: SPARK-29637 > URL: https://issues.apache.org/jira/browse/SPARK-29637 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29640) [K8S] Make it possible to set DNS option to use TCP instead of UDP
Andy Grove created SPARK-29640: -- Summary: [K8S] Make it possible to set DNS option to use TCP instead of UDP Key: SPARK-29640 URL: https://issues.apache.org/jira/browse/SPARK-29640 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.4 Reporter: Andy Grove Fix For: 2.4.5 We are running into intermittent DNS issues where the Spark driver fails to resolve "kubernetes.default.svc", which seems to be caused by [https://github.com/kubernetes/kubernetes/issues/76790] One suggested workaround is to specify TCP mode for DNS lookups in the pod spec ([https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-424498508]). I would like the ability to provide a flag to spark-submit to specify TCP mode for DNS lookups. I am working on a PR for this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27676) InMemoryFileIndex should hard-fail on missing files instead of logging and continuing
[ https://issues.apache.org/jira/browse/SPARK-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962255#comment-16962255 ] Steve Loughran commented on SPARK-27676: 1. Well, see if you can create a backport PR and get it through. 2. If you are working on AWS with an S3 store that does not have a consistency layer on top of it (consistent EMR, S3Guard) and you are relying on directory listings to enumerate the content of your tables, you are doomed. Seriously. In particular, if you're using the commit-by-rename committers, then task and job commit may miss newly created files. Which means that if you are seeing the problem discussed here, it may be a symptom of a larger issue. > InMemoryFileIndex should hard-fail on missing files instead of logging and > continuing > - > > Key: SPARK-27676 > URL: https://issues.apache.org/jira/browse/SPARK-27676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Fix For: 3.0.0 > > > Spark's {{InMemoryFileIndex}} contains two places where {{FileNotFound}} > exceptions are caught and logged as warnings (during [directory > listing|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274] > and [block location > lookup|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L333]). > I think that this is a dangerous default behavior and would prefer that > Spark hard-fails by default (with the ignore-and-continue behavior guarded by > a SQL session configuration). > In SPARK-17599 and SPARK-24364, logic was added to ignore missing files. > Quoting from the PR for SPARK-17599: > {quote}The {{ListingFileCatalog}} lists files given a set of resolved paths. 
> If a folder is deleted at any time between the paths were resolved and the > file catalog can check for the folder, the Spark job fails. This may abruptly > stop long running StructuredStreaming jobs for example. > Folders may be deleted by users or automatically by retention policies. These > cases should not prevent jobs from successfully completing. > {quote} > Let's say that I'm *not* expecting to ever delete input files for my job. In > that case, this behavior can mask bugs. > One straightforward masked bug class is accidental file deletion: if I'm > never expecting to delete files then I'd prefer to fail my job if Spark sees > deleted files. > A more subtle bug can occur when using a S3 filesystem. Say I'm running a > Spark job against a partitioned Parquet dataset which is laid out like this: > {code:java} > data/ > date=1/ > region=west/ >0.parquet >1.parquet > region=east/ >0.parquet >1.parquet{code} > If I do {{spark.read.parquet("/data/date=1/")}} then Spark needs to perform > multiple rounds of file listing, first listing {{/data/date=1}} to discover > the partitions for that date, then listing within each partition to discover > the leaf files. Due to the eventual consistency of S3 ListObjects, it's > possible that the first listing will show the {{region=west}} and > {{region=east}} partitions existing and then the next-level listing fails to > return any for some of the directories (e.g. {{/data/date=1/}} returns files > but {{/data/date=1/region=west/}} throws a {{FileNotFoundException}} in S3A > due to ListObjects inconsistency). > If Spark propagated the {{FileNotFoundException}} and hard-failed in this > case then I'd be able to fail the job in this case where we _definitely_ know > that the S3 listing is inconsistent (failing here doesn't guard against _all_ > potential S3 list inconsistency issues (e.g. 
back-to-back listings which both > return a subset of the true set of objects), but I think it's still an > improvement to fail for the subset of cases that we _can_ detect even if > that's not a surefire failsafe against the more general problem). > Finally, I'm unsure if the original patch will have the desired effect: if a > file is deleted once a Spark job expects to read it then that can cause > problems at multiple layers, both in the driver (multiple rounds of file > listing) and in executors (if the deletion occurs after the construction of > the catalog but before the scheduling of the read tasks); I think the > original patch only resolved the problem for the driver (unless I'm missing > similar executor-side code specific to the original streaming use-case). > Given all of these reasons, I think that the "ignore potentially deleted
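The ignore-and-continue behavior the comment proposes guarding is already exposed as a session configuration; a minimal sketch of the toggle (config name per Spark's SQL options, defaults hedged by version):

{code:scala}
// Sketch: jobs that genuinely expect files to disappear (e.g. retention
// policies deleting aged-out partitions) opt in explicitly...
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

// ...while jobs that never expect deletions leave it false, so a
// FileNotFoundException fails the job instead of silently dropping data.
val df = spark.read.parquet("/data/date=1/") // layout from the example above
{code}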
[jira] [Assigned] (SPARK-29607) Move static methods from CalendarInterval to IntervalUtils
[ https://issues.apache.org/jira/browse/SPARK-29607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29607: --- Assignee: Maxim Gekk > Move static methods from CalendarInterval to IntervalUtils > -- > > Key: SPARK-29607 > URL: https://issues.apache.org/jira/browse/SPARK-29607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Move static methods from the CalendarInterval class to the helper object > IntervalUtils. Need to rewrite Java code to Scala code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29607) Move static methods from CalendarInterval to IntervalUtils
[ https://issues.apache.org/jira/browse/SPARK-29607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29607. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26261 [https://github.com/apache/spark/pull/26261] > Move static methods from CalendarInterval to IntervalUtils > -- > > Key: SPARK-29607 > URL: https://issues.apache.org/jira/browse/SPARK-29607 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Move static methods from the CalendarInterval class to the helper object > IntervalUtils. Need to rewrite Java code to Scala code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962248#comment-16962248 ] Dongjoon Hyun commented on SPARK-27763: --- +1, [~maropu]! > Port test cases from PostgreSQL to Spark SQL > > > Key: SPARK-27763 > URL: https://issues.apache.org/jira/browse/SPARK-27763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > To improve the test coverage, we can port the regression tests from the other > popular open source projects to Spark SQL. PostgreSQL is one of the best SQL > systems. Below are the links to the test cases and results. > * Regression test cases: > [https://github.com/postgres/postgres/tree/master/src/test/regress/sql] > * Expected results: > [https://github.com/postgres/postgres/tree/master/src/test/regress/expected] > Spark SQL does not support all the feature sets of PostgreSQL. In the current > stage, we should first comment out these test cases and create the > corresponding JIRAs in SPARK-27764. We can discuss and prioritize which > features we should support. Also, these PostgreSQL regression tests could > also expose the existing bugs of Spark SQL. We should also create the JIRAs > and track them in SPARK-27764. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28938) Move to supported OpenJDK docker image for Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-28938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962243#comment-16962243 ] Dongjoon Hyun commented on SPARK-28938: --- Thank you, [~jkleckner]. It landed to `branch-2.4`, too. > Move to supported OpenJDK docker image for Kubernetes > - > > Key: SPARK-28938 > URL: https://issues.apache.org/jira/browse/SPARK-28938 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.4, 3.0.0 > Environment: Kubernetes >Reporter: Rodney Aaron Stainback >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 2.4.5, 3.0.0 > > Attachments: cve-spark-py.txt, cve-spark-r.txt, cve-spark.txt, > twistlock.txt > > > The current docker image used by Kubernetes > {code:java} > openjdk:8-alpine{code} > is not supported > [https://github.com/docker-library/docs/blob/master/openjdk/README.md#supported-tags-and-respective-dockerfile-links] > It was removed with this commit > [https://github.com/docker-library/openjdk/commit/3eb0351b208d739fac35345c85e3c6237c2114ec#diff-f95ffa3d134732c33f7b8368e099] > Quote from commit "4. no more OpenJDK 8 Alpine images (Alpine/musl is not > officially supported by the OpenJDK project, so this reflects that -- see > "Project Portola" for the Alpine porting efforts which I understand are still > in need of help)" > > Please move to a supported image for Kubernetes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29625) Spark Structure Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962228#comment-16962228 ] Sandish Kumar HN commented on SPARK-29625: -- [~tdas] It would be very helpful if you could review this. > Spark Structure Streaming Kafka Wrong Reset Offset twice > > > Key: SPARK-29625 > URL: https://issues.apache.org/jira/browse/SPARK-29625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Sandish Kumar HN >Priority: Major > > Spark Structure Streaming Kafka Reset Offset twice, once with right offsets > and second time with very old offsets > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-151 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-118 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-85 to offset 0. > ** > *[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 122677634.* > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-19 to offset 0. 
> ** > *[2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 120504922.* > [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO ContextCleaner: Cleaned accumulator 810 > which is causing a Data loss issue. > > {{[2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, > runId = 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > java.lang.IllegalStateException: Partition topic-52's offset was changed from > 122677598 to 120504922, some data may have been missed. > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have > been lost because they are not available in Kafka any more; either the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - data was aged out > by Kafka or the topic may have been deleted before all the data in the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - topic was > processed. If you don't want your streaming query to fail on such cases, set > the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - source option > "failOnDataLoss" to "false". 
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filter(TraversableLike.scala:259) > [2019-10-28 19:27:40,351]
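The failure above fires when the offset Kafka reports after a reset is older than the offset the query already committed (122677598 -> 120504922 for topic-52). A hypothetical pure-Python sketch of that sanity check — the function name, arguments, and `fail_on_data_loss` flag are illustrative, not Spark's actual `KafkaSource` API:

```python
# Illustrative sketch (not Spark source code): flag partitions whose
# fetched offset moved backwards relative to the committed offset, the
# condition that triggers the IllegalStateException in the log above.
def check_offsets(committed, fetched, fail_on_data_loss=True):
    """Return {partition: (committed, fetched)} for regressed partitions.

    `committed` and `fetched` map partition name -> offset, e.g.
    {"topic-52": 122677598}.
    """
    regressed = {}
    for partition, committed_offset in committed.items():
        fetched_offset = fetched.get(partition, committed_offset)
        if fetched_offset < committed_offset:
            regressed[partition] = (committed_offset, fetched_offset)
    if regressed and fail_on_data_loss:
        details = ", ".join(
            f"{p}: {old} -> {new}" for p, (old, new) in regressed.items()
        )
        raise RuntimeError(
            f"Partition offsets moved backwards ({details}); "
            "some data may have been missed."
        )
    return regressed
```

Setting `fail_on_data_loss=False` mirrors the `failOnDataLoss=false` source option mentioned in the error: the regression is reported but the query is allowed to continue.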
[jira] [Updated] (SPARK-29639) Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch
[ https://issues.apache.org/jira/browse/SPARK-29639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhinav Choudhury updated SPARK-29639: -- Description: We have been running a Spark structured job on production for more than a week now. Put simply, it reads data from source Kafka topics (with 4 partitions) and writes to another kafka topic. Everything has been running fine until the job started failing with the following error: {noformat} Driver stacktrace: === Streaming Query === Identifier: MetricComputer [id = af26bab1-0d89-4766-934e-ad5752d6bb08, runId = 613a21ad-86e3-4781-891b-17d92c18954a] Current Committed Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: {"kafka-topic-name": {"2":10458347,"1":10460151,"3":10475678,"0":9809564} }} Current Available Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: {"kafka-topic-name": {"2":10458347,"1":10460151,"3":10475678,"0":10509527} }} Current State: ACTIVE Thread State: RUNNABLE <-- Removed Logical plan --> Some data may have been lost because they are not available in Kafka any more; either the data was aged out by Kafka or the topic may have been deleted before all the data in the topic was processed. 
If you don't want your streaming query to fail on such cases, set the source option "failOnDataLoss" to "false".{noformat} Configuration: {noformat} Spark 2.4.0 Spark-sql-kafka 0.10{noformat} Looking at the Spark structured streaming query progress logs, it seems like the endOffsets computed for the next batch was actually smaller than the starting offset: *Microbatch Trigger 1:* {noformat} 2019/10/26 23:53:51 INFO utils.Logging[26]: 2019-10-26 23:53:51.767 : ( : Query { "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", "runId" : "2d20d633-2768-446c-845b-893243361422", "name" : "StreamingProcessorName", "timestamp" : "2019-10-26T23:53:51.741Z", "batchId" : 2145898, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0, "durationMs" : { "getEndOffset" : 0, "setOffsetRange" : 9, "triggerExecution" : 9 }, "stateOperators" : [ ], "sources" : [ { "description" : "KafkaV2[Subscribe[kafka-topic-name]]", "startOffset" : { "kafka-topic-name" : { "2" : 10452513, "1" : 10454326, "3" : 10469196, "0" : 10503762 } }, "endOffset" : { "kafka-topic-name" : { "2" : 10452513, "1" : 10454326, "3" : 10469196, "0" : 10503762 } }, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0 } ], "sink" : { "description" : "ForeachBatchSink" } } in progress{noformat} *Next micro batch trigger:* {noformat} 2019/10/26 23:53:53 INFO utils.Logging[26]: 2019-10-26 23:53:53.951 : ( : Query { "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", "runId" : "2d20d633-2768-446c-845b-893243361422", "name" : "StreamingProcessorName", "timestamp" : "2019-10-26T23:53:52.907Z", "batchId" : 2145898, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0, "durationMs" : { "addBatch" : 350, "getBatch" : 4, "getEndOffset" : 0, "queryPlanning" : 102, "setOffsetRange" : 24, "triggerExecution" : 1043, "walCommit" : 349 }, "stateOperators" : [ ], "sources" : [ { "description" : "KafkaV2[Subscribe[kafka-topic-name]]", "startOffset" : { "kafka-topic-name" : 
{ "2" : 10452513, "1" : 10454326, "3" : 10469196, "0" : 10503762 } }, "endOffset" : { "kafka-topic-name" : { "2" : 10452513, "1" : 10454326, "3" : 9773098, "0" : 10503762 } }, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0 } ], "sink" : { "description" : "ForeachBatchSink" } } in progress{noformat} Notice that for partition 3 of the kafka topic, the endOffsets are actually smaller than the starting offsets! Checked the HDFS checkpoint dir and the checkpointed offsets look fine and point to the last committed offsets Why is the end offset for a partition being computed to a smaller value? was: We have been running a Spark structured job on production for more than a week now. Put simply, it reads data from source Kafka topics (with 4 partitions) and writes to another kafka topic. Everything has been running fine until the job started failing with the following error: {noformat} Driver stacktrace: === Streaming Query === Identifier: MetricComputer [id = af26bab1-0d89-4766-934e-ad5752d6bb08, runId = 613a21ad-86e3-4781-891b-17d92c18954a] Current Committed Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: {"kafka-topic-name": {"2":10458347,"1":10460151,"3":10475678,"0":9809564} }} Current Available Offsets: {KafkaV2[Subscribe[kf-adsk-inquirystep]]: {"kf-adsk-inquirystep":
[jira] [Updated] (SPARK-29639) Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch
[ https://issues.apache.org/jira/browse/SPARK-29639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhinav Choudhury updated SPARK-29639: -- Target Version/s: 2.4.4, 2.4.5 > Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch > > > Key: SPARK-29639 > URL: https://issues.apache.org/jira/browse/SPARK-29639 > Project: Spark > Issue Type: Bug > Components: Input/Output, Structured Streaming >Affects Versions: 2.4.0 >Reporter: Abhinav Choudhury >Priority: Major > > We have been running a Spark structured job on production for more than a > week now. Put simply, it reads data from source Kafka topics (with 4 > partitions) and writes to another kafka topic. Everything has been running > fine until the job started failing with the following error: > > {noformat} > Driver stacktrace: > === Streaming Query === > Identifier: MetricComputer [id = af26bab1-0d89-4766-934e-ad5752d6bb08, runId > = 613a21ad-86e3-4781-891b-17d92c18954a] > Current Committed Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: > {"kafka-topic-name": > {"2":10458347,"1":10460151,"3":10475678,"0":9809564} > }} > Current Available Offsets: {KafkaV2[Subscribe[kf-adsk-inquirystep]]: > {"kf-adsk-inquirystep": > {"2":10458347,"1":10460151,"3":10475678,"0":10509527} > }} > Current State: ACTIVE > Thread State: RUNNABLE > <-- Removed Logical plan --> > Some data may have been lost because they are not available in Kafka any > more; either the > data was aged out by Kafka or the topic may have been deleted before all the > data in the > topic was processed. 
If you don't want your streaming query to fail on such > cases, set the > source option "failOnDataLoss" to "false".{noformat} > Configuration: > {noformat} > Spark 2.4.0 > Spark-sql-kafka 0.10{noformat} > Looking at the Spark structured streaming query progress logs, it seems like > the endOffsets computed for the next batch was actually smaller than the > starting offset: > *Microbatch Trigger 1:* > {noformat} > 2019/10/26 23:53:51 INFO utils.Logging[26]: 2019-10-26 23:53:51.767 : ( : > Query { > "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", > "runId" : "2d20d633-2768-446c-845b-893243361422", > "name" : "StreamingProcessorName", > "timestamp" : "2019-10-26T23:53:51.741Z", > "batchId" : 2145898, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0, > "durationMs" : { > "getEndOffset" : 0, > "setOffsetRange" : 9, > "triggerExecution" : 9 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[kafka-topic-name]]", > "startOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "endOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0 > } ], > "sink" : { > "description" : "ForeachBatchSink" > } > } in progress{noformat} > *Next micro batch trigger:* > {noformat} > 2019/10/26 23:53:53 INFO utils.Logging[26]: 2019-10-26 23:53:53.951 : ( : > Query { > "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", > "runId" : "2d20d633-2768-446c-845b-893243361422", > "name" : "StreamingProcessorName", > "timestamp" : "2019-10-26T23:53:52.907Z", > "batchId" : 2145898, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0, > "durationMs" : { > "addBatch" : 350, > "getBatch" : 4, > "getEndOffset" : 0, > "queryPlanning" : 102, > "setOffsetRange" : 24, > "triggerExecution" : 1043, > 
"walCommit" : 349 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[kafka-topic-name]]", > "startOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "endOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 9773098, > "0" : 10503762 > } > }, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0 > } ], > "sink" : { > "description" : "ForeachBatchSink" > } > } in progress{noformat} > Notice that for partition 3 of the kafka topic, the endOffsets are actually > smaller than the starting offsets! > Checked the HDFS checkpoint dir and the checkpointed offsets look fine and > point to the last committed offsets > Why is the end offset for a partition being computed to a
[jira] [Created] (SPARK-29639) Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch
Abhinav Choudhury created SPARK-29639: - Summary: Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch Key: SPARK-29639 URL: https://issues.apache.org/jira/browse/SPARK-29639 Project: Spark Issue Type: Bug Components: Input/Output, Structured Streaming Affects Versions: 2.4.0 Reporter: Abhinav Choudhury We have been running a Spark structured job on production for more than a week now. Put simply, it reads data from source Kafka topics (with 4 partitions) and writes to another kafka topic. Everything has been running fine until the job started failing with the following error: {noformat} Driver stacktrace: === Streaming Query === Identifier: MetricComputer [id = af26bab1-0d89-4766-934e-ad5752d6bb08, runId = 613a21ad-86e3-4781-891b-17d92c18954a] Current Committed Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: {"kafka-topic-name": {"2":10458347,"1":10460151,"3":10475678,"0":9809564} }} Current Available Offsets: {KafkaV2[Subscribe[kf-adsk-inquirystep]]: {"kf-adsk-inquirystep": {"2":10458347,"1":10460151,"3":10475678,"0":10509527} }} Current State: ACTIVE Thread State: RUNNABLE <-- Removed Logical plan --> Some data may have been lost because they are not available in Kafka any more; either the data was aged out by Kafka or the topic may have been deleted before all the data in the topic was processed. 
If you don't want your streaming query to fail on such cases, set the source option "failOnDataLoss" to "false".{noformat} Configuration: {noformat} Spark 2.4.0 Spark-sql-kafka 0.10{noformat} Looking at the Spark structured streaming query progress logs, it seems like the endOffsets computed for the next batch was actually smaller than the starting offset: *Microbatch Trigger 1:* {noformat} 2019/10/26 23:53:51 INFO utils.Logging[26]: 2019-10-26 23:53:51.767 : ( : Query { "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", "runId" : "2d20d633-2768-446c-845b-893243361422", "name" : "StreamingProcessorName", "timestamp" : "2019-10-26T23:53:51.741Z", "batchId" : 2145898, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0, "durationMs" : { "getEndOffset" : 0, "setOffsetRange" : 9, "triggerExecution" : 9 }, "stateOperators" : [ ], "sources" : [ { "description" : "KafkaV2[Subscribe[kafka-topic-name]]", "startOffset" : { "kafka-topic-name" : { "2" : 10452513, "1" : 10454326, "3" : 10469196, "0" : 10503762 } }, "endOffset" : { "kafka-topic-name" : { "2" : 10452513, "1" : 10454326, "3" : 10469196, "0" : 10503762 } }, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0 } ], "sink" : { "description" : "ForeachBatchSink" } } in progress{noformat} *Next micro batch trigger:* {noformat} 2019/10/26 23:53:53 INFO utils.Logging[26]: 2019-10-26 23:53:53.951 : ( : Query { "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", "runId" : "2d20d633-2768-446c-845b-893243361422", "name" : "StreamingProcessorName", "timestamp" : "2019-10-26T23:53:52.907Z", "batchId" : 2145898, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0, "durationMs" : { "addBatch" : 350, "getBatch" : 4, "getEndOffset" : 0, "queryPlanning" : 102, "setOffsetRange" : 24, "triggerExecution" : 1043, "walCommit" : 349 }, "stateOperators" : [ ], "sources" : [ { "description" : "KafkaV2[Subscribe[kafka-topic-name]]", "startOffset" : { "kafka-topic-name" : 
{ "2" : 10452513, "1" : 10454326, "3" : 10469196, "0" : 10503762 } }, "endOffset" : { "kafka-topic-name" : { "2" : 10452513, "1" : 10454326, "3" : 9773098, "0" : 10503762 } }, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "processedRowsPerSecond" : 0.0 } ], "sink" : { "description" : "ForeachBatchSink" } } in progress{noformat} Notice that for partition 3 of the kafka topic, the endOffsets are actually smaller than the starting offsets! Checked the HDFS checkpoint dir and the checkpointed offsets look fine and point to the last committed offsets Why is the end offset for a partition being computed to a smaller value? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28938) Move to supported OpenJDK docker image for Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-28938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962131#comment-16962131 ] Jim Kleckner commented on SPARK-28938: -- I created a minor one-line patch fix for this here. Let me know if it needs a different story. [https://github.com/apache/spark/pull/26296] > Move to supported OpenJDK docker image for Kubernetes > - > > Key: SPARK-28938 > URL: https://issues.apache.org/jira/browse/SPARK-28938 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.4, 3.0.0 > Environment: Kubernetes >Reporter: Rodney Aaron Stainback >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 2.4.5, 3.0.0 > > Attachments: cve-spark-py.txt, cve-spark-r.txt, cve-spark.txt, > twistlock.txt > > > The current docker image used by Kubernetes > {code:java} > openjdk:8-alpine{code} > is not supported > [https://github.com/docker-library/docs/blob/master/openjdk/README.md#supported-tags-and-respective-dockerfile-links] > It was removed with this commit > [https://github.com/docker-library/openjdk/commit/3eb0351b208d739fac35345c85e3c6237c2114ec#diff-f95ffa3d134732c33f7b8368e099] > Quote from commit "4. no more OpenJDK 8 Alpine images (Alpine/musl is not > officially supported by the OpenJDK project, so this reflects that -- see > "Project Portola" for the Alpine porting efforts which I understand are still > in need of help)" > > Please move to a supported image for Kubernetes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29595) Insertion with named_struct should match by name
[ https://issues.apache.org/jira/browse/SPARK-29595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962096#comment-16962096 ] Aman Omer commented on SPARK-29595: --- Thanks [~srowen] . Kindly check PR [https://github.com/apache/spark/pull/26293] . > Insertion with named_struct should match by name > > > Key: SPARK-29595 > URL: https://issues.apache.org/jira/browse/SPARK-29595 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > {code:java} > spark-sql> create table str using parquet as(select named_struct('a', 1, 'b', > 2) as data); > spark-sql> insert into str values named_struct("b", 3, "a", 1); > spark-sql> select * from str; > {"a":3,"b":1} > {"a":1,"b":2} > {code} > The result should be > {code:java} > {"a":1,"b":3} > {"a":1,"b":2} > {code} > Spark should match the field names of named_struct on insertion -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
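The by-name matching the issue asks for can be sketched outside Spark: instead of binding `named_struct("b", 3, "a", 1)` positionally onto schema `(a, b)`, reorder the pairs by field name. Pure-Python illustration; the function is hypothetical, not Spark code:

```python
# Illustrative sketch of by-name struct field matching: reorder a
# named_struct's (field, value) pairs to the target schema's field
# order rather than binding them by position.
def match_by_name(target_fields, named_struct_pairs):
    values = dict(named_struct_pairs)
    missing = [f for f in target_fields if f not in values]
    if missing:
        raise ValueError(f"missing struct fields: {missing}")
    return {f: values[f] for f in target_fields}
```

With this matching, inserting `named_struct("b", 3, "a", 1)` into schema `(a, b)` yields `{"a": 1, "b": 3}`, the result the issue says Spark should produce.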
[jira] [Issue Comment Deleted] (SPARK-29595) Insertion with named_struct should match by name
[ https://issues.apache.org/jira/browse/SPARK-29595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Omer updated SPARK-29595: -- Comment: was deleted (was: Raised a PR [https://github.com/apache/spark/pull/26293] . Kindly review. Thanks.) > Insertion with named_struct should match by name > > > Key: SPARK-29595 > URL: https://issues.apache.org/jira/browse/SPARK-29595 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > {code:java} > spark-sql> create table str using parquet as(select named_struct('a', 1, 'b', > 2) as data); > spark-sql> insert into str values named_struct("b", 3, "a", 1); > spark-sql> select * from str; > {"a":3,"b":1} > {"a":1,"b":2} > {code} > The result should be > {code:java} > {"a":1,"b":3} > {"a":1,"b":2} > {code} > Spark should match the field names of named_struct on insertion -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28478) Optimizer rule to remove unnecessary explicit null checks for null-intolerant expressions (e.g. if(x is null, x, f(x)))
[ https://issues.apache.org/jira/browse/SPARK-28478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962029#comment-16962029 ] David Vrba commented on SPARK-28478: If no one is working on this, i will take this one. > Optimizer rule to remove unnecessary explicit null checks for null-intolerant > expressions (e.g. if(x is null, x, f(x))) > --- > > Key: SPARK-28478 > URL: https://issues.apache.org/jira/browse/SPARK-28478 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Major > > I ran across a family of expressions like > {code:java} > if(x is null, x, substring(x, 0, 1024)){code} > or > {code:java} > when($"x".isNull, $"x", substring($"x", 0, 1024)){code} > that were written this way because the query author was unsure about whether > {{substring}} would return {{null}} when its input string argument is null. > This explicit null-handling is unnecessary and adds bloat to the generated > code, especially if it's done via a {{CASE}} statement (which compiles down > to a {{do-while}} loop). > In another case I saw a query compiler which automatically generated this > type of code. > It would be cool if Spark could automatically optimize such queries to remove > these redundant null checks. Here's a sketch of what such a rule might look > like (assuming that SPARK-28477 has been implement so we only need to worry > about the {{IF}} case): > * In the pattern match, check the following three conditions in the > following order (to benefit from short-circuiting) > ** The {{IF}} condition is an explicit null-check of a column {{c}} > ** The {{true}} expression returns either {{c}} or {{null}} > ** The {{false}} expression is a _null-intolerant_ expression with {{c}} as > a _direct_ child. > * If this condition matches, replace the entire {{If}} with the {{false}} > branch's expression.. 
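The three-step pattern match sketched in the issue can be modeled on a toy expression tree. This is a hypothetical illustration (plain Python dataclasses, not Catalyst), with `Substring` standing in for any null-intolerant expression f(c):

```python
# Toy expression tree sketching the proposed rule:
# If(IsNull(c), c-or-null, f(c)) -> f(c) when f is null-intolerant,
# since a null-intolerant f(c) already returns null for a null c.
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str

@dataclass(frozen=True)
class IsNull:
    child: object

@dataclass(frozen=True)
class Null:
    pass

@dataclass(frozen=True)
class Substring:  # stand-in for any null-intolerant expression f(c)
    child: object
    start: int
    length: int

@dataclass(frozen=True)
class If:
    predicate: object
    true_branch: object
    false_branch: object

NULL_INTOLERANT = (Substring,)

def eliminate_redundant_null_check(expr):
    """Rewrite If(IsNull(c), c|null, f(c)) -> f(c); else return expr unchanged."""
    if (
        isinstance(expr, If)
        # 1. the IF condition is an explicit null check of a column c
        and isinstance(expr.predicate, IsNull)
        and isinstance(expr.predicate.child, Column)
        # 2. the true branch returns either c itself or null
        and (expr.true_branch == expr.predicate.child
             or isinstance(expr.true_branch, Null))
        # 3. the false branch is null-intolerant with c as a direct child
        and isinstance(expr.false_branch, NULL_INTOLERANT)
        and expr.false_branch.child == expr.predicate.child
    ):
        return expr.false_branch
    return expr
```

The conditions are checked in the same order as in the issue, so the cheap structural tests short-circuit before the more specific ones.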
[jira] [Commented] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962021#comment-16962021 ] Yuming Wang commented on SPARK-27763: - +1 for porting: {noformat} alter_table.sql create_table.sql groupingsets.sql insert.sql limit.sql {noformat} > Port test cases from PostgreSQL to Spark SQL > > > Key: SPARK-27763 > URL: https://issues.apache.org/jira/browse/SPARK-27763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > To improve the test coverage, we can port the regression tests from the other > popular open source projects to Spark SQL. PostgreSQL is one of the best SQL > systems. Below are the links to the test cases and results. > * Regression test cases: > [https://github.com/postgres/postgres/tree/master/src/test/regress/sql] > * Expected results: > [https://github.com/postgres/postgres/tree/master/src/test/regress/expected] > Spark SQL does not support all the feature sets of PostgreSQL. In the current > stage, we should first comment out these test cases and create the > corresponding JIRAs in SPARK-27764. We can discuss and prioritize which > features we should support. Also, these PostgreSQL regression tests could > also expose the existing bugs of Spark SQL. We should also create the JIRAs > and track them in SPARK-27764. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian
[ https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-26985: - Fix Version/s: 2.4.5 > Test "access only some column of the all of columns " fails on big endian > - > > Key: SPARK-26985 > URL: https://issues.apache.org/jira/browse/SPARK-26985 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: Linux Ubuntu 16.04 > openjdk version "1.8.0_202" > OpenJDK Runtime Environment (build 1.8.0_202-b08) > Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed > References 20190205_218 (JIT enabled, AOT enabled) > OpenJ9 - 90dd8cb40 > OMR - d2f4534b > JCL - d002501a90 based on jdk8u202-b08) > >Reporter: Anuja Jakhade >Assignee: ketan kunde >Priority: Major > Labels: BigEndian, correctness > Fix For: 2.4.5, 3.0.0 > > Attachments: DataFrameTungstenSuite.txt, > InMemoryColumnarQuerySuite.txt, access only some column of the all of > columns.txt > > > While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am > observing test failures for 2 Suites of Project SQL. > 1. InMemoryColumnarQuerySuite > 2. DataFrameTungstenSuite > In both the cases test "access only some column of the all of columns" fails > due to mismatch in the final assert. > Observed that the data obtained after df.cache() is causing the error. Please > find attached the log with the details. > cache() works perfectly fine if double and float values are not in picture. > Inside test !!- access only some column of the all of columns *** FAILED > *** -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29628) Forcibly create a temporary view in CREATE VIEW if referencing a temporary view
[ https://issues.apache.org/jira/browse/SPARK-29628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961973#comment-16961973 ] Takeshi Yamamuro commented on SPARK-29628: -- Thanks! > Forcibly create a temporary view in CREATE VIEW if referencing a temporary > view > --- > > Key: SPARK-29628 > URL: https://issues.apache.org/jira/browse/SPARK-29628 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > {code} > CREATE TEMPORARY VIEW temp_table AS SELECT * FROM VALUES > (1, 1) as temp_table(a, id); > CREATE VIEW v1_temp AS SELECT * FROM temp_table; > // In Spark > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v1_temp` by referencing a temporary > view `temp_table`; > // In PostgreSQL > NOTICE: view "v1_temp" will be a temporary view > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961970#comment-16961970 ] Takeshi Yamamuro commented on SPARK-27763: -- Hi, all, I've checked the other left (not-merged-yet) regressions tests in PgSQL: [https://github.com/postgres/postgres/tree/REL_12_STABLE/src/test/regress/sql] In my opinion, there are more five tests (see a list in the bottom) that we need to check for porting, because they are more basic ones than the others. If there is no problem for that, I'll check the left five tests (make PRs if necessary) then resolve this parent issue as resolved. WDYT? [~smilegator] [~dongjoon] [~hyukjin.kwon] [~yumwang] {code:java} -- the left regression tests we need to check alter_table.sql create_table.sql groupingsets.sql insert.sql limit.sql -- the regression tests we don't port now advisory_lock.sql alter_generic.sql alter_operator.sql amutils.sql arrays.sql async.sql bit.sql bitmapops.sql box.sql brin.sql btree_index.sql char.sql circle.sql cluster.sql collate.icu.utf8.sql collate.linux.utf8.sql collate.sql combocid.sql conversion.sql copy2.sql copydml.sql copyselect.sql create_aggregate.sql create_am.sql create_cast.sql create_function_3.sql create_index.sql create_index_spgist.sql create_misc.sql create_operator.sql create_procedure.sql create_table_like.sql create_type.sql dbsize.sql delete.sql dependency.sql domain.sql drop_if_exists.sql drop_operator.sql enum.sql equivclass.sql errors.sql event_trigger.sql expressions.sql fast_default.sql foreign_data.sql foreign_key.sql functional_deps.sql generated.sql geometry.sql gin.sql gist.sql guc.sql hash_func.sql hash_index.sql hash_part.sql horology.sql hs_primary_extremes.sql hs_primary_setup.sql hs_standby_allowed.sql hs_standby_check.sql hs_standby_disallowed.sql hs_standby_functions.sql identity.sql index_including.sql index_including_gist.sql indexing.sql indirect_toast.sql inet.sql inherit.sql init_privs.sql insert_conflict.sql 
join_hash.sql json.sql json_encoding.sql jsonb.sql jsonb_jsonpath.sql jsonpath.sql jsonpath_encoding.sql line.sql lock.sql lseg.sql macaddr.sql macaddr8.sql matview.sql misc_functions.sql misc_sanity.sql money.sql name.sql namespace.sql numeric_big.sql numerology.sql object_address.sql oid.sql oidjoins.sql opr_sanity.sql partition_aggregate.sql partition_info.sql partition_join.sql partition_prune.sql password.sql path.sql pg_lsn.sql pg_regress.list plancache.sql plpgsql.sql point.sql polygon.sql polymorphism.sql portals.sql portals_p2.sql prepare.sql prepared_xacts.sql privileges.sql psql.sql psql_crosstab.sql publication.sql random.sql rangefuncs.sql rangetypes.sql regex.linux.utf8.sql regex.sql regproc.sql reindex_catalog.sql reloptions.sql replica_identity.sql returning.sql roleattributes.sql rowsecurity.sql rowtypes.sql rules.sql sanity_check.sql security_label.sql select_distinct_on.sql select_into.sql select_parallel.sql select_views.sql sequence.sql spgist.sql stats.sql stats_ext.sql subscription.sql subselect.sql sysviews.sql tablesample.sql temp.sql tidscan.sql time.sql timestamptz.sql timetz.sql transactions.sql triggers.sql truncate.sql tsdicts.sql tsearch.sql tsrf.sql tstypes.sql txid.sql type_sanity.sql typed_table.sql updatable_views.sql update.sql uuid.sql vacuum.sql varchar.sql write_parallel.sql xml.sql xmlmap.sql {code} > Port test cases from PostgreSQL to Spark SQL > > > Key: SPARK-27763 > URL: https://issues.apache.org/jira/browse/SPARK-27763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > To improve the test coverage, we can port the regression tests from the other > popular open source projects to Spark SQL. PostgreSQL is one of the best SQL > systems. Below are the links to the test cases and results. 
> * Regression test cases: > [https://github.com/postgres/postgres/tree/master/src/test/regress/sql] > * Expected results: > [https://github.com/postgres/postgres/tree/master/src/test/regress/expected] > Spark SQL does not support all the feature sets of PostgreSQL. In the current > stage, we should first comment out these test cases and create the > corresponding JIRAs in SPARK-27764. We can discuss and prioritize which > features we should support. Also, these PostgreSQL regression tests could > also expose the existing bugs of Spark SQL. We should also create the JIRAs > and track them in SPARK-27764. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29595) Insertion with named_struct should match by name
[ https://issues.apache.org/jira/browse/SPARK-29595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961964#comment-16961964 ] Sean R. Owen commented on SPARK-29595: -- As I recall that behavior is intended to match SQL semantics? I don't recall the details of the discussion. > Insertion with named_struct should match by name > > > Key: SPARK-29595 > URL: https://issues.apache.org/jira/browse/SPARK-29595 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > {code:java} > spark-sql> create table str using parquet as(select named_struct('a', 1, 'b', > 2) as data); > spark-sql> insert into str values named_struct("b", 3, "a", 1); > spark-sql> select * from str; > {"a":3,"b":1} > {"a":1,"b":2} > {code} > The result should be > {code:java} > {"a":1,"b":3} > {"a":1,"b":2} > {code} > Spark should match the field names of named_struct on insertion -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29451) Some queries with divisions in SQL windows are failing in Thrift
[ https://issues.apache.org/jira/browse/SPARK-29451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29451: - Parent: SPARK-27764 Issue Type: Sub-task (was: Bug) > Some queries with divisions in SQL windows are failling in Thrift > - > > Key: SPARK-29451 > URL: https://issues.apache.org/jira/browse/SPARK-29451 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > Hello, > the following queries are not properly working on Thrift. The only difference > between them and some other queries that works fine are the numeric > divisions, I think. > {code:sql} > SELECT four, ten/4 as two, > sum(ten/4) over (partition by four order by ten/4 rows between unbounded > preceding and current row), > last(ten/4) over (partition by four order by ten/4 rows between unbounded > preceding and current row) > FROM (select distinct ten, four from tenk1) ss; > {code} > {code:sql} > SELECT four, ten/4 as two, > sum(ten/4) over (partition by four order by ten/4 range between unbounded > preceding and current row), > last(ten/4) over (partition by four order by ten/4 range between unbounded > preceding and current row) > FROM (select distinct ten, four from tenk1) ss; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29451) Some queries with divisions in SQL windows are failing in Thrift
[ https://issues.apache.org/jira/browse/SPARK-29451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29451: - Parent: (was: SPARK-27763) Issue Type: Bug (was: Sub-task) > Some queries with divisions in SQL windows are failling in Thrift > - > > Key: SPARK-29451 > URL: https://issues.apache.org/jira/browse/SPARK-29451 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > Hello, > the following queries are not properly working on Thrift. The only difference > between them and some other queries that works fine are the numeric > divisions, I think. > {code:sql} > SELECT four, ten/4 as two, > sum(ten/4) over (partition by four order by ten/4 rows between unbounded > preceding and current row), > last(ten/4) over (partition by four order by ten/4 rows between unbounded > preceding and current row) > FROM (select distinct ten, four from tenk1) ss; > {code} > {code:sql} > SELECT four, ten/4 as two, > sum(ten/4) over (partition by four order by ten/4 range between unbounded > preceding and current row), > last(ten/4) over (partition by four order by ten/4 range between unbounded > preceding and current row) > FROM (select distinct ten, four from tenk1) ss; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27676) InMemoryFileIndex should hard-fail on missing files instead of logging and continuing
[ https://issues.apache.org/jira/browse/SPARK-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961899#comment-16961899 ] siddartha sagar chinne commented on SPARK-27676: hi [~ste...@apache.org] can this fix be made available in spark 2.x. it would really help us running spark on aws until spark 3 is released. > InMemoryFileIndex should hard-fail on missing files instead of logging and > continuing > - > > Key: SPARK-27676 > URL: https://issues.apache.org/jira/browse/SPARK-27676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Fix For: 3.0.0 > > > Spark's {{InMemoryFileIndex}} contains two places where {{FileNotFound}} > exceptions are caught and logged as warnings (during [directory > listing|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274] > and [block location > lookup|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L333]). > I think that this is a dangerous default behavior and would prefer that > Spark hard-fails by default (with the ignore-and-continue behavior guarded by > a SQL session configuration). > In SPARK-17599 and SPARK-24364, logic was added to ignore missing files. > Quoting from the PR for SPARK-17599: > {quote}The {{ListingFileCatalog}} lists files given a set of resolved paths. > If a folder is deleted at any time between the paths were resolved and the > file catalog can check for the folder, the Spark job fails. This may abruptly > stop long running StructuredStreaming jobs for example. > Folders may be deleted by users or automatically by retention policies. These > cases should not prevent jobs from successfully completing. 
> {quote} > Let's say that I'm *not* expecting to ever delete input files for my job. In > that case, this behavior can mask bugs. > One straightforward masked bug class is accidental file deletion: if I'm > never expecting to delete files then I'd prefer to fail my job if Spark sees > deleted files. > A more subtle bug can occur when using a S3 filesystem. Say I'm running a > Spark job against a partitioned Parquet dataset which is laid out like this: > {code:java} > data/ > date=1/ > region=west/ >0.parquet >1.parquet > region=east/ >0.parquet >1.parquet{code} > If I do {{spark.read.parquet("/data/date=1/")}} then Spark needs to perform > multiple rounds of file listing, first listing {{/data/date=1}} to discover > the partitions for that date, then listing within each partition to discover > the leaf files. Due to the eventual consistency of S3 ListObjects, it's > possible that the first listing will show the {{region=west}} and > {{region=east}} partitions existing and then the next-level listing fails to > return any for some of the directories (e.g. {{/data/date=1/}} returns files > but {{/data/date=1/region=west/}} throws a {{FileNotFoundException}} in S3A > due to ListObjects inconsistency). > If Spark propagated the {{FileNotFoundException}} and hard-failed in this > case then I'd be able to fail the job in this case where we _definitely_ know > that the S3 listing is inconsistent (failing here doesn't guard against _all_ > potential S3 list inconsistency issues (e.g. back-to-back listings which both > return a subset of the true set of objects), but I think it's still an > improvement to fail for the subset of cases that we _can_ detect even if > that's not a surefire failsafe against the more general problem). 
> Finally, I'm unsure if the original patch will have the desired effect: if a > file is deleted once a Spark job expects to read it then that can cause > problems at multiple layers, both in the driver (multiple rounds of file > listing) and in executors (if the deletion occurs after the construction of > the catalog but before the scheduling of the read tasks); I think the > original patch only resolved the problem for the driver (unless I'm missing > similar executor-side code specific to the original streaming use-case). > Given all of these reasons, I think that the "ignore potentially deleted > files during file index listing" behavior should be guarded behind a feature > flag which defaults to {{false}}, consistent with the existing > {{spark.files.ignoreMissingFiles}} and {{spark.sql.files.ignoreMissingFiles}} > flags (which both default to false). -- This message was sent by Atlassian Jira (v8.3.4#803005)
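The proposal above amounts to guarding ignore-and-continue behind a flag that defaults to hard failure. A plain-Python sketch of that shape (hypothetical helper, not Spark's `InMemoryFileIndex` API):

```python
# Sketch of the proposed default: a listing step that propagates
# FileNotFoundError unless the caller explicitly opts in to logging
# and skipping missing paths.

import logging

def list_leaf_files(paths, lister, ignore_missing_files=False):
    files = []
    for path in paths:
        try:
            files.extend(lister(path))
        except FileNotFoundError:
            if not ignore_missing_files:
                raise          # hard-fail by default, as the issue proposes
            logging.warning("Skipping missing path: %s", path)
    return files

def fake_lister(path):
    # Simulates an S3-style listing where one partition directory
    # "disappears" between listing rounds (ListObjects inconsistency).
    listings = {"/data/date=1/region=east/": ["0.parquet", "1.parquet"]}
    if path not in listings:
        raise FileNotFoundError(path)
    return listings[path]

paths = ["/data/date=1/region=east/", "/data/date=1/region=west/"]
try:
    list_leaf_files(paths, fake_lister)
    failed = False
except FileNotFoundError:
    failed = True
assert failed  # default behavior: surface the inconsistency
assert list_leaf_files(paths, fake_lister, ignore_missing_files=True) == [
    "0.parquet", "1.parquet"]
```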
[jira] [Commented] (SPARK-29637) SHS Endpoint /applications//jobs/ doesn't include description
[ https://issues.apache.org/jira/browse/SPARK-29637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961872#comment-16961872 ] Gabor Somogyi commented on SPARK-29637: --- I'm working on this. > SHS Endpoint /applications//jobs/ doesn't include description > - > > Key: SPARK-29637 > URL: https://issues.apache.org/jira/browse/SPARK-29637 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > Starting from Spark 2.3, the SHS REST API endpoint > /applications//jobs/ is not including description in the JobData > returned. This is not the case until Spark 2.2. > Steps to reproduce: > * Open spark-shell > {code:java} > scala> sc.setJobGroup("test", "job", false); > scala> val foo = sc.textFile("/user/foo.txt"); > foo: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at > textFile at :24 > scala> foo.foreach(println); > {code} > * Access end REST API > [http://SHS-host:port/api/v1/applications/|http://shs-host:port/api/v1/applications/]/jobs/ > * REST API of Spark 2.3 and above will not return description -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29638) Spark handles 'NaN' as 0 in sums
Dylan Guedes created SPARK-29638: Summary: Spark handles 'NaN' as 0 in sums Key: SPARK-29638 URL: https://issues.apache.org/jira/browse/SPARK-29638 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Dylan Guedes Currently, Spark handles 'NaN' as 0 in window functions, such that 3+'NaN'=3. PgSQL, on the other hand, handles the entire result as 'NaN', as in 3+'NaN' = 'NaN' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
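The two semantics contrasted in the report can be sketched in plain Python: treating NaN as 0 amounts to skipping it when summing, while PostgreSQL lets NaN poison the whole aggregate.

```python
# Sketch of the two aggregation semantics described above: the reported
# Spark behavior effectively skips NaN (3 + NaN = 3), while PostgreSQL
# propagates it (3 + NaN = NaN).

import math

def sum_skip_nan(values):
    return sum(v for v in values if not math.isnan(v))

def sum_propagate_nan(values):
    total = 0.0
    for v in values:
        total += v          # any NaN poisons the running total
    return total

values = [3.0, float("nan")]
assert sum_skip_nan(values) == 3.0              # reported Spark behavior
assert math.isnan(sum_propagate_nan(values))    # PostgreSQL behavior
```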
[jira] [Updated] (SPARK-29637) SHS Endpoint /applications//jobs/ doesn't include description
[ https://issues.apache.org/jira/browse/SPARK-29637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-29637: -- Description: Starting from Spark 2.3, the SHS REST API endpoint /applications//jobs/ is not including description in the JobData returned. This is not the case until Spark 2.2. Steps to reproduce: * Open spark-shell {code:java} scala> sc.setJobGroup("test", "job", false); scala> val foo = sc.textFile("/user/foo.txt"); foo: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at textFile at :24 scala> foo.foreach(println); {code} * Access end REST API [http://SHS-host:port/api/v1/applications/|http://shs-host:port/api/v1/applications/]/jobs/ * REST API of Spark 2.3 and above will not return description was: Starting from Spark 2.3, the SHS REST API endpoint /applications//jobs/ is not including description in the JobData returned. This is not the case until Spark 2.2. Steps to reproduce: * Open spark-shell {code:java} scala> sc.setJobGroup("test", "job", false); scala> val foo = sc.textFile("/user/foo.txt"); thorn: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at textFile at :24 scala> foo.foreach(println); {code} * Access end REST API http://SHS-host:port/api/v1/applications//jobs/ * REST API of Spark 2.3 and above will not return description > SHS Endpoint /applications//jobs/ doesn't include description > - > > Key: SPARK-29637 > URL: https://issues.apache.org/jira/browse/SPARK-29637 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > Starting from Spark 2.3, the SHS REST API endpoint > /applications//jobs/ is not including description in the JobData > returned. This is not the case until Spark 2.2. 
> Steps to reproduce: > * Open spark-shell > {code:java} > scala> sc.setJobGroup("test", "job", false); > scala> val foo = sc.textFile("/user/foo.txt"); > foo: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at > textFile at :24 > scala> foo.foreach(println); > {code} > * Access end REST API > [http://SHS-host:port/api/v1/applications/|http://shs-host:port/api/v1/applications/]/jobs/ > * REST API of Spark 2.3 and above will not return description -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29637) SHS Endpoint /applications//jobs/ doesn't include description
[ https://issues.apache.org/jira/browse/SPARK-29637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-29637: -- Issue Type: Bug (was: Improvement) > SHS Endpoint /applications//jobs/ doesn't include description > - > > Key: SPARK-29637 > URL: https://issues.apache.org/jira/browse/SPARK-29637 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > Starting from Spark 2.3, the SHS REST API endpoint > /applications//jobs/ is not including description in the JobData > returned. This is not the case until Spark 2.2. > Steps to reproduce: > * Open spark-shell > {code:java} > scala> sc.setJobGroup("test", "job", false); > scala> val foo = sc.textFile("/user/foo.txt"); > foo: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at > textFile at :24 > scala> foo.foreach(println); > {code} > * Access end REST API > [http://SHS-host:port/api/v1/applications/|http://shs-host:port/api/v1/applications/]/jobs/ > * REST API of Spark 2.3 and above will not return description -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29637) SHS Endpoint /applications//jobs/ doesn't include description
Gabor Somogyi created SPARK-29637: - Summary: SHS Endpoint /applications//jobs/ doesn't include description Key: SPARK-29637 URL: https://issues.apache.org/jira/browse/SPARK-29637 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.4, 2.3.4, 3.0.0 Reporter: Gabor Somogyi Starting from Spark 2.3, the SHS REST API endpoint /applications//jobs/ is not including description in the JobData returned. This is not the case until Spark 2.2. Steps to reproduce: * Open spark-shell {code:java} scala> sc.setJobGroup("test", "job", false); scala> val foo = sc.textFile("/user/foo.txt"); thorn: org.apache.spark.rdd.RDD[String] = /user/foo.txt MapPartitionsRDD[1] at textFile at :24 scala> foo.foreach(println); {code} * Access end REST API http://SHS-host:port/api/v1/applications//jobs/ * REST API of Spark 2.3 and above will not return description -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29636) Can't parse '11:00 BST' or '2000-10-19 10:23:54+01' signatures to timestamp
Dylan Guedes created SPARK-29636: Summary: Can't parse '11:00 BST' or '2000-10-19 10:23:54+01' signatures to timestamp Key: SPARK-29636 URL: https://issues.apache.org/jira/browse/SPARK-29636 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Dylan Guedes Currently, Spark can't parse a string such as '11:00 BST' or '2000-10-19 10:23:54+01' to timestamp: {code:sql} spark-sql> select cast ('11:00 BST' as timestamp); NULL Time taken: 2.248 seconds, Fetched 1 row(s) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
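Of the two formats, `'2000-10-19 10:23:54+01'` uses a short numeric zone offset. A plain-Python sketch (not Spark's parser) showing one way to handle it, by padding the offset so `strptime`'s `%z` accepts it; the abbreviation form `'11:00 BST'` is harder, since zone abbreviations like BST are ambiguous and need an explicit mapping to an offset:

```python
# Sketch: parse a timestamp with a short numeric offset ('+01') by
# normalizing it to '+0100', which strptime's %z directive accepts.

from datetime import datetime, timedelta
import re

def parse_with_short_offset(text):
    # Pad a trailing +HH / -HH offset out to +HHMM / -HHMM.
    normalized = re.sub(r"([+-]\d{2})$", r"\g<1>00", text)
    return datetime.strptime(normalized, "%Y-%m-%d %H:%M:%S%z")

ts = parse_with_short_offset("2000-10-19 10:23:54+01")
assert ts.utcoffset() == timedelta(hours=1)
```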
[jira] [Commented] (SPARK-29595) Insertion with named_struct should match by name
[ https://issues.apache.org/jira/browse/SPARK-29595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961857#comment-16961857 ] Aman Omer commented on SPARK-29595: --- Raised a PR [https://github.com/apache/spark/pull/26293] . Kindly review. Thanks. > Insertion with named_struct should match by name > > > Key: SPARK-29595 > URL: https://issues.apache.org/jira/browse/SPARK-29595 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > {code:java} > spark-sql> create table str using parquet as(select named_struct('a', 1, 'b', > 2) as data); > spark-sql> insert into str values named_struct("b", 3, "a", 1); > spark-sql> select * from str; > {"a":3,"b":1} > {"a":1,"b":2} > {code} > The result should be > {code:java} > {"a":1,"b":3} > {"a":1,"b":2} > {code} > Spark should match the field names of named_struct on insertion -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29634) spark-sql can't query hive table values with schema Char by equality.
[ https://issues.apache.org/jira/browse/SPARK-29634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961828#comment-16961828 ] Zhaoyang Qin edited comment on SPARK-29634 at 10/29/19 9:42 AM: About Char datatype,Hive Docs sees: [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-CharcharChar] But spark-sql CLI read this table by *select * from foo where bar = 'something'* , There will be no results.You must add Spaces like this: *select * from foo where bar = 'something '.* This causes normal SQL fail to execute correctly. And many TPCDS statements get no results,eg: sql7,sql71,sql 72 etc. was (Author: kelvin.fe): About Char datatype,Hive Docs sees: [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-CharcharChar] But spark-sql CLI read this table by *select * from foo where bar = 'something'* ,There will be no results.You must add Spaces like this: *select * from foo where bar = 'something '.* This causes normal SQL fail to execute correctly. And many TPCDS statements to get no results.eg: sql7,sql71,sql 72 etc. > spark-sql can't query hive table values with schema Char by equality. > - > > Key: SPARK-29634 > URL: https://issues.apache.org/jira/browse/SPARK-29634 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1, 2.3.1 > Environment: Spark 2.3.1 > Hive 3.0.0 > TPCDS Data & Tables >Reporter: Zhaoyang Qin >Priority: Major > Labels: spark-sql, spark-sql-perf > > spark-sql can't query hive table values that with schema Char by equality. > When i use spark-sql CLI to execute a query with hive tables,The expected > results can not be obtained. The query result is empty. 
Some equality > conditions did not work as expected.I checked and found that the table fields > had one thing in common: they were created as char,sometimes as varchar.Then > I execute the following statement and return an empty result: select * from > foo where bar = 'something'.In fact, the data does exist. Using hive sql > returns the correct results. > Simulation steps:(use spark-sql) > {code:java} > //spark-sql > {code} > ./spark-sql > > create table test1 (name char(10), age int); > > insert into test1 values('erya',15); > > select * from test1 where name = 'erya'; // no results > > select * from test1 where name = 'erya '; // add 6 spaces -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
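The behavior described above follows from CHAR's fixed-length padding: the stored value is space-padded to the declared length, and Hive's comparison ignores those trailing spaces while the reported spark-sql behavior compares the padded bytes directly. A plain-Python sketch of the two semantics:

```python
# Sketch of CHAR(10) comparison semantics: the stored value is
# space-padded to the declared length; a CHAR-aware comparison strips
# trailing spaces (Hive's behavior), while a naive binary comparison
# (the behavior reported for spark-sql) requires the padding to match.

def char_store(value, length):
    return value.ljust(length)              # 'erya' -> 'erya      '

def char_equals(a, b):
    return a.rstrip(" ") == b.rstrip(" ")   # trailing spaces insignificant

stored = char_store("erya", 10)
assert stored != "erya"                 # naive equality: no match
assert stored == "erya" + " " * 6       # matches only with the padding spelled out
assert char_equals(stored, "erya")      # CHAR semantics: match
```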
[jira] [Updated] (SPARK-29634) spark-sql can't query hive table values with schema Char by equality.
[ https://issues.apache.org/jira/browse/SPARK-29634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhaoyang Qin updated SPARK-29634: - Summary: spark-sql can't query hive table values with schema Char by equality. (was: spark-sql can't query hive table values that with schema Char by equality.) > spark-sql can't query hive table values with schema Char by equality. > - > > Key: SPARK-29634 > URL: https://issues.apache.org/jira/browse/SPARK-29634 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1, 2.3.1 > Environment: Spark 2.3.1 > Hive 3.0.0 > TPCDS Data & Tables >Reporter: Zhaoyang Qin >Priority: Major > Labels: spark-sql, spark-sql-perf > > spark-sql can't query hive table values that with schema Char by equality. > When i use spark-sql CLI to execute a query with hive tables,The expected > results can not be obtained. The query result is empty. Some equality > conditions did not work as expected.I checked and found that the table fields > had one thing in common: they were created as char,sometimes as varchar.Then > I execute the following statement and return an empty result: select * from > foo where bar = 'something'.In fact, the data does exist. Using hive sql > returns the correct results. > Simulation steps:(use spark-sql) > {code:java} > //spark-sql > {code} > ./spark-sql > > create table test1 (name char(10), age int); > > insert into test1 values('erya',15); > > select * from test1 where name = 'erya'; // no results > > select * from test1 where name = 'erya '; // add 6 spaces -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29634) spark-sql can't query hive table values that with schema Char by equality.
[ https://issues.apache.org/jira/browse/SPARK-29634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhaoyang Qin updated SPARK-29634: - Description: spark-sql can't query hive table values that with schema Char by equality. When i use spark-sql CLI to execute a query with hive tables,The expected results can not be obtained. The query result is empty. Some equality conditions did not work as expected.I checked and found that the table fields had one thing in common: they were created as char,sometimes as varchar.Then I execute the following statement and return an empty result: select * from foo where bar = 'something'.In fact, the data does exist. Using hive sql returns the correct results. Simulation steps:(use spark-sql) {code:java} //spark-sql {code} ./spark-sql > create table test1 (name char(10), age int); > insert into test1 values('erya',15); > select * from test1 where name = 'erya'; // no results > select * from test1 where name = 'erya '; // add 6 spaces was: spark-sql can't query hive table values that with schema Char by equality. When i use spark-sql CLI to execute a query with hive tables,The expected results can not be obtained. The query result is empty. Some equality conditions did not work as expected.I checked and found that the table fields had one thing in common: they were created as char,sometimes as varchar.Then I execute the following statement and return an empty result: select * from foo where bar = 'something'.In fact, the data does exist. Using hive sql returns the correct results. > spark-sql can't query hive table values that with schema Char by equality. 
> -- > > Key: SPARK-29634 > URL: https://issues.apache.org/jira/browse/SPARK-29634 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1, 2.3.1 > Environment: Spark 2.3.1 > Hive 3.0.0 > TPCDS Data & Tables >Reporter: Zhaoyang Qin >Priority: Major > Labels: spark-sql, spark-sql-perf > > spark-sql can't query hive table values that with schema Char by equality. > When i use spark-sql CLI to execute a query with hive tables,The expected > results can not be obtained. The query result is empty. Some equality > conditions did not work as expected.I checked and found that the table fields > had one thing in common: they were created as char,sometimes as varchar.Then > I execute the following statement and return an empty result: select * from > foo where bar = 'something'.In fact, the data does exist. Using hive sql > returns the correct results. > Simulation steps:(use spark-sql) > {code:java} > //spark-sql > {code} > ./spark-sql > > create table test1 (name char(10), age int); > > insert into test1 values('erya',15); > > select * from test1 where name = 'erya'; // no results > > select * from test1 where name = 'erya '; // add 6 spaces -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29634) spark-sql can't query hive table values that with schema Char by equality.
[ https://issues.apache.org/jira/browse/SPARK-29634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961830#comment-16961830 ] Zhaoyang Qin commented on SPARK-29634: -- Hive doc says: {panel:title=Char} Char types are similar to Varchar but they are fixed-length meaning that values shorter than the specified length value are padded with spaces but trailing spaces are not important during comparisons. The maximum length is fixed at 255. {panel} > spark-sql can't query hive table values that with schema Char by equality. > -- > > Key: SPARK-29634 > URL: https://issues.apache.org/jira/browse/SPARK-29634 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1, 2.3.1 > Environment: Spark 2.3.1 > Hive 3.0.0 > TPCDS Data & Tables >Reporter: Zhaoyang Qin >Priority: Major > Labels: spark-sql, spark-sql-perf > > spark-sql can't query hive table values that with schema Char by equality. > When i use spark-sql CLI to execute a query with hive tables,The expected > results can not be obtained. The query result is empty. Some equality > conditions did not work as expected.I checked and found that the table fields > had one thing in common: they were created as char,sometimes as varchar.Then > I execute the following statement and return an empty result: select * from > foo where bar = 'something'.In fact, the data does exist. Using hive sql > returns the correct results. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29634) spark-sql can't query hive table values that with schema Char by equality.
[ https://issues.apache.org/jira/browse/SPARK-29634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961828#comment-16961828 ] Zhaoyang Qin commented on SPARK-29634: -- About Char datatype,Hive Docs sees: [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-CharcharChar] But spark-sql CLI read this table by *select * from foo where bar = 'something'* ,There will be no results.You must add Spaces like this: *select * from foo where bar = 'something '.* This causes normal SQL fail to execute correctly. And many TPCDS statements to get no results.eg: sql7,sql71,sql 72 etc. > spark-sql can't query hive table values that with schema Char by equality. > -- > > Key: SPARK-29634 > URL: https://issues.apache.org/jira/browse/SPARK-29634 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1, 2.3.1 > Environment: Spark 2.3.1 > Hive 3.0.0 > TPCDS Data & Tables >Reporter: Zhaoyang Qin >Priority: Major > Labels: spark-sql, spark-sql-perf > > spark-sql can't query hive table values that with schema Char by equality. > When i use spark-sql CLI to execute a query with hive tables,The expected > results can not be obtained. The query result is empty. Some equality > conditions did not work as expected.I checked and found that the table fields > had one thing in common: they were created as char,sometimes as varchar.Then > I execute the following statement and return an empty result: select * from > foo where bar = 'something'.In fact, the data does exist. Using hive sql > returns the correct results. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29635) Deduplicate test suites between Kafka micro-batch sink and Kafka continuous sink
Jungtaek Lim created SPARK-29635: Summary: Deduplicate test suites between Kafka micro-batch sink and Kafka continuous sink Key: SPARK-29635 URL: https://issues.apache.org/jira/browse/SPARK-29635 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Jungtaek Lim There's a comment in KafkaContinuousSinkSuite which is most likely explaining TODO: https://github.com/apache/spark/blob/37690dea107623ebca1e47c64db59196ee388f2f/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSinkSuite.scala#L35-L39 {noformat} /** * This is a temporary port of KafkaSinkSuite, since we do not yet have a V2 memory stream. * Once we have one, this will be changed to a specialization of KafkaSinkSuite and we won't have * to duplicate all the code. */ {noformat} Given latest master branch has V2 memory stream now, it is a good time to deduplicate two suites into one, via having base class and let these suites override necessary things. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
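The deduplication the issue proposes, a shared base suite with per-sink overrides, can be sketched in miniature (Python stand-in for the Scala suites; class and method names here are hypothetical):

```python
# Sketch of the base-class approach: shared test logic lives in one
# place, and each sink suite overrides only what differs between the
# micro-batch and continuous execution modes.

class KafkaSinkSuiteBase:
    mode = None                      # overridden by concrete suites

    def run_round_trip_test(self):
        # Shared body: in the real suites this would write through the
        # Kafka sink and read the data back; here it just reports which
        # mode exercised the shared logic.
        return f"round-trip ok ({self.mode})"

class MicroBatchSinkSuite(KafkaSinkSuiteBase):
    mode = "micro-batch"

class ContinuousSinkSuite(KafkaSinkSuiteBase):
    mode = "continuous"

assert MicroBatchSinkSuite().run_round_trip_test() == "round-trip ok (micro-batch)"
assert ContinuousSinkSuite().run_round_trip_test() == "round-trip ok (continuous)"
```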
[jira] [Created] (SPARK-29634) spark-sql can't query hive table values that with schema Char by equality.
Zhaoyang Qin created SPARK-29634: Summary: spark-sql can't query hive table values that with schema Char by equality. Key: SPARK-29634 URL: https://issues.apache.org/jira/browse/SPARK-29634 Project: Spark Issue Type: Bug Components: Spark Shell, SQL Affects Versions: 2.3.1, 2.2.1 Environment: Spark 2.3.1 Hive 3.0.0 TPCDS Data & Tables Reporter: Zhaoyang Qin spark-sql can't query hive table values that with schema Char by equality. When i use spark-sql CLI to execute a query with hive tables,The expected results can not be obtained. The query result is empty. Some equality conditions did not work as expected.I checked and found that the table fields had one thing in common: they were created as char,sometimes as varchar.Then I execute the following statement and return an empty result: select * from foo where bar = 'something'.In fact, the data does exist. Using hive sql returns the correct results. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961797#comment-16961797 ] Aman Omer commented on SPARK-29630: --- cc [~srowen] [~LI,Xiao] [~cloud_fan] > Not allowed to create a permanent view by referencing a temporary view in > EXISTS > > > Key: SPARK-29630 > URL: https://issues.apache.org/jira/browse/SPARK-29630 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > // In the master, the query below fails > $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * > FROM temp_table) t2; > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v7_temp` by referencing a temporary > view `temp_table`; > // In the master, the query below passed, but this should fail > $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM > temp_table); > Passed > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961785#comment-16961785 ] Takeshi Yamamuro edited comment on SPARK-29630 at 10/29/19 8:19 AM: This is a logical view, so I think the view definition holds a logical plan depending on `temp_table`. The execution of the query might work well because of some optimization though. was (Author: maropu): This is a logical view, so I think the view definition holds a logical plan depending on `temp_table`. > Not allowed to create a permanent view by referencing a temporary view in > EXISTS > > > Key: SPARK-29630 > URL: https://issues.apache.org/jira/browse/SPARK-29630 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > // In the master, the query below fails > $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * > FROM temp_table) t2; > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v7_temp` by referencing a temporary > view `temp_table`; > // In the master, the query below passed, but this should fail > $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM > temp_table); > Passed > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961785#comment-16961785 ] Takeshi Yamamuro edited comment on SPARK-29630 at 10/29/19 8:18 AM: This is a logical view, so I think the view definition holds a logical plan depending on `temp_table`. was (Author: maropu): Ah, I see. I'll recheck later. > Not allowed to create a permanent view by referencing a temporary view in > EXISTS > > > Key: SPARK-29630 > URL: https://issues.apache.org/jira/browse/SPARK-29630 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > // In the master, the query below fails > $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * > FROM temp_table) t2; > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v7_temp` by referencing a temporary > view `temp_table`; > // In the master, the query below passed, but this should fail > $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM > temp_table); > Passed > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29595) Insertion with named_struct should match by name
[ https://issues.apache.org/jira/browse/SPARK-29595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961783#comment-16961783 ] Aman Omer edited comment on SPARK-29595 at 10/29/19 8:18 AM: - {color:#172b4d}named_struct takes Seq(name1, val1, name2, val2, ...) as its parameters. The validation step for named_struct only checks for a string at the odd positions. For example, the following query will add a row to the _str_ table.{color} {code:java} insert into str values named_struct( "ab", 1, "ba", 2);{code} According to the discussion in [https://github.com/apache/spark/pull/26275] , which was tackling a similar issue, reordering the fields of a struct type according to their names would introduce complexity. So I think Spark should throw an exception when the names do not match in named_struct. cc [~srowen] [~maropu] was (Author: aman_omer): {color:#172b4d}Parameter required for named_struct is Seq(name1, val1, name2, val2, ...). Validation step for named_struct only check for string at odd places. For example following query will add a row in _str_ table.{color} {code:java} insert into str values named_struct( "ab", 1, "ba", 2);{code} According to the discussion in [https://github.com/apache/spark/pull/26275] , which was tackling similar issue, changing fields of struct type according to names will introduce complexity. So I think Spark should throw an exception when names does not match in named_struct. 
cc [~srowen] [~maropu] > Insertion with named_struct should match by name > > > Key: SPARK-29595 > URL: https://issues.apache.org/jira/browse/SPARK-29595 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > {code:java} > spark-sql> create table str using parquet as(select named_struct('a', 1, 'b', > 2) as data); > spark-sql> insert into str values named_struct("b", 3, "a", 1); > spark-sql> select * from str; > {"a":3,"b":1} > {"a":1,"b":2} > {code} > The result should be > {code:java} > {"a":1,"b":3} > {"a":1,"b":2} > {code} > Spark should match the field names of named_struct on insertion -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29595) Insertion with named_struct should match by name
[ https://issues.apache.org/jira/browse/SPARK-29595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961783#comment-16961783 ] Aman Omer edited comment on SPARK-29595 at 10/29/19 8:17 AM: - {color:#172b4d}Parameter required for named_struct is Seq(name1, val1, name2, val2, ...). Validation step for named_struct only check for string at odd places. For example following query will add a row in _str_ table.{color} {code:java} insert into str values named_struct( "ab", 1, "ba", 2);{code} According to the discussion in [https://github.com/apache/spark/pull/26275] , which was tackling similar issue, changing fields of struct type according to names will introduce complexity. So I think Spark should throw an exception when names does not match in named_struct. cc [~srowen] [~maropu] was (Author: aman_omer): {color:#172b4d}Parameters required for named_struct is Seq(name1, val1, name2, val2, ...). Validation step for named_struct only check for string at odd places. For example following query will add a row in _str_ table.{color}{color} {code:java} insert into str values named_struct( "ab", 1, "ba", 2);{code} According to the discussion in [https://github.com/apache/spark/pull/26275] , which was tackling similar issue, changing fields of struct type according to names will introduce complexity. So I think Spark should throw an exception when names does not match in named_struct. 
cc [~srowen] [~maropu] > Insertion with named_struct should match by name > > > Key: SPARK-29595 > URL: https://issues.apache.org/jira/browse/SPARK-29595 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > {code:java} > spark-sql> create table str using parquet as(select named_struct('a', 1, 'b', > 2) as data); > spark-sql> insert into str values named_struct("b", 3, "a", 1); > spark-sql> select * from str; > {"a":3,"b":1} > {"a":1,"b":2} > {code} > The result should be > {code:java} > {"a":1,"b":3} > {"a":1,"b":2} > {code} > Spark should match the field names of named_struct on insertion -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29630: - Parent: SPARK-27764 Issue Type: Sub-task (was: Improvement) > Not allowed to create a permanent view by referencing a temporary view in > EXISTS > > > Key: SPARK-29630 > URL: https://issues.apache.org/jira/browse/SPARK-29630 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > // In the master, the query below fails > $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * > FROM temp_table) t2; > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v7_temp` by referencing a temporary > view `temp_table`; > // In the master, the query below passed, but this should fail > $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM > temp_table); > Passed > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29595) Insertion with named_struct should match by name
[ https://issues.apache.org/jira/browse/SPARK-29595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961783#comment-16961783 ] Aman Omer edited comment on SPARK-29595 at 10/29/19 8:13 AM: - {color:#172b4d}Parameters required for named_struct is Seq(name1, val1, name2, val2, ...). Validation step for named_struct only check for string at odd places. For example following query will add a row in _str_ table.{color}{color} {code:java} insert into str values named_struct( "ab", 1, "ba", 2);{code} According to the discussion in [https://github.com/apache/spark/pull/26275] , which was tackling similar issue, changing fields of struct type according to names will introduce complexity. So I think Spark should throw an exception when names does not match in named_struct. cc [~srowen] [~maropu] was (Author: aman_omer): Parameters required for named_struct is _{color:#172b4d}Seq(name1, val1, name2, val2, ...){color}_{color:#808080}{color:#172b4d}. Validation step for named_struct only check for string at odd places. For example following query will add a row in _str_ table.{color} {color} {code:java} insert into str values named_struct( "ab", 1, "ba", 2);{code} According to the discussion in [https://github.com/apache/spark/pull/26275] , which was tackling similar issue, changing fields of struct type according to names will introduce complexity. So I think Spark should throw an exception when names does not match in named_struct. 
cc [~srowen] [~maropu] > Insertion with named_struct should match by name > > > Key: SPARK-29595 > URL: https://issues.apache.org/jira/browse/SPARK-29595 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > {code:java} > spark-sql> create table str using parquet as(select named_struct('a', 1, 'b', > 2) as data); > spark-sql> insert into str values named_struct("b", 3, "a", 1); > spark-sql> select * from str; > {"a":3,"b":1} > {"a":1,"b":2} > {code} > The result should be > {code:java} > {"a":1,"b":3} > {"a":1,"b":2} > {code} > Spark should match the field names of named_struct on insertion -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961785#comment-16961785 ] Takeshi Yamamuro commented on SPARK-29630: -- Ah, I see. I'll recheck later. > Not allowed to create a permanent view by referencing a temporary view in > EXISTS > > > Key: SPARK-29630 > URL: https://issues.apache.org/jira/browse/SPARK-29630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > // In the master, the query below fails > $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * > FROM temp_table) t2; > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v7_temp` by referencing a temporary > view `temp_table`; > // In the master, the query below passed, but this should fail > $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM > temp_table); > Passed > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29595) Insertion with named_struct should match by name
[ https://issues.apache.org/jira/browse/SPARK-29595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961783#comment-16961783 ] Aman Omer commented on SPARK-29595: --- {color:#172b4d}named_struct takes Seq(name1, val1, name2, val2, ...) as its parameters. The validation step for named_struct only checks for a string at the odd positions. For example, the following query will add a row to the _str_ table.{color} {code:java} insert into str values named_struct( "ab", 1, "ba", 2);{code} According to the discussion in [https://github.com/apache/spark/pull/26275] , which was tackling a similar issue, reordering the fields of a struct type according to their names would introduce complexity. So I think Spark should throw an exception when the names do not match in named_struct. cc [~srowen] [~maropu] > Insertion with named_struct should match by name > > > Key: SPARK-29595 > URL: https://issues.apache.org/jira/browse/SPARK-29595 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > {code:java} > spark-sql> create table str using parquet as(select named_struct('a', 1, 'b', > 2) as data); > spark-sql> insert into str values named_struct("b", 3, "a", 1); > spark-sql> select * from str; > {"a":3,"b":1} > {"a":1,"b":2} > {code} > The result should be > {code:java} > {"a":1,"b":3} > {"a":1,"b":2} > {code} > Spark should match the field names of named_struct on insertion -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
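The by-name matching being proposed can be sketched in plain Python. This is an illustration only, not Spark's analyzer code; `align_named_struct` and its signature are hypothetical:

```python
# Hypothetical sketch (plain Python, not Spark internals): reorder a
# named_struct's (name, value) pairs to match the target schema's field
# order, and fail loudly when the names do not match the schema at all.
def align_named_struct(pairs, target_fields):
    """pairs: [(name1, val1), (name2, val2), ...] as in named_struct;
    target_fields: field names of the destination struct column."""
    by_name = dict(pairs)
    if set(by_name) != set(target_fields):
        # Mirrors the proposal above: throw instead of silently
        # matching by position when names disagree with the schema.
        raise ValueError(f"field names {sorted(by_name)} do not match "
                         f"target schema {sorted(target_fields)}")
    return [by_name[f] for f in target_fields]

# named_struct("b", 3, "a", 1) inserted into struct<a, b> should give (1, 3)
print(align_named_struct([("b", 3), ("a", 1)], ["a", "b"]))  # [1, 3]
```

With this behavior, the `named_struct("ab", 1, "ba", 2)` example above would raise rather than insert a mismatched row.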
[jira] [Commented] (SPARK-29626) notEqual() should return true when the one is null, the other is not null
[ https://issues.apache.org/jira/browse/SPARK-29626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961770#comment-16961770 ] Aman Omer commented on SPARK-29626: --- Looking into this one. > notEqual() should return true when the one is null, the other is not null > - > > Key: SPARK-29626 > URL: https://issues.apache.org/jira/browse/SPARK-29626 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.4.4 >Reporter: zhouhuazheng >Priority: Minor > > When one value is null and the other is not null, we expect the function > notEqual() to return true. > e.g.: > scala> df.show() > +--+---+ > | age| name| > +--+---+ > | null|Michael| > | 30| Andy| > | 19| Justin| > | 35| null| > | 19| Justin| > | null| null| > |Justin| Justin| > | 19| 19| > +--+---+ > scala> df.filter(col("age").notEqual(col("name"))).show > +---+--+ > |age| name| > +---+--+ > | 30| Andy| > | 19|Justin| > | 19|Justin| > +---+--+ > scala> df.filter(col("age").equalTo(col("name"))).show > +--+--+ > | age| name| > +--+--+ > | null| null| > |Justin|Justin| > | 19| 19| > +--+--+ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
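For context, the behavior reported above is standard SQL three-valued logic: when either operand is NULL, both `a = b` and `a <> b` evaluate to NULL, and a filter keeps only rows where the predicate is true, so rows with nulls are dropped by both filters. Spark's null-safe comparison is `<=>` (`Column.eqNullSafe`). A plain-Python sketch of these semantics (an illustration, not Spark's implementation; `None` stands in for SQL NULL):

```python
# Illustrative sketch of SQL three-valued comparison logic.
def sql_eq(a, b):
    return None if a is None or b is None else a == b

def sql_not_equal(a, b):
    eq = sql_eq(a, b)
    return None if eq is None else not eq

def eq_null_safe(a, b):
    # Like Spark's `<=>` / Column.eqNullSafe: NULL <=> NULL is true.
    return a == b

rows = [(None, "Michael"), (30, "Andy"), ("Justin", "Justin"), (None, None)]
# A WHERE clause keeps only rows where the predicate is true (not NULL).
kept = [r for r in rows if sql_not_equal(*r) is True]
print(kept)  # [(30, 'Andy')]
```

This is why `notEqual` drops the `(null, Michael)` row: the predicate evaluates to NULL, not false, and only true rows pass the filter.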
[jira] [Commented] (SPARK-29222) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence
[ https://issues.apache.org/jira/browse/SPARK-29222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961766#comment-16961766 ] huangtianhua commented on SPARK-29222: -- [~hyukjin.kwon], the failure is related to the performance of the ARM instance. We have donated an ARM instance to AMPLab but haven't built the Python job yet, and we plan to donate some higher-performance ARM instances to AMPLab next month. If the issue happens again, I will create a PR to fix it. Thanks a lot. > Flaky test: > pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence > --- > > Key: SPARK-29222 > URL: https://issues.apache.org/jira/browse/SPARK-29222 > Project: Spark > Issue Type: Test > Components: MLlib, Tests >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Minor > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111237/testReport/] > {code:java} > Error Message > 7 != 10 > StacktraceTraceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 429, in test_parameter_convergence > self._eventually(condition, catch_assertions=True) > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 74, in _eventually > raise lastValue > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 65, in _eventually > lastValue = condition() > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 425, in condition > self.assertEqual(len(model_weights), len(batches)) > AssertionError: 7 != 10 >{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
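The `_eventually` helper in the stack trace retries a condition until a timeout, optionally swallowing assertion errors until the deadline; on a slow instance the stream delivers fewer batches than expected before time runs out, producing failures like `7 != 10`. A minimal sketch of such a retry helper (assumed to mirror the general pattern only, not the actual PySpark test utility):

```python
import time

# Minimal sketch of an "eventually"-style retry helper like the
# _eventually seen in the failing test (not the PySpark source).
def eventually(condition, timeout=5.0, interval=0.05, catch_assertions=False):
    deadline = time.monotonic() + timeout
    last_error = None
    while time.monotonic() < deadline:
        try:
            if condition() is not False:
                return  # condition passed (returned truthy or None)
        except AssertionError as e:
            if not catch_assertions:
                raise
            last_error = e  # remember, re-raise only after timeout
        time.sleep(interval)
    if last_error is not None:
        raise last_error
    raise TimeoutError("condition not met within %.1fs" % timeout)

# Usage: succeeds once a slow producer has delivered enough batches.
batches = []
def condition():
    batches.append(object())      # stand-in for "another batch arrived"
    assert len(batches) >= 3, f"{len(batches)} != 3"
eventually(condition, catch_assertions=True)
```

Under this pattern, raising the timeout or lowering the expected batch count is the usual fix for slow test hosts.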
[jira] [Commented] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
[ https://issues.apache.org/jira/browse/SPARK-29630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961765#comment-16961765 ] Aman Omer commented on SPARK-29630: --- [~maropu] I think when we project only a constant in the subquery, Spark does not link the subquery's table/view. Hence the view _v8_temp_ is not dependent on _temp_table_. Please check the output for the following query. $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT * FROM temp_table); > Not allowed to create a permanent view by referencing a temporary view in > EXISTS > > > Key: SPARK-29630 > URL: https://issues.apache.org/jira/browse/SPARK-29630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > // In the master, the query below fails > $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * > FROM temp_table) t2; > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v7_temp` by referencing a temporary > view `temp_table`; > // In the master, the query below passed, but this should fail > $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM > temp_table); > Passed > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
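If that is the cause, the dependency check presumably walks the view's logical plan collecting referenced relations but does not descend into subquery expressions such as EXISTS. A toy sketch of the difference (plain Python; `Relation`, `Filter`, and `collect_relations` are hypothetical stand-ins, not Spark's analyzer classes):

```python
# Toy plan nodes: a leaf relation, and a filter whose condition may
# hold subquery expressions (EXISTS, IN, scalar subqueries).
class Relation:
    def __init__(self, name):
        self.name = name

class Filter:
    def __init__(self, child, subqueries=()):
        self.child, self.subqueries = child, subqueries

def collect_relations(plan, include_subqueries=True):
    """Collect every relation name the plan references. Skipping the
    subquery expressions is the hypothesized bug: a temp view used
    only inside EXISTS then goes unnoticed."""
    if isinstance(plan, Relation):
        return {plan.name}
    names = collect_relations(plan.child, include_subqueries)
    if include_subqueries:
        for sub in plan.subqueries:
            names |= collect_relations(sub, include_subqueries)
    return names

# SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM temp_table)
plan = Filter(Relation("base_table"), subqueries=(Relation("temp_table"),))
print(collect_relations(plan, include_subqueries=False))  # {'base_table'}
print(sorted(collect_relations(plan)))  # ['base_table', 'temp_table']
```

Only the traversal that descends into subquery expressions notices `temp_table`, which would explain why the EXISTS case slips past the permanent-view check.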
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961760#comment-16961760 ] huangtianhua commented on SPARK-29106: -- [~shaneknapp], hi, the ARM job has been running for some days; although there have been some failures due to the poor performance of the ARM instance or the network upgrade, the good news is that we got several successes. At least it means that Spark is supported on the ARM platform :) According to the discussion 'Deprecate Python < 3.6 in Spark 3.0', I think we can now run the Python tests only for Python 3.6, which would be simpler for us. So how about adding a new ARM job for the Python 3.6 tests? > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt > > > Add arm test jobs to amplab jenkins for spark. > Till now we made two arm test periodic jobs for spark in OpenLab, one is > based on master with hadoop 2.7(similar with QA test of amplab jenkins), > other one is based on a new branch which we made on date 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrate arm test with amplab > jenkins. > About the k8s test on arm, we have tested it, see > [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it > later. > And we plan to test on other stable branches too, and we can integrate them to > amplab when they are ready. 
> We have offered an arm instance and sent the infos to shane knapp, thanks > shane for adding the first arm job to amplab jenkins :) > The other important thing is about the leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > and we can see there is no arm64 support. So we built an arm64-supporting > release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like > 'property'/'profile' to choose the correct jar package on the arm or x86 platform, > because spark depends on some hadoop packages like hadoop-hdfs, and those packages > depend on leveldbjni-all-1.8 too, unless hadoop releases with a new arm-supporting > leveldbjni jar. For now we download the leveldbjni-all-1.8 of > openlabtesting and 'mvn install' it for use when testing spark on arm. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29628) Forcibly create a temporary view in CREATE VIEW if referencing a temporary view
[ https://issues.apache.org/jira/browse/SPARK-29628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961758#comment-16961758 ] Aman Omer commented on SPARK-29628: --- Thanks [~maropu] . I will look into this one. > Forcibly create a temporary view in CREATE VIEW if referencing a temporary > view > --- > > Key: SPARK-29628 > URL: https://issues.apache.org/jira/browse/SPARK-29628 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > {code} > CREATE TEMPORARY VIEW temp_table AS SELECT * FROM VALUES > (1, 1) as temp_table(a, id); > CREATE VIEW v1_temp AS SELECT * FROM temp_table; > // In Spark > org.apache.spark.sql.AnalysisException > Not allowed to create a permanent view `v1_temp` by referencing a temporary > view `temp_table`; > // In PostgreSQL > NOTICE: view "v1_temp" will be a temporary view > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29633) Make COLUMN optional in ALTER TABLE
Takeshi Yamamuro created SPARK-29633: Summary: Make COLUMN optional in ALTER TABLE Key: SPARK-29633 URL: https://issues.apache.org/jira/browse/SPARK-29633 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro In the master, the query below fails; {code} create table tt3 (ax bigint, b short, c decimal) using parquet; alter table tt3 rename c to d; {code} In PostgreSQL, COLUMN is optional; {code} ALTER TABLE [ IF EXISTS ] [ ONLY ] name [ * ] RENAME [ COLUMN ] column_name TO new_column_name {code} https://www.postgresql.org/docs/current/sql-altertable.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
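A toy illustration of what "making COLUMN optional" means for the statement's grammar. This is a regex sketch in plain Python, not Spark's ANTLR grammar; `parse_rename` is hypothetical:

```python
import re

# Toy sketch of the grammar change: the COLUMN keyword becomes
# optional in ALTER TABLE ... RENAME, matching PostgreSQL's
# RENAME [ COLUMN ] column_name TO new_column_name.
RENAME_RE = re.compile(
    r"ALTER\s+TABLE\s+(?P<table>\w+)\s+RENAME\s+(?:COLUMN\s+)?"
    r"(?P<old>\w+)\s+TO\s+(?P<new>\w+)\s*;?$",
    re.IGNORECASE,
)

def parse_rename(sql):
    m = RENAME_RE.match(sql.strip())
    return m.groupdict() if m else None

# Both spellings parse to the same result once COLUMN is optional.
print(parse_rename("alter table tt3 rename c to d"))
print(parse_rename("ALTER TABLE tt3 RENAME COLUMN c TO d;"))
```

In Spark itself the equivalent change would live in the SqlBase grammar, where the `COLUMN` token would be marked optional in the rename rule.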
[jira] [Updated] (SPARK-29632) Support ALTER TABLE [relname] SET SCHEMA [dbname]
[ https://issues.apache.org/jira/browse/SPARK-29632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29632: - Description: {code} CREATE SCHEMA temp_view_test; CREATE TABLE tx1 (x1 int, x2 int, x3 string) using parquet; ALTER TABLE tx1 SET SCHEMA temp_view_test; {code} {code} ALTER TABLE [ IF EXISTS ] name SET SCHEMA new_schema {code} https://www.postgresql.org/docs/current/sql-altertable.html was: {code} CREATE SCHEMA temp_view_test; CREATE TABLE tx1 (x1 int, x2 int, x3 string) using parquet; ALTER TABLE tx1 SET SCHEMA temp_view_test; {code} https://www.postgresql.org/docs/current/sql-altertable.html > Support ALTER TABLE [relname] SET SCHEMA [dbname] > - > > Key: SPARK-29632 > URL: https://issues.apache.org/jira/browse/SPARK-29632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > CREATE SCHEMA temp_view_test; > CREATE TABLE tx1 (x1 int, x2 int, x3 string) using parquet; > ALTER TABLE tx1 SET SCHEMA temp_view_test; > {code} > {code} > ALTER TABLE [ IF EXISTS ] name > SET SCHEMA new_schema > {code} > https://www.postgresql.org/docs/current/sql-altertable.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29632) Support ALTER TABLE [relname] SET SCHEMA [dbname]
[ https://issues.apache.org/jira/browse/SPARK-29632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29632: - Parent: SPARK-27764 Issue Type: Sub-task (was: Improvement) > Support ALTER TABLE [relname] SET SCHEMA [dbname] > - > > Key: SPARK-29632 > URL: https://issues.apache.org/jira/browse/SPARK-29632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > {code} > CREATE SCHEMA temp_view_test; > CREATE TABLE tx1 (x1 int, x2 int, x3 string) using parquet; > ALTER TABLE tx1 SET SCHEMA temp_view_test; > {code} > https://www.postgresql.org/docs/current/sql-altertable.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29632) Support ALTER TABLE [relname] SET SCHEMA [dbname]
Takeshi Yamamuro created SPARK-29632: Summary: Support ALTER TABLE [relname] SET SCHEMA [dbname] Key: SPARK-29632 URL: https://issues.apache.org/jira/browse/SPARK-29632 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro {code} CREATE SCHEMA temp_view_test; CREATE TABLE tx1 (x1 int, x2 int, x3 string) using parquet; ALTER TABLE tx1 SET SCHEMA temp_view_test; {code} https://www.postgresql.org/docs/current/sql-altertable.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29476) Add tooltip information for Thread Dump links and Thread details table columns in Executors Tab
[ https://issues.apache.org/jira/browse/SPARK-29476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961723#comment-16961723 ] jobit mathew commented on SPARK-29476: -- [~hyukjin.kwon], I think it is better to have tooltips in the Executors tab, especially for the Thread Dump link (a tooltip is already added for most of the other columns), to explain what it means, e.g. *thread dump for executors and drivers*. After clicking the Thread Dump link, the next page contains the *search* box and the *thread details table*. On this page we should also add a *tooltip* for *Search*, mentioning everything it will search, i.e. the content of the table including the stack trace details, and a *tooltip* for the *thread table column headings*, for better understanding. > Add tooltip information for Thread Dump links and Thread details table > columns in Executors Tab > --- > > Key: SPARK-29476 > URL: https://issues.apache.org/jira/browse/SPARK-29476 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29565) OneHotEncoder should support single-column input/output
[ https://issues.apache.org/jira/browse/SPARK-29565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-29565: --- Assignee: Huaxin Gao > OneHotEncoder should support single-column input/output > -- > > Key: SPARK-29565 > URL: https://issues.apache.org/jira/browse/SPARK-29565 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Minor > > Current feature algs > ({color:#5a6e5a}QuantileDiscretizer/Binarizer/Bucketizer/StringIndexer{color}) > are designed to support both single-col & multi-col. > And there are already some internal utils (like > {color:#c7a65d}checkSingleVsMultiColumnParams{color}) for this. > For OneHotEncoder, it's reasonable to support single-col. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29565) OneHotEncoder should support single-column input/output
[ https://issues.apache.org/jira/browse/SPARK-29565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-29565. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26265 [https://github.com/apache/spark/pull/26265] > OneHotEncoder should support single-column input/output > -- > > Key: SPARK-29565 > URL: https://issues.apache.org/jira/browse/SPARK-29565 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > Current feature algs > ({color:#5a6e5a}QuantileDiscretizer/Binarizer/Bucketizer/StringIndexer{color}) > are designed to support both single-col & multi-col. > And there are already some internal utils (like > {color:#c7a65d}checkSingleVsMultiColumnParams{color}) for this. > For OneHotEncoder, it's reasonable to support single-col. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29631) Support ANSI SQL CREATE SEQUENCE
Takeshi Yamamuro created SPARK-29631: Summary: Support ANSI SQL CREATE SEQUENCE Key: SPARK-29631 URL: https://issues.apache.org/jira/browse/SPARK-29631 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro {code} CREATE SEQUENCE seq1; CREATE TEMPORARY SEQUENCE seq1_temp; CREATE VIEW v9 AS SELECT seq1.is_called FROM seq1; CREATE VIEW v13_temp AS SELECT seq1_temp.is_called FROM seq1_temp; {code} https://www.postgresql.org/docs/current/sql-createsequence.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
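A minimal sketch of the sequence semantics being requested (plain Python, not Spark or PostgreSQL code): `nextval` advances and returns the value, `currval` is only defined after the first `nextval`, and the `is_called` flag referenced by the example views above is mimicked by a boolean.

```python
# Minimal sketch of ANSI-style sequence semantics.
class Sequence:
    def __init__(self, start=1, increment=1):
        self._next = start
        self._increment = increment
        self._current = None
        self.is_called = False   # mirrors PostgreSQL's is_called flag

    def nextval(self):
        self._current = self._next
        self._next += self._increment
        self.is_called = True
        return self._current

    def currval(self):
        if not self.is_called:
            raise RuntimeError("currval called before nextval")
        return self._current

seq1 = Sequence()
print(seq1.nextval(), seq1.nextval(), seq1.currval())  # 1 2 2
```

A TEMPORARY sequence would differ only in lifetime (dropped at session end), which is what makes the `v13_temp` view in the example a temporary-object dependency.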
[jira] [Comment Edited] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961672#comment-16961672 ] Brandon edited comment on SPARK-24918 at 10/29/19 6:09 AM: --- [~nsheth] placing the plugin class inside a jar and passing as `--jars` to spark-submit should be sufficient, right? It seems this is not enough to make the class visible to the executor. I have had to explicitly add this jar to `spark.executor.extraClassPath` for plugins to load correctly. was (Author: brandonvin): [~nsheth] placing the plugin class inside a jar and passing as `--jars` to spark-submit should sufficient, right? It seems this is not enough to make the class visible to the executor. I have had to explicitly add this jar to `spark.executor.extraClassPath` for plugins to load correctly. > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Assignee: Nihar Sheth >Priority: Major > Labels: SPIP, memory-analysis > Fix For: 2.4.0 > > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. It's hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). 
> I think one tricky part here is just deciding the api, and how its versioned. > Does it just get created when the executor starts, and thats it? Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29630) Not allowed to create a permanent view by referencing a temporary view in EXISTS
Takeshi Yamamuro created SPARK-29630: Summary: Not allowed to create a permanent view by referencing a temporary view in EXISTS Key: SPARK-29630 URL: https://issues.apache.org/jira/browse/SPARK-29630 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro {code} // In the master, the query below fails $ CREATE VIEW v7_temp AS SELECT t1.id, t2.a FROM base_table t1, (SELECT * FROM temp_table) t2; org.apache.spark.sql.AnalysisException Not allowed to create a permanent view `v7_temp` by referencing a temporary view `temp_table`; // In the master, the query below passed, but this should fail $ CREATE VIEW v8_temp AS SELECT * FROM base_table WHERE EXISTS (SELECT 1 FROM temp_table); Passed {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org