[jira] [Created] (SPARK-33066) Port docker integration tests to JDBC v2

2020-10-05 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33066:
--

 Summary: Port docker integration tests to JDBC v2
 Key: SPARK-33066
 URL: https://issues.apache.org/jira/browse/SPARK-33066
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Port existing Docker integration tests, such as
org.apache.spark.sql.jdbc.OracleIntegrationSuite, to JDBC v2.
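A rough sketch of what the JDBC v2 flavour of such a suite exercises (not part of the ticket; the catalog name, URL, and table below are placeholders):

{code:scala}
// Sketch only: catalog name, URL, and table are placeholders, not the real suite's values.
spark.conf.set("spark.sql.catalog.oracle",
  "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.oracle.url", "jdbc:oracle:thin:@//localhost:1521/xe")
spark.conf.set("spark.sql.catalog.oracle.driver", "oracle.jdbc.OracleDriver")

// Commands now go through the v2 Table Catalog instead of the v1 JDBC relation.
spark.sql("CREATE TABLE oracle.test.people (id INT, name STRING)")
spark.sql("SELECT * FROM oracle.test.people").show()
{code}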






[jira] [Commented] (SPARK-31430) Bug in the approximate quantile computation.

2020-10-05 Thread Vladimir (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207920#comment-17207920
 ] 

Vladimir commented on SPARK-31430:
--

Bug fixed in https://issues.apache.org/jira/browse/SPARK-32908

> Bug in the approximate quantile computation.
> 
>
> Key: SPARK-31430
> URL: https://issues.apache.org/jira/browse/SPARK-31430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Siddartha Naidu
>Priority: Major
> Attachments: approx_quantile_data.csv
>
>
> I am seeing a bug where passing a lower relative error to the
> {{approxQuantile}} function leads to incorrect results in the presence of
> partitions. Setting a relative error of 1e-6 causes it to compute equal values
> for the 0.9 and 1.0 quantiles. Coalescing back to 1 partition gives correct
> results. This issue was not present in Spark 2.4.5; we noticed it
> when testing 3.0.0-preview.
> {{>>> df = spark.read.csv('file:///tmp/approx_quantile_data.csv', 
> header=True, 
> schema=T.StructType([T.StructField('Store',T.StringType(),True),T.StructField('seconds',T.LongType(),True)]))}}
> {{>>> df = df.repartition(200, 'Store').localCheckpoint()}}
> {{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.0001)}}
> {{[1422576000.0, 1430352000.0, 1438300800.0]}}
> {{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.1)}}
> {{[1422576000.0, 1430524800.0, 1438300800.0]}}
> {color:#de350b}{{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 
> 0.01)}}{color}
> {color:#de350b}{{[1422576000.0, 1438300800.0, 1438300800.0]}}{color}
> {{>>> df.coalesce(1).approxQuantile('seconds', [0.8, 0.9, 1.0], 0.01)}}
> {{[1422576000.0, 1430524800.0, 1438300800.0]}}






[jira] [Created] (SPARK-33067) Add negative checks to JDBC v2 Table Catalog tests

2020-10-05 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33067:
--

 Summary: Add negative checks to JDBC v2 Table Catalog tests
 Key: SPARK-33067
 URL: https://issues.apache.org/jira/browse/SPARK-33067
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Add checks when JDBC v2 commands fail
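A minimal sketch of one such negative check, assuming a ScalaTest suite with a `sql` helper (the catalog and table names are placeholders, not the ticket's actual tests):

{code:scala}
// Hypothetical example of a "negative" check: the failing command should surface
// the expected error instead of silently succeeding.
import org.apache.spark.sql.AnalysisException

test("CREATE TABLE fails when the table already exists") {
  sql("CREATE TABLE h2.test.people (id INT, name STRING)")
  val e = intercept[AnalysisException] {
    sql("CREATE TABLE h2.test.people (id INT, name STRING)")
  }
  assert(e.getMessage.contains("already exists"))
}
{code}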






[jira] [Assigned] (SPARK-33067) Add negative checks to JDBC v2 Table Catalog tests

2020-10-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33067:


Assignee: (was: Apache Spark)

> Add negative checks to JDBC v2 Table Catalog tests
> --
>
> Key: SPARK-33067
> URL: https://issues.apache.org/jira/browse/SPARK-33067
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Add checks when JDBC v2 commands fail






[jira] [Assigned] (SPARK-33067) Add negative checks to JDBC v2 Table Catalog tests

2020-10-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33067:


Assignee: Apache Spark

> Add negative checks to JDBC v2 Table Catalog tests
> --
>
> Key: SPARK-33067
> URL: https://issues.apache.org/jira/browse/SPARK-33067
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Add checks when JDBC v2 commands fail






[jira] [Commented] (SPARK-33067) Add negative checks to JDBC v2 Table Catalog tests

2020-10-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207963#comment-17207963
 ] 

Apache Spark commented on SPARK-33067:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29945

> Add negative checks to JDBC v2 Table Catalog tests
> --
>
> Key: SPARK-33067
> URL: https://issues.apache.org/jira/browse/SPARK-33067
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Add checks when JDBC v2 commands fail






[jira] [Commented] (SPARK-33067) Add negative checks to JDBC v2 Table Catalog tests

2020-10-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207964#comment-17207964
 ] 

Apache Spark commented on SPARK-33067:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29945

> Add negative checks to JDBC v2 Table Catalog tests
> --
>
> Key: SPARK-33067
> URL: https://issues.apache.org/jira/browse/SPARK-33067
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Add checks when JDBC v2 commands fail






[jira] [Created] (SPARK-33068) Spark 2.3 vs Spark 1.6 collect_list giving different schema

2020-10-05 Thread Ayush Goyal (Jira)
Ayush Goyal created SPARK-33068:
---

 Summary: Spark 2.3 vs Spark 1.6 collect_list giving different 
schema
 Key: SPARK-33068
 URL: https://issues.apache.org/jira/browse/SPARK-33068
 Project: Spark
  Issue Type: IT Help
  Components: Spark Submit
Affects Versions: 2.3.4
Reporter: Ayush Goyal


Hi,

I am migrating from Spark 1.6 to Spark 2.3. However, collect_list is giving a
different schema.

 
{code:java}
val df_date_agg = df
.groupBy($"a",$"b",$"c")
.agg(sum($"d").alias("data1"),sum($"e").alias("data2"))
.groupBy($"a")
.agg(collect_list(array($"b",$"c",$"data1")).alias("final_data1"),
 collect_list(array($"b",$"c",$"data2")).alias("final_data2"))
{code}
When I run the above code in Spark 1.6, I get the schema below:
{code:java}
 |-- final_data1: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- final_data2: array (nullable = true)
 ||-- element: string (containsNull = true)
{code}

but in Spark 2.3 the schema changed to:
{code:java}
|-- final_data1: array (nullable = true)
 ||-- element: array (containsNull = true)
 |||-- element: string (containsNull = true)
 |-- final_data1: array (nullable = true)
 ||-- element: array (containsNull = true)
 |||-- element: string (containsNull = true)
{code}

In Spark 1.6, array($"b",$"c",$"data1") is converted to a string like this:
{code:java}
'[2020-09-26, Ayush, 103.67]'
{code}
In Spark 2.3 it is converted to a WrappedArray:
{code:java}
WrappedArray(2020-09-26, Ayush, 103.67)
{code}
I want to keep my schema as it is; otherwise, all the dependent code has to change.
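One possible workaround, offered only as a sketch and not taken from the report: serialize the inner array to a single string before collecting, so collect_list again yields array<string> (note the resulting string format differs from the 1.6 "[a, b, c]" rendering):

{code:scala}
// Workaround sketch: concat_ws flattens the inner array<string> into one string,
// so collect_list again produces array<string>, as it did in Spark 1.6.
import org.apache.spark.sql.functions._

val df_date_agg = df
  .groupBy($"a", $"b", $"c")
  .agg(sum($"d").alias("data1"), sum($"e").alias("data2"))
  .groupBy($"a")
  .agg(collect_list(concat_ws(",", array($"b", $"c", $"data1"))).alias("final_data1"),
       collect_list(concat_ws(",", array($"b", $"c", $"data2"))).alias("final_data2"))
{code}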

 

Thanks

 






[jira] [Created] (SPARK-33069) Skip test result report if no JUnit XML files are found

2020-10-05 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-33069:


 Summary: Skip test result report if no JUnit XML files are found
 Key: SPARK-33069
 URL: https://issues.apache.org/jira/browse/SPARK-33069
 Project: Spark
  Issue Type: Test
  Components: Project Infra
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


Currently, if no JUnit XML files are found, the test result report fails.
See also https://github.com/apache/spark/pull/29906#issuecomment-702525542






[jira] [Updated] (SPARK-33069) Skip test result report if no JUnit XML files are found

2020-10-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33069:
-
Parent: SPARK-32244
Issue Type: Sub-task  (was: Test)

> Skip test result report if no JUnit XML files are found
> ---
>
> Key: SPARK-33069
> URL: https://issues.apache.org/jira/browse/SPARK-33069
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, if no JUnit XML files are found, the test result report fails.
> See also https://github.com/apache/spark/pull/29906#issuecomment-702525542






[jira] [Assigned] (SPARK-33069) Skip test result report if no JUnit XML files are found

2020-10-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33069:


Assignee: (was: Apache Spark)

> Skip test result report if no JUnit XML files are found
> ---
>
> Key: SPARK-33069
> URL: https://issues.apache.org/jira/browse/SPARK-33069
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, if no JUnit XML files are found, the test result report fails.
> See also https://github.com/apache/spark/pull/29906#issuecomment-702525542






[jira] [Assigned] (SPARK-33069) Skip test result report if no JUnit XML files are found

2020-10-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33069:


Assignee: Apache Spark

> Skip test result report if no JUnit XML files are found
> ---
>
> Key: SPARK-33069
> URL: https://issues.apache.org/jira/browse/SPARK-33069
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Currently, if no JUnit XML files are found, the test result report fails.
> See also https://github.com/apache/spark/pull/29906#issuecomment-702525542






[jira] [Commented] (SPARK-33069) Skip test result report if no JUnit XML files are found

2020-10-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208001#comment-17208001
 ] 

Apache Spark commented on SPARK-33069:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29946

> Skip test result report if no JUnit XML files are found
> ---
>
> Key: SPARK-33069
> URL: https://issues.apache.org/jira/browse/SPARK-33069
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, if no JUnit XML files are found, the test result report fails.
> See also https://github.com/apache/spark/pull/29906#issuecomment-702525542






[jira] [Commented] (SPARK-33069) Skip test result report if no JUnit XML files are found

2020-10-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208003#comment-17208003
 ] 

Apache Spark commented on SPARK-33069:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29946

> Skip test result report if no JUnit XML files are found
> ---
>
> Key: SPARK-33069
> URL: https://issues.apache.org/jira/browse/SPARK-33069
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, if no JUnit XML files are found, the test result report fails.
> See also https://github.com/apache/spark/pull/29906#issuecomment-702525542






[jira] [Assigned] (SPARK-33042) Add a test case to ensure changes to spark.sql.optimizer.maxIterations take effect at runtime

2020-10-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33042:


Assignee: Yuning Zhang

> Add a test case to ensure changes to spark.sql.optimizer.maxIterations take 
> effect at runtime
> -
>
> Key: SPARK-33042
> URL: https://issues.apache.org/jira/browse/SPARK-33042
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuning Zhang
>Assignee: Yuning Zhang
>Priority: Major
>
> Add a test case to ensure changes to `spark.sql.optimizer.maxIterations`
> take effect at runtime.
> Currently, there is only one related test case:
> [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala#L156]
> However, this test case only checks that the value of the conf can be changed at
> runtime. It does not check that the updated value is actually used by the
> Optimizer.
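A minimal sketch of such a check, assuming a suite with SharedSparkSession and ScalaTest helpers (an illustration of the idea, not the actual patch):

{code:scala}
// Sketch: verify the Optimizer-facing accessor sees a value changed at runtime,
// not just that the conf entry itself can be set.
import org.apache.spark.sql.internal.SQLConf

test("spark.sql.optimizer.maxIterations takes effect at runtime") {
  spark.conf.set(SQLConf.OPTIMIZER_MAX_ITERATIONS.key, "100")
  assert(spark.sessionState.conf.optimizerMaxIterations === 100)

  spark.conf.set(SQLConf.OPTIMIZER_MAX_ITERATIONS.key, "5")
  assert(spark.sessionState.conf.optimizerMaxIterations === 5)
  // A stronger check would run a query that needs many rewrite iterations and
  // assert whether the "Max iterations (5) reached" warning is logged.
}
{code}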






[jira] [Resolved] (SPARK-33042) Add a test case to ensure changes to spark.sql.optimizer.maxIterations take effect at runtime

2020-10-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33042.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29919
[https://github.com/apache/spark/pull/29919]

> Add a test case to ensure changes to spark.sql.optimizer.maxIterations take 
> effect at runtime
> -
>
> Key: SPARK-33042
> URL: https://issues.apache.org/jira/browse/SPARK-33042
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuning Zhang
>Assignee: Yuning Zhang
>Priority: Major
> Fix For: 3.1.0
>
>
> Add a test case to ensure changes to `spark.sql.optimizer.maxIterations`
> take effect at runtime.
> Currently, there is only one related test case:
> [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala#L156]
> However, this test case only checks that the value of the conf can be changed at
> runtime. It does not check that the updated value is actually used by the
> Optimizer.






[jira] [Resolved] (SPARK-32914) Avoid calling dataType multiple times for each expression

2020-10-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32914.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29790
[https://github.com/apache/spark/pull/29790]

> Avoid calling dataType multiple times for each expression
> -
>
> Key: SPARK-32914
> URL: https://issues.apache.org/jira/browse/SPARK-32914
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> Some expressions' data types are not static values; they need to be calculated
> every time. For example:
> {code:scala}
> spark.range(1L).selectExpr("approx_count_distinct(case when id % 400 
> > 20 then id else 0 end)").show
> {code}
> Profile result:
> {noformat}
> -- Execution profile ---
> Total samples   : 18365
> Frame buffer usage  : 2.6688%
> --- 58443254327 ns (31.82%), 5844 samples
>   [ 0] GenericTaskQueueSet 131072u>, (MemoryType)1>::steal_best_of_2(unsigned int, int*, StarTask&)
>   [ 1] StealTask::do_it(GCTaskManager*, unsigned int)
>   [ 2] GCTaskThread::run()
>   [ 3] java_start(Thread*)
>   [ 4] start_thread
> --- 6140668667 ns (3.34%), 614 samples
>   [ 0] GenericTaskQueueSet 131072u>, (MemoryType)1>::peek()
>   [ 1] ParallelTaskTerminator::offer_termination(TerminatorTerminator*)
>   [ 2] StealTask::do_it(GCTaskManager*, unsigned int)
>   [ 3] GCTaskThread::run()
>   [ 4] java_start(Thread*)
>   [ 5] start_thread
> --- 5679994036 ns (3.09%), 568 samples
>   [ 0] scala.collection.generic.Growable.$plus$plus$eq
>   [ 1] scala.collection.generic.Growable.$plus$plus$eq$
>   [ 2] scala.collection.mutable.ListBuffer.$plus$plus$eq
>   [ 3] scala.collection.mutable.ListBuffer.$plus$plus$eq
>   [ 4] scala.collection.generic.GenericTraversableTemplate.$anonfun$flatten$1
>   [ 5] 
> scala.collection.generic.GenericTraversableTemplate$$Lambda$107.411506101.apply
>   [ 6] scala.collection.immutable.List.foreach
>   [ 7] scala.collection.generic.GenericTraversableTemplate.flatten
>   [ 8] scala.collection.generic.GenericTraversableTemplate.flatten$
>   [ 9] scala.collection.AbstractTraversable.flatten
>   [10] org.apache.spark.internal.config.ConfigEntry.readString
>   [11] org.apache.spark.internal.config.ConfigEntryWithDefault.readFrom
>   [12] org.apache.spark.sql.internal.SQLConf.getConf
>   [13] org.apache.spark.sql.internal.SQLConf.caseSensitiveAnalysis
>   [14] org.apache.spark.sql.types.DataType.sameType
>   [15] 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1
>   [16] 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1$adapted
>   [17] 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$$$Lambda$1527.1975399904.apply
>   [18] scala.collection.IndexedSeqOptimized.prefixLengthImpl
>   [19] scala.collection.IndexedSeqOptimized.forall
>   [20] scala.collection.IndexedSeqOptimized.forall$
>   [21] scala.collection.mutable.ArrayBuffer.forall
>   [22] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.haveSameType
>   [23] 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck
>   [24] 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$
>   [25] org.apache.spark.sql.catalyst.expressions.CaseWhen.dataTypeCheck
>   [26] 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType
>   [27] 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$
>   [28] org.apache.spark.sql.catalyst.expressions.CaseWhen.dataType
>   [29] 
> org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus.update
>   [30] 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2
>   [31] 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2$adapted
>   [32] 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$Lambda$1534.1383512673.apply
>   [33] 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateProcessRow$7
>   [34] 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateProcessRow$7$adapted
>   [35] 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$Lambda$1555.725788712.apply
>   [36] 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs
>   [37] 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.
>   [38] 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.$anonfun$doExecute$2
>   [39] 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.$anonfun$doExecute$2$adapted
>   [40
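The profile above shows CaseWhen.dataType (via dataTypeCheck, haveSameType, and a SQLConf read) being re-evaluated for every input row inside HyperLogLogPlusPlus.update. One way to avoid that repeated cost, sketched below with stand-in classes rather than Spark's actual expression hierarchy, is to compute the expensive part once and cache it:

{code:scala}
// Illustration only: these classes are stand-ins, not Spark's expression types.
import org.apache.spark.sql.types.{DataType, StringType}

class RecomputedEveryCall {
  private def expensiveTypeCheck(): Unit = { /* e.g. haveSameType over children */ }
  def dataType: DataType = { expensiveTypeCheck(); StringType }   // paid on every call
}

class ComputedOnce {
  private def expensiveTypeCheck(): Unit = { /* e.g. haveSameType over children */ }
  @transient private lazy val cached: DataType = { expensiveTypeCheck(); StringType }
  def dataType: DataType = cached                                  // paid once
}
{code}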

[jira] [Assigned] (SPARK-32914) Avoid calling dataType multiple times for each expression

2020-10-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32914:


Assignee: Yuming Wang

> Avoid calling dataType multiple times for each expression
> -
>
> Key: SPARK-32914
> URL: https://issues.apache.org/jira/browse/SPARK-32914
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> Some expressions' data types are not static values; they need to be calculated
> every time. For example:
> {code:scala}
> spark.range(1L).selectExpr("approx_count_distinct(case when id % 400 
> > 20 then id else 0 end)").show
> {code}
> Profile result:
> {noformat}
> -- Execution profile ---
> Total samples   : 18365
> Frame buffer usage  : 2.6688%
> --- 58443254327 ns (31.82%), 5844 samples
>   [ 0] GenericTaskQueueSet 131072u>, (MemoryType)1>::steal_best_of_2(unsigned int, int*, StarTask&)
>   [ 1] StealTask::do_it(GCTaskManager*, unsigned int)
>   [ 2] GCTaskThread::run()
>   [ 3] java_start(Thread*)
>   [ 4] start_thread
> --- 6140668667 ns (3.34%), 614 samples
>   [ 0] GenericTaskQueueSet 131072u>, (MemoryType)1>::peek()
>   [ 1] ParallelTaskTerminator::offer_termination(TerminatorTerminator*)
>   [ 2] StealTask::do_it(GCTaskManager*, unsigned int)
>   [ 3] GCTaskThread::run()
>   [ 4] java_start(Thread*)
>   [ 5] start_thread
> --- 5679994036 ns (3.09%), 568 samples
>   [ 0] scala.collection.generic.Growable.$plus$plus$eq
>   [ 1] scala.collection.generic.Growable.$plus$plus$eq$
>   [ 2] scala.collection.mutable.ListBuffer.$plus$plus$eq
>   [ 3] scala.collection.mutable.ListBuffer.$plus$plus$eq
>   [ 4] scala.collection.generic.GenericTraversableTemplate.$anonfun$flatten$1
>   [ 5] 
> scala.collection.generic.GenericTraversableTemplate$$Lambda$107.411506101.apply
>   [ 6] scala.collection.immutable.List.foreach
>   [ 7] scala.collection.generic.GenericTraversableTemplate.flatten
>   [ 8] scala.collection.generic.GenericTraversableTemplate.flatten$
>   [ 9] scala.collection.AbstractTraversable.flatten
>   [10] org.apache.spark.internal.config.ConfigEntry.readString
>   [11] org.apache.spark.internal.config.ConfigEntryWithDefault.readFrom
>   [12] org.apache.spark.sql.internal.SQLConf.getConf
>   [13] org.apache.spark.sql.internal.SQLConf.caseSensitiveAnalysis
>   [14] org.apache.spark.sql.types.DataType.sameType
>   [15] 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1
>   [16] 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1$adapted
>   [17] 
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$$$Lambda$1527.1975399904.apply
>   [18] scala.collection.IndexedSeqOptimized.prefixLengthImpl
>   [19] scala.collection.IndexedSeqOptimized.forall
>   [20] scala.collection.IndexedSeqOptimized.forall$
>   [21] scala.collection.mutable.ArrayBuffer.forall
>   [22] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.haveSameType
>   [23] 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck
>   [24] 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$
>   [25] org.apache.spark.sql.catalyst.expressions.CaseWhen.dataTypeCheck
>   [26] 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType
>   [27] 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$
>   [28] org.apache.spark.sql.catalyst.expressions.CaseWhen.dataType
>   [29] 
> org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus.update
>   [30] 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2
>   [31] 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2$adapted
>   [32] 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$Lambda$1534.1383512673.apply
>   [33] 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateProcessRow$7
>   [34] 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateProcessRow$7$adapted
>   [35] 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$Lambda$1555.725788712.apply
>   [36] 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs
>   [37] 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.
>   [38] 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.$anonfun$doExecute$2
>   [39] 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.$anonfun$doExecute$2$adapted
>   [40] 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$Lambda$1459.1481387816.apply
>   [41] org.apache.spark.rdd.RDD.$anon

[jira] [Resolved] (SPARK-33063) Improve error message for insufficient K8s volume confs

2020-10-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33063.
---
Fix Version/s: 3.1.0
 Assignee: German Schiavon Matteo  (was: Apache Spark)
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/29941

> Improve error message for insufficient K8s volume confs
> ---
>
> Key: SPARK-33063
> URL: https://issues.apache.org/jira/browse/SPARK-33063
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0, 3.0.1
>Reporter: German Schiavon Matteo
>Assignee: German Schiavon Matteo
>Priority: Minor
> Fix For: 3.1.0
>
>
> Provide error handling when creating K8s volumes, and clearer error messages.
> For example, when creating a *hostPath* volume, if you don't specify
> {code:java}
> hostPath.volumeName.options.path
> {code}
> it fails with a
> {code:java}
> key not found error
> {code}
> which makes it clear that you are missing a key, but I couldn't find anywhere in
> the docs that says you need to specify it.
> To reproduce the issue, run a spark-submit command like this, for example:
> {code:java}
> ./bin/spark-submit \
> --master k8s://https://127.0.0.1:32768 \
> --deploy-mode cluster \
> --name spark-app\
> --class class \
> --conf spark.kubernetes.driver.volumes.hostPath.spark.mount.path=/tmp/jars/ \
> --conf spark.kubernetes.executor.volumes.hostPath.spark.mount.path=/tmp/jars/ 
> \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> --conf spark.kubernetes.container.image=spark:latest \
>  local:///opt/spark/examples/jars/app.jar
> {code}
>  
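A sketch of the kind of check and message being asked for, using a hypothetical helper rather than Spark's actual volume-parsing code:

{code:scala}
// Hypothetical helper: fail fast with a message that names the missing conf key
// instead of a bare "key not found".
def requiredHostPathOption(volumeName: String, options: Map[String, String]): String =
  options.getOrElse("path",
    throw new IllegalArgumentException(
      s"hostPath volume '$volumeName' is missing the required conf " +
      s"spark.kubernetes.{driver|executor}.volumes.hostPath.$volumeName.options.path"))
{code}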






[jira] [Updated] (SPARK-33063) Improve error message for insufficient K8s volume confs

2020-10-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33063:
--
Affects Version/s: (was: 3.0.1)
   (was: 3.0.0)
   3.1.0

> Improve error message for insufficient K8s volume confs
> ---
>
> Key: SPARK-33063
> URL: https://issues.apache.org/jira/browse/SPARK-33063
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: German Schiavon Matteo
>Assignee: German Schiavon Matteo
>Priority: Minor
> Fix For: 3.1.0
>
>
> Provide error handling when creating K8s volumes, and clearer error messages.
> For example, when creating a *hostPath* volume, if you don't specify
> {code:java}
> hostPath.volumeName.options.path
> {code}
> it fails with a
> {code:java}
> key not found error
> {code}
> which makes it clear that you are missing a key, but I couldn't find anywhere in
> the docs that says you need to specify it.
> To reproduce the issue, run a spark-submit command like this, for example:
> {code:java}
> ./bin/spark-submit \
> --master k8s://https://127.0.0.1:32768 \
> --deploy-mode cluster \
> --name spark-app\
> --class class \
> --conf spark.kubernetes.driver.volumes.hostPath.spark.mount.path=/tmp/jars/ \
> --conf spark.kubernetes.executor.volumes.hostPath.spark.mount.path=/tmp/jars/ 
> \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> --conf spark.kubernetes.container.image=spark:latest \
>  local:///opt/spark/examples/jars/app.jar
> {code}
>  






[jira] [Updated] (SPARK-33070) Optimizer rules for SimpleHigherOrderFunction

2020-10-05 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-33070:
---
Priority: Minor  (was: Major)

> Optimizer rules for SimpleHigherOrderFunction
> -
>
> Key: SPARK-33070
> URL: https://issues.apache.org/jira/browse/SPARK-33070
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Priority: Minor
>
> SimpleHigherOrderFunctions like ArrayTransform, ArrayFilter, etc., can be
> combined and reordered to achieve a more optimal plan.
> Possible rules:
> * Combine 2 consecutive array transforms
> * Combine 2 consecutive array filters
> * Push array filter through array sort
> * Remove array sort before array exists and array forall.
> * Combine 2 consecutive map filters






[jira] [Created] (SPARK-33070) Optimizer rules for SimpleHigherOrderFunction

2020-10-05 Thread Tanel Kiis (Jira)
Tanel Kiis created SPARK-33070:
--

 Summary: Optimizer rules for SimpleHigherOrderFunction
 Key: SPARK-33070
 URL: https://issues.apache.org/jira/browse/SPARK-33070
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Tanel Kiis


SimpleHigherOrderFunctions like ArrayTransform, ArrayFilter, etc., can be
combined and reordered to achieve a more optimal plan; the first rule is
sketched after the list below.

Possible rules:
* Combine 2 consecutive array transforms
* Combine 2 consecutive array filters
* Push array filter through array sort
* Remove array sort before array exists and array forall.
* Combine 2 consecutive map filters
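As an illustration of the first rule (a sketch with a made-up column name; the Column-based transform() requires Spark 3.x), two consecutive array transforms can be fused into one:

{code:scala}
// Sketch only: df and the "xs" array column are made up for the example.
// The chained form walks the array twice; the fused form walks it once.
import org.apache.spark.sql.functions._

val chained = df.select(transform(transform(col("xs"), x => x + 1), x => x * 2).as("ys"))
val fused   = df.select(transform(col("xs"), x => (x + 1) * 2).as("ys"))  // same result
{code}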







[jira] [Assigned] (SPARK-33038) AQE plan string should only display one plan when the initial and the current plan are the same

2020-10-05 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-33038:
---

Assignee: Allison Wang

> AQE plan string should only display one plan when the initial and the current 
> plan are the same
> ---
>
> Key: SPARK-33038
> URL: https://issues.apache.org/jira/browse/SPARK-33038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Minor
>
> Currently, the AQE plan string displays both the initial plan and the current 
> or the final plan. This can be redundant when the initial plan and the 
> current physical plan are exactly the same. For instance, the `EXPLAIN` 
> command will not actually execute the query, and thus the plan string will 
> never change, but currently, the plan string still shows both the current and 
> the initial plan:
>  
> {code:java}
> AdaptiveSparkPlan (8)
> +- == Current Plan ==
>Sort (7)
>+- Exchange (6)
>   +- HashAggregate (5)
>  +- Exchange (4)
> +- HashAggregate (3)
>+- Filter (2)
>   +- Scan parquet default.explain_temp1 (1)
> +- == Initial Plan ==
>Sort (7)
>+- Exchange (6)
>   +- HashAggregate (5)
>  +- Exchange (4)
> +- HashAggregate (3)
>+- Filter (2)
>   +- Scan parquet default.explain_temp1 (1)
> {code}
> When the initial and the current plan are the same, there should be only one 
> plan string displayed. For example
> {code:java}
> AdaptiveSparkPlan (8)
> +- Sort (7)
>+- Exchange (6)
>   +- HashAggregate (5)
>  +- Exchange (4)
> +- HashAggregate (3)
>+- Filter (2)
>   +- Scan parquet default.explain_temp1 (1){code}
>  






[jira] [Resolved] (SPARK-33038) AQE plan string should only display one plan when the initial and the current plan are the same

2020-10-05 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-33038.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

> AQE plan string should only display one plan when the initial and the current 
> plan are the same
> ---
>
> Key: SPARK-33038
> URL: https://issues.apache.org/jira/browse/SPARK-33038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently, the AQE plan string displays both the initial plan and the current 
> or the final plan. This can be redundant when the initial plan and the 
> current physical plan are exactly the same. For instance, the `EXPLAIN` 
> command will not actually execute the query, and thus the plan string will 
> never change, but currently, the plan string still shows both the current and 
> the initial plan:
>  
> {code:java}
> AdaptiveSparkPlan (8)
> +- == Current Plan ==
>Sort (7)
>+- Exchange (6)
>   +- HashAggregate (5)
>  +- Exchange (4)
> +- HashAggregate (3)
>+- Filter (2)
>   +- Scan parquet default.explain_temp1 (1)
> +- == Initial Plan ==
>Sort (7)
>+- Exchange (6)
>   +- HashAggregate (5)
>  +- Exchange (4)
> +- HashAggregate (3)
>+- Filter (2)
>   +- Scan parquet default.explain_temp1 (1)
> {code}
> When the initial and the current plan are the same, there should be only one 
> plan string displayed. For example
> {code:java}
> AdaptiveSparkPlan (8)
> +- Sort (7)
>+- Exchange (6)
>   +- HashAggregate (5)
>  +- Exchange (4)
> +- HashAggregate (3)
>+- Filter (2)
>   +- Scan parquet default.explain_temp1 (1){code}
>  






[jira] [Commented] (SPARK-32067) Use unique ConfigMap name for executor pod template

2020-10-05 Thread James Yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208224#comment-17208224
 ] 

James Yu commented on SPARK-32067:
--

Hey, [~dongjoon], I noticed that you added 3.1.0 to the `Affects Version/s`
of this JIRA, but at this point 3.1.0 is not released yet. Did you mean to
set the `Fix Version/s` to 3.1.0, and it was just a typo?

> Use unique ConfigMap name for executor pod template
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: James Yu
>Priority: Major
>
> THE BUG:
> The bug is reproducible by spark-submitting two different apps (app1 and app2)
> with different executor pod templates (e.g., different labels) to K8s
> sequentially, with app2 launching while app1 is still in the middle of
> ramping up all its executor pods. The unwanted result is that some launched
> executor pods of app1 end up having app2's executor pod template applied to
> them.
> The root cause appears to be that app1's podspec-configmap got overwritten by
> app2 during the overlapping launching periods, because both apps use the same
> ConfigMap (name). This causes some of app1's executor pods that are ramped up
> after app2 is launched to be inadvertently launched with app2's pod template.
> The issue can be seen as follows:
> First, after submitting app1, you get these configmaps:
> {code:java}
> NAMESPACE   NAME                     DATA   AGE
> default  app1--driver-conf-map  1   9m46s
> default  podspec-configmap  1   12m{code}
> Then submit app2 while app1 is still ramping up its executors. The
> podspec-configmap is modified by app2.
> {code:java}
> NAMESPACE   NAME                     DATA   AGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  podspec-configmap  1   13m57s{code}
>  
> PROPOSED SOLUTION:
> Properly prefix the podspec-configmap for each submitted app, ideally the 
> same way as the driver configmap:
> {code:java}
> NAMESPACE   NAME                     DATA   AGE
> default  app1--driver-conf-map  1   11m43s
> default  app1--podspec-configmap1   13m57s
> default  app2--driver-conf-map  1   10s 
> default  app2--podspec-configmap1   3m{code}
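A sketch of the naming idea only, with a hypothetical helper rather than Spark's actual code:

{code:scala}
// Hypothetical sketch of the proposed naming: derive the pod template ConfigMap
// name from the per-app resource-name prefix, mirroring the driver conf map naming.
def executorPodTemplateConfigMapName(resourceNamePrefix: String): String =
  s"$resourceNamePrefix-podspec-configmap"
{code}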






[jira] [Comment Edited] (SPARK-32067) Use unique ConfigMap name for executor pod template

2020-10-05 Thread James Yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208224#comment-17208224
 ] 

James Yu edited comment on SPARK-32067 at 10/5/20, 6:02 PM:


Hey, [~dongjoon], I noticed that you added 3.1.0 to the `Affects Version/s`
of this JIRA, but at this point 3.1.0 is not released yet. Did you mean to
set the `Fix Version/s` to 3.1.0, and it was just a typo? Or did you expect
that this fix would not go into 3.1.0?


was (Author: james...@ymail.com):
Hey, [~dongjoon], I noticed that you added 3.1.0 to the `Affects Version/s`
of this JIRA, but at this point 3.1.0 is not released yet. Did you mean to
set the `Fix Version/s` to 3.1.0, and it was just a typo?

> Use unique ConfigMap name for executor pod template
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: James Yu
>Priority: Major
>
> THE BUG:
> The bug is reproducible by spark-submitting two different apps (app1 and app2)
> with different executor pod templates (e.g., different labels) to K8s
> sequentially, with app2 launching while app1 is still in the middle of
> ramping up all its executor pods. The unwanted result is that some launched
> executor pods of app1 end up having app2's executor pod template applied to
> them.
> The root cause appears to be that app1's podspec-configmap got overwritten by
> app2 during the overlapping launching periods, because both apps use the same
> ConfigMap (name). This causes some of app1's executor pods that are ramped up
> after app2 is launched to be inadvertently launched with app2's pod template.
> The issue can be seen as follows:
> First, after submitting app1, you get these configmaps:
> {code:java}
> NAMESPACE   NAME                     DATA   AGE
> default  app1--driver-conf-map  1   9m46s
> default  podspec-configmap  1   12m{code}
> Then submit app2 while app1 is still ramping up its executors. The
> podspec-configmap is modified by app2.
> {code:java}
> NAMESPACE   NAME                     DATA   AGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  podspec-configmap  1   13m57s{code}
>  
> PROPOSED SOLUTION:
> Properly prefix the podspec-configmap for each submitted app, ideally the 
> same way as the driver configmap:
> {code:java}
> NAMESPACE   NAME                     DATA   AGE
> default  app1--driver-conf-map  1   11m43s
> default  app1--podspec-configmap1   13m57s
> default  app2--driver-conf-map  1   10s 
> default  app2--podspec-configmap1   3m{code}






[jira] [Comment Edited] (SPARK-32067) Use unique ConfigMap name for executor pod template

2020-10-05 Thread James Yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208224#comment-17208224
 ] 

James Yu edited comment on SPARK-32067 at 10/5/20, 6:04 PM:


Hey, [~dongjoon], I noticed that you added 3.1.0 to the `Affects Version/s`
of this JIRA, but at this point 3.1.0 is not released yet. Did you mean to
set the `Fix Version/s` to 3.1.0, and it was just a typo? Or did you expect
that this fix would not go into 3.1.0? I hope this bug can be fixed and released
as early as possible; otherwise, as [~sdehaes] said above, the pod template
feature is useless to us.


was (Author: james...@ymail.com):
Hey, [~dongjoon], I noticed that you added 3.1.0 to the `Affects Version/s`
of this JIRA, but at this point 3.1.0 is not released yet. Did you mean to
set the `Fix Version/s` to 3.1.0, and it was just a typo? Or did you expect
that this fix would not go into 3.1.0?

> Use unique ConfigMap name for executor pod template
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: James Yu
>Priority: Major
>
> THE BUG:
> The bug is reproducible by spark-submitting two different apps (app1 and app2)
> with different executor pod templates (e.g., different labels) to K8s
> sequentially, with app2 launching while app1 is still in the middle of
> ramping up all its executor pods. The unwanted result is that some launched
> executor pods of app1 end up having app2's executor pod template applied to
> them.
> The root cause appears to be that app1's podspec-configmap got overwritten by
> app2 during the overlapping launching periods, because both apps use the same
> ConfigMap (name). This causes some of app1's executor pods that are ramped up
> after app2 is launched to be inadvertently launched with app2's pod template.
> The issue can be seen as follows:
> First, after submitting app1, you get these configmaps:
> {code:java}
> NAMESPACE   NAME                     DATA   AGE
> default  app1--driver-conf-map  1   9m46s
> default  podspec-configmap  1   12m{code}
> Then submit app2 while app1 is still ramping up its executors. The
> podspec-configmap is modified by app2.
> {code:java}
> NAMESPACE   NAME                     DATA   AGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  podspec-configmap  1   13m57s{code}
>  
> PROPOSED SOLUTION:
> Properly prefix the podspec-configmap for each submitted app, ideally the 
> same way as the driver configmap:
> {code:java}
> NAMESPACE   NAME                     DATA   AGE
> default  app1--driver-conf-map  1   11m43s
> default  app1--podspec-configmap1   13m57s
> default  app2--driver-conf-map  1   10s 
> default  app2--podspec-configmap1   3m{code}






[jira] [Updated] (SPARK-33070) Optimizer rules for collection datatypes and SimpleHigherOrderFunction

2020-10-05 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-33070:
---
Summary: Optimizer rules for collection datatypes and 
SimpleHigherOrderFunction  (was: Optimizer rules for SimpleHigherOrderFunction)

> Optimizer rules for collection datatypes and SimpleHigherOrderFunction
> --
>
> Key: SPARK-33070
> URL: https://issues.apache.org/jira/browse/SPARK-33070
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Priority: Minor
>
> SimpleHigherOrderFunctions like ArrayTransform, ArrayFilter, etc., can be
> combined and reordered to achieve a more optimal plan.
> Possible rules:
> * Combine 2 consecutive array transforms
> * Combine 2 consecutive array filters
> * Push array filter through array sort
> * Remove array sort before array exists and array forall.
> * Combine 2 consecutive map filters






[jira] [Comment Edited] (SPARK-32067) Use unique ConfigMap name for executor pod template

2020-10-05 Thread James Yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208224#comment-17208224
 ] 

James Yu edited comment on SPARK-32067 at 10/5/20, 6:29 PM:


Hey, [~dongjoon], I noticed that you added 3.1.0 to the `Affects Version/s`
of this JIRA, but at this point 3.1.0 is not released yet. Did you mean to
set the `Fix Version/s` to 3.1.0, and it was just a typo? Or did you expect
that this fix would not go into 3.1.0, so the bug will still affect 3.1.0? I hope
this bug can be fixed and released as early as possible; otherwise, as
[~sdehaes] said above, the pod template feature is useless to us.


was (Author: james...@ymail.com):
Hey, [~dongjoon], I noticed that you added 3.1.0 to the `Affects Version/s`
of this JIRA, but at this point 3.1.0 is not released yet. Did you mean to
set the `Fix Version/s` to 3.1.0, and it was just a typo? Or did you expect
that this fix would not go into 3.1.0? I hope this bug can be fixed and released
as early as possible; otherwise, as [~sdehaes] said above, the pod template
feature is useless to us.

> Use unique ConfigMap name for executor pod template
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: James Yu
>Priority: Major
>
> THE BUG:
> The bug is reproducible by spark-submitting two different apps (app1 and app2)
> with different executor pod templates (e.g., different labels) to K8s
> sequentially, with app2 launching while app1 is still in the middle of
> ramping up all its executor pods. The unwanted result is that some launched
> executor pods of app1 end up having app2's executor pod template applied to
> them.
> The root cause appears to be that app1's podspec-configmap got overwritten by
> app2 during the overlapping launching periods, because both apps use the same
> ConfigMap (name). This causes some of app1's executor pods that are ramped up
> after app2 is launched to be inadvertently launched with app2's pod template.
> The issue can be seen as follows:
> First, after submitting app1, you get these configmaps:
> {code:java}
> NAMESPACE   NAME                     DATA   AGE
> default  app1--driver-conf-map  1   9m46s
> default  podspec-configmap  1   12m{code}
> Then submit app2 while app1 is still ramping up its executors. The
> podspec-configmap is modified by app2.
> {code:java}
> NAMESPACE   NAME                     DATA   AGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  podspec-configmap  1   13m57s{code}
>  
> PROPOSED SOLUTION:
> Properly prefix the podspec-configmap for each submitted app, ideally the 
> same way as the driver configmap:
> {code:java}
> NAMESPACE   NAME                     DATA   AGE
> default  app1--driver-conf-map  1   11m43s
> default  app1--podspec-configmap1   13m57s
> default  app2--driver-conf-map  1   10s 
> default  app2--podspec-configmap1   3m{code}






[jira] [Commented] (SPARK-30201) HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT

2020-10-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208268#comment-17208268
 ] 

Dongjoon Hyun commented on SPARK-30201:
---

Hi, [~ulysses] and [~cloud_fan]. Is this only critical on 3.0.0?

> HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
> 
>
> Key: SPARK-30201
> URL: https://issues.apache.org/jira/browse/SPARK-30201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Critical
> Fix For: 3.0.0
>
>
> Now Spark uses `ObjectInspectorCopyOption.JAVA` as the OI option, which
> converts any string to a UTF-8 string. When writing non-UTF-8 encoded data,
> `EFBFBD` appears.
> We should use `ObjectInspectorCopyOption.DEFAULT` to pass the bytes through.
> Here is the way to reproduce (also sketched as code below):
> 1. make a file containing the hex bytes 'AABBCC', which is not valid UTF-8.
> 2. create table test1 (c string) location '$file_path';
> 3. select hex(c) from test1; // AABBCC
> 4. create table test2 (c string) as select c from test1;
> 5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD
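The same steps expressed as Spark SQL calls (a sketch; the LOCATION path is a placeholder for a directory containing the non-UTF-8 file):

{code:scala}
// Sketch of the reproduction above; output comments show the reported behaviour.
spark.sql("CREATE TABLE test1 (c STRING) LOCATION '/tmp/non_utf8_data'")
spark.sql("SELECT hex(c) FROM test1").show()    // AABBCC
spark.sql("CREATE TABLE test2 (c STRING) AS SELECT c FROM test1")
spark.sql("SELECT hex(c) FROM test2").show()    // EFBFBDEFBFBDEFBFBD before the fix
{code}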






[jira] [Created] (SPARK-33071) Join with ambiguous column succeeding but giving wrong output

2020-10-05 Thread George (Jira)
George created SPARK-33071:
--

 Summary: Join with ambiguous column succeeding but giving wrong 
output
 Key: SPARK-33071
 URL: https://issues.apache.org/jira/browse/SPARK-33071
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 2.4.4
Reporter: George


When joining two datasets where one column in each dataset is sourced from the 
same input dataset, the join successfully runs, but does not select the correct 
columns, leading to incorrect output.

Repro using pyspark:
{code:java}
sc.version
import pyspark.sql.functions as F
d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 'units' : 
4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 'sales': 1, 'units' : 
2}, {'key': 'd', 'sales': 3, 'units' : 6}]
input_df = spark.createDataFrame(d)
df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
df1 = df1.filter(F.col("key") != F.lit("c"))
df2 = df2.filter(F.col("key") != F.lit("d"))
ret = df1.join(df2, df1.key == df2.key, "full").select(
df1["key"].alias("df1_key"),
df2["key"].alias("df2_key"),
df1["sales"],
df2["units"],
F.coalesce(df1["key"], df2["key"]).alias("key"))
ret.show()
ret.explain(){code}
output for 2.4.4:
{code:java}
>>> sc.version
u'2.4.4'
>>> import pyspark.sql.functions as F
>>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 
>>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 'sales': 
>>> 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
>>> input_df = spark.createDataFrame(d)
>>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
>>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
>>> df1 = df1.filter(F.col("key") != F.lit("c"))
>>> df2 = df2.filter(F.col("key") != F.lit("d"))
>>> ret = df1.join(df2, df1.key == df2.key, "full").select(
... df1["key"].alias("df1_key"),
... df2["key"].alias("df2_key"),
... df1["sales"],
... df2["units"],
... F.coalesce(df1["key"], df2["key"]).alias("key"))
20/10/05 15:46:14 WARN Column: Constructing trivially true equals predicate, 
'key#213 = key#213'. Perhaps you need to use aliases.
>>> ret.show()
+---+---+-+-++
|df1_key|df2_key|sales|units| key|
+---+---+-+-++
|  d|  d|3| null|   d|
|   null|   null| null|2|null|
|  b|  b|5|   10|   b|
|  a|  a|3|6|   a|
+---+---+-+-++>>> ret.explain()
== Physical Plan ==
*(5) Project [key#213 AS df1_key#258, key#213 AS df2_key#259, sales#223L, 
units#230L, coalesce(key#213, key#213) AS key#260]
+- SortMergeJoin [key#213], [key#237], FullOuter
   :- *(2) Sort [key#213 ASC NULLS FIRST], false, 0
   :  +- *(2) HashAggregate(keys=[key#213], functions=[sum(sales#214L)])
   : +- Exchange hashpartitioning(key#213, 200)
   :+- *(1) HashAggregate(keys=[key#213], 
functions=[partial_sum(sales#214L)])
   :   +- *(1) Project [key#213, sales#214L]
   :  +- *(1) Filter (isnotnull(key#213) && NOT (key#213 = c))
   : +- Scan ExistingRDD[key#213,sales#214L,units#215L]
   +- *(4) Sort [key#237 ASC NULLS FIRST], false, 0
  +- *(4) HashAggregate(keys=[key#237], functions=[sum(units#239L)])
 +- Exchange hashpartitioning(key#237, 200)
+- *(3) HashAggregate(keys=[key#237], 
functions=[partial_sum(units#239L)])
   +- *(3) Project [key#237, units#239L]
  +- *(3) Filter (isnotnull(key#237) && NOT (key#237 = d))
 +- Scan ExistingRDD[key#237,sales#238L,units#239L]
{code}
output for 3.0.1:


{code:java}
>>> sc.version
u'3.0.1'
>>> import pyspark.sql.functions as F
>>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 
>>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 'sales': 
>>> 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
>>> input_df = spark.createDataFrame(d)
/usr/local/lib/python2.7/site-packages/pyspark/sql/session.py:381: UserWarning: 
inferring schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
>>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
>>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
>>> df1 = df1.filter(F.col("key") != F.lit("c"))
>>> df2 = df2.filter(F.col("key") != F.lit("d"))
>>> ret = df1.join(df2, df1.key == df2.key, "full").select(
... df1["key"].alias("df1_key"),
... df2["key"].alias("df2_key"),
... df1["sales"],
... df2["units"],
... F.coalesce(df1["key"], df2["key"]).alias("key"))
>>> ret.show()
+---+---+-+-++
|df1_key|df2_key|sales|units| key|
+---+---+-+-++
|  d|  d|3| null|   d|
|   null|   null| null|2|null|
|  b|  b|5|   10|   b|
|  

[jira] [Assigned] (SPARK-32793) Expose assert_true in Python/Scala APIs and add error message parameter

2020-10-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32793:


Assignee: Apache Spark

> Expose assert_true in Python/Scala APIs and add error message parameter
> ---
>
> Key: SPARK-32793
> URL: https://issues.apache.org/jira/browse/SPARK-32793
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karen Feng
>Assignee: Apache Spark
>Priority: Minor
>
> # Add RAISEERROR() (or RAISE_ERROR()) to the API
>  # Add Scala/Python/R versions of the API for ASSERT_TRUE()
>  # Add an extra parameter to ASSERT_TRUE() as (cond, message), where the 
> `message` parameter is lazily evaluated only when the condition is not true
>  # Change the implementation of ASSERT_TRUE() so that it is rewritten to IF() 
> during optimization instead (a PySpark sketch of the proposed API follows below).
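A rough PySpark sketch of how the proposed functions could be used once exposed. The
names follow the proposal above; this is an assumption about the eventual API, not code
from the ticket, and it presumes an active `spark` session:

{code:python}
import pyspark.sql.functions as F

df = spark.range(10)

# assert_true(cond, message): rows pass through when the condition holds; the message
# would only be evaluated when the condition fails.
df.select(F.assert_true(F.col("id") < 100, "id must be below 100")).collect()

# raise_error(message): unconditionally raises an error with the given message.
# df.select(F.raise_error("something went wrong")).collect()
{code}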



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32793) Expose assert_true in Python/Scala APIs and add error message parameter

2020-10-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32793:


Assignee: (was: Apache Spark)

> Expose assert_true in Python/Scala APIs and add error message parameter
> ---
>
> Key: SPARK-32793
> URL: https://issues.apache.org/jira/browse/SPARK-32793
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karen Feng
>Priority: Minor
>
> # Add RAISEERROR() (or RAISE_ERROR()) to the API
>  # Add Scala/Python/R versions of the API for ASSERT_TRUE()
>  # Add an extra parameter to ASSERT_TRUE() as (cond, message), where the 
> `message` parameter is lazily evaluated only when the condition is not true
>  # Change the implementation of ASSERT_TRUE() so that it is rewritten to IF() 
> during optimization instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32793) Expose assert_true in Python/Scala APIs and add error message parameter

2020-10-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208346#comment-17208346
 ] 

Apache Spark commented on SPARK-32793:
--

User 'karenfeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/29947

> Expose assert_true in Python/Scala APIs and add error message parameter
> ---
>
> Key: SPARK-32793
> URL: https://issues.apache.org/jira/browse/SPARK-32793
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karen Feng
>Priority: Minor
>
> # Add RAISEERROR() (or RAISE_ERROR()) to the API
>  # Add Scala/Python/R versions of the API for ASSERT_TRUE()
>  # Add an extra parameter to ASSERT_TRUE() as (cond, message), where the 
> `message` parameter is lazily evaluated only when the condition is not true
>  # Change the implementation of ASSERT_TRUE() so that it is rewritten to IF() 
> during optimization instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32793) Expose assert_true in Python/Scala APIs and add error message parameter

2020-10-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208347#comment-17208347
 ] 

Apache Spark commented on SPARK-32793:
--

User 'karenfeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/29947

> Expose assert_true in Python/Scala APIs and add error message parameter
> ---
>
> Key: SPARK-32793
> URL: https://issues.apache.org/jira/browse/SPARK-32793
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karen Feng
>Priority: Minor
>
> # Add RAISEERROR() (or RAISE_ERROR()) to the API
>  # Add Scala/Python/R versions of the API for ASSERT_TRUE()
>  # Add an extra parameter to ASSERT_TRUE() as (cond, message), where the 
> `message` parameter is lazily evaluated only when the condition is not true
>  # Change the implementation of ASSERT_TRUE() so that it is rewritten to IF() 
> during optimization instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32067) Use unique ConfigMap name for executor pod template

2020-10-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208372#comment-17208372
 ] 

Dongjoon Hyun edited comment on SPARK-32067 at 10/5/20, 10:16 PM:
--

[~james...@ymail.com]. It's used when `master` branch is affected in order to 
distinguish the following.
- `Affected Version = the version of master` means this bug exists in the 
`master` branch, or this new feature is added in the `master` branch for the 
next release.
- `Affected Version = 3.0.2 and not 3.1.0` means this bug exists only in 
`branch-3.0`. In the master branch, it doesn't exist due to other improvements 
or fixes.


was (Author: dongjoon):
[~james...@ymail.com]. It's used when `master` branch is affected to 
distinguish the following.
- `Affected Version = the version of master` means this bug exists in `master` 
branch or this new feature is added in `master` branch for next release.
- `Affected Version = 3.0.2 and not 3.1.0` means this bug exists only at 
`branch-3.0`. In master branch, this doesn't exist due to the another 
improvement or fixes.

> Use unique ConfigMap name for executor pod template
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: James Yu
>Priority: Major
>
> THE BUG:
> The bug is reproducible by spark-submit two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially,  with app2 launching while app1 is still in the middle of 
> ramping up all its executor pods. The unwanted result is that some launched 
> executor pods of app1 end up having app2's executor pod template applied to 
> them.
> The root cause appears to be that app1's podspec-configmap got overwritten by 
> app2 during the overlapping launching periods because both apps use the same 
> ConfigMap (name). This causes some app1's executor pods being ramped up after 
> app2 is launched to be inadvertently launched with the app2's pod template. 
> The issue can be seen as follows:
> First, after submitting app1, you get these configmaps:
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   9m46s
> default  podspec-configmap  1   12m{code}
> Then submit app2 while app1 is still ramping up its executors. The 
> podspec-configmap is modified by app2.
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  podspec-configmap  1   13m57s{code}
>  
> PROPOSED SOLUTION:
> Properly prefix the podspec-configmap for each submitted app, ideally the 
> same way as the driver configmap:
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   11m43s
> default  app1--podspec-configmap1   13m57s
> default  app2--driver-conf-map  1   10s 
> default  app2--podspec-configmap1   3m{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32067) Use unique ConfigMap name for executor pod template

2020-10-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208372#comment-17208372
 ] 

Dongjoon Hyun commented on SPARK-32067:
---

[~james...@ymail.com]. It's used when `master` branch is affected to 
distinguish the following.
- `Affected Version = the version of master` means this bug exists in the 
`master` branch, or this new feature is added in the `master` branch for the 
next release.
- `Affected Version = 3.0.2 and not 3.1.0` means this bug exists only in 
`branch-3.0`. In the master branch, it doesn't exist due to other improvements 
or fixes.

> Use unique ConfigMap name for executor pod template
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: James Yu
>Priority: Major
>
> THE BUG:
> The bug is reproducible by spark-submit two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially,  with app2 launching while app1 is still in the middle of 
> ramping up all its executor pods. The unwanted result is that some launched 
> executor pods of app1 end up having app2's executor pod template applied to 
> them.
> The root cause appears to be that app1's podspec-configmap got overwritten by 
> app2 during the overlapping launching periods because both apps use the same 
> ConfigMap (name). This causes some app1's executor pods being ramped up after 
> app2 is launched to be inadvertently launched with the app2's pod template. 
> The issue can be seen as follows:
> First, after submitting app1, you get these configmaps:
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   9m46s
> default  podspec-configmap  1   12m{code}
> Then submit app2 while app1 is still ramping up its executors. The 
> podspec-configmap is modified by app2.
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  podspec-configmap  1   13m57s{code}
>  
> PROPOSED SOLUTION:
> Properly prefix the podspec-configmap for each submitted app, ideally the 
> same way as the driver configmap:
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   11m43s
> default  app1--podspec-configmap1   13m57s
> default  app2--driver-conf-map  1   10s 
> default  app2--podspec-configmap1   3m{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20202) Remove references to org.spark-project.hive

2020-10-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-20202.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29936
[https://github.com/apache/spark/pull/29936]

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1, 2.2.3, 2.3.4, 2.4.4, 3.0.0, 3.1.0
>Reporter: Owen O'Malley
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30201) HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT

2020-10-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208385#comment-17208385
 ] 

Apache Spark commented on SPARK-30201:
--

User 'anuragmantri' has created a pull request for this issue:
https://github.com/apache/spark/pull/29948

> HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
> 
>
> Key: SPARK-30201
> URL: https://issues.apache.org/jira/browse/SPARK-30201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Critical
> Fix For: 3.0.0
>
>
> Now Spark uses `ObjectInspectorCopyOption.JAVA` as the object inspector copy 
> option, which converts any string to a UTF-8 string. When writing non-UTF-8 
> data, `EFBFBD` will appear.
> We should use `ObjectInspectorCopyOption.DEFAULT` to pass the bytes through.
> Here is the way to reproduce (a PySpark version of these steps is sketched below):
> 1. make a file containing the hex bytes 'AABBCC', which are not valid UTF-8.
> 2. create table test1 (c string) location '$file_path';
> 3. select hex(c) from test1; // AABBCC
> 4. create table test2 (c string) as select c from test1;
> 5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD
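A minimal PySpark rendering of the reproduction steps above; it assumes a Hive-enabled
`spark` session and a hypothetical `file_path` pointing at a file whose content is the
raw bytes 0xAA 0xBB 0xCC:

{code:python}
# Sketch of the reproduction, not taken from the ticket verbatim.
spark.sql(f"CREATE TABLE test1 (c STRING) LOCATION '{file_path}'")
spark.sql("SELECT hex(c) FROM test1").show()   # AABBCC
spark.sql("CREATE TABLE test2 (c STRING) AS SELECT c FROM test1")
spark.sql("SELECT hex(c) FROM test2").show()   # EFBFBDEFBFBDEFBFBD while the bug is present
{code}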



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33072) Remove two Hive 1.2-related Jenkins jobs

2020-10-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208388#comment-17208388
 ] 

Dongjoon Hyun commented on SPARK-33072:
---

Could you remove these two jobs, [~shaneknapp]?

> Remove two Hive 1.2-related Jenkins jobs
> 
>
> Key: SPARK-33072
> URL: https://issues.apache.org/jira/browse/SPARK-33072
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> SPARK-20202 removed `hive-1.2` profile at Apache Spark 3.1.0.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-1.2/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33072) Remove two Hive 1.2-related Jenkins jobs

2020-10-05 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-33072:
-

 Summary: Remove two Hive 1.2-related Jenkins jobs
 Key: SPARK-33072
 URL: https://issues.apache.org/jira/browse/SPARK-33072
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun


SPARK-20202 removed `hive-1.2` profile at Apache Spark 3.1.0.
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/

- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-1.2/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30201) HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT

2020-10-05 Thread Anurag Mantripragada (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208389#comment-17208389
 ] 

Anurag Mantripragada commented on SPARK-30201:
--

[~cloud_fan], [~ulysses], [~dongjoon] - I verified this issue is present in 
branch-2.4. Test failure below:



{{[info] == Results ==}}
{{[info] !== Correct Answer - 1 == == Spark Answer - 1 ==}}
{{[info] !struct<> struct}}
{{[info] ![AABBCC] [EFBFBDEFBFBDEFBFBD] (QueryTest.scala:163)}}
{{[info] org.scalatest.exceptions.TestFailedException:}}

 

I created a PR to backport it to branch 2.4. It was a clean cherry-pick, could 
you please take a look? Thanks

[https://github.com/apache/spark/pull/29948]

> HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
> 
>
> Key: SPARK-30201
> URL: https://issues.apache.org/jira/browse/SPARK-30201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Critical
> Fix For: 3.0.0
>
>
> Now Spark uses `ObjectInspectorCopyOption.JAVA` as the object inspector copy 
> option, which converts any string to a UTF-8 string. When writing non-UTF-8 
> data, `EFBFBD` will appear.
> We should use `ObjectInspectorCopyOption.DEFAULT` to pass the bytes through.
> Here is the way to reproduce:
> 1. make a file containing the hex bytes 'AABBCC', which are not valid UTF-8.
> 2. create table test1 (c string) location '$file_path';
> 3. select hex(c) from test1; // AABBCC
> 4. create table test2 (c string) as select c from test1;
> 5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33072) Remove two Hive 1.2-related Jenkins jobs

2020-10-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33072:
--
Description: 
SPARK-20202 removed `hive-1.2` profile at Apache Spark 3.1.0 and excluded the 
following two from the `Spark QA Dashboard`. We need to remove them.
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/

- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-1.2/

  was:
SPARK-20202 removed `hive-1.2` profile at Apache Spark 3.1.0.
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/

- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-1.2/


> Remove two Hive 1.2-related Jenkins jobs
> 
>
> Key: SPARK-33072
> URL: https://issues.apache.org/jira/browse/SPARK-33072
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> SPARK-20202 removed `hive-1.2` profile at Apache Spark 3.1.0 and excluded the 
> following two from the `Spark QA Dashboard`. We need to remove them.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-1.2/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33072) Remove two Hive 1.2-related Jenkins jobs

2020-10-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33072:
--
Issue Type: Task  (was: Bug)

> Remove two Hive 1.2-related Jenkins jobs
> 
>
> Key: SPARK-33072
> URL: https://issues.apache.org/jira/browse/SPARK-33072
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> SPARK-20202 removed `hive-1.2` profile at Apache Spark 3.1.0 and excluded the 
> following two from the `Spark QA Dashboard`. We need to remove them.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-1.2/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33039) Misleading watermark calculation in structure streaming

2020-10-05 Thread Aoyuan Liao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208409#comment-17208409
 ] 

Aoyuan Liao commented on SPARK-33039:
-

It is not a bug. The documentation states:

"It is important to note that the following conditions must be satisfied for the 
watermarking to clean the state in aggregation queries _(as of Spark 2.1.1, 
subject to change in the future)_.
 * *Output mode must be Append or Update*"

See 
[https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking].

In your case, the data is output in complete mode, so all aggregated state is 
preserved.
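
A minimal PySpark sketch of the same pipeline with an output mode that allows the
watermark to clean up state. It assumes an active `spark` session and a CSV source
directory `path` shaped like the one in the report; it is an illustration, not the
reporter's code:

{code:python}
import pyspark.sql.functions as F

events = (spark.readStream
          .option("sep", ";")
          .schema("value STRING, timestamp TIMESTAMP")
          .csv(path))

counts = (events
          .withWatermark("timestamp", "5 seconds")
          .groupBy(F.window(F.col("timestamp"), "10 seconds"))
          .count())

# "update" (or "append") lets the watermark drop state for windows older than the
# watermark; "complete" mode keeps every window, which is what the report observed.
query = (counts.writeStream
         .format("console")
         .outputMode("update")
         .option("truncate", "false")
         .start())
{code}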

> Misleading watermark calculation in structure streaming
> ---
>
> Key: SPARK-33039
> URL: https://issues.apache.org/jira/browse/SPARK-33039
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4
>Reporter: Sandish Kumar HN
>Priority: Major
>
> source code:
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.hadoop.fs.Path
> import java.sql.Timestamp
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.streaming.{ProcessingTime, Trigger}
> object TestWaterMark extends App {
>  val spark = SparkSession.builder().master("local").getOrCreate()
>  val sc = spark.sparkContext
>  val dir = new Path("/tmp/test-structured-streaming")
>  val fs = dir.getFileSystem(sc.hadoopConfiguration)
>  fs.mkdirs(dir)
>  val schema = StructType(StructField("vilue", StringType) ::
>  StructField("timestamp", TimestampType) ::
>  Nil)
>  val eventStream = spark
>  .readStream
>  .option("sep", ";")
>  .option("header", "false")
>  .schema(schema)
>  .csv(dir.toString)
>  // Watermarked aggregation
>  val eventsCount = eventStream
>  .withWatermark("timestamp", "5 seconds")
>  .groupBy(window(col("timestamp"), "10 seconds"))
>  .count
>  def writeFile(path: Path, data: String) {
>  val file = fs.create(path)
>  file.writeUTF(data)
>  file.close()
>  }
>  // Debug query
>  val query = eventsCount.writeStream
>  .format("console")
>  .outputMode("complete")
>  .option("truncate", "false")
>  .trigger(Trigger.ProcessingTime("5 seconds"))
>  .start()
>  writeFile(new Path(dir, "file1"), """
>  |OLD;2019-08-09 10:05:00
>  |OLD;2019-08-09 10:10:00
>  |OLD;2019-08-09 10:15:00""".stripMargin)
>  query.processAllAvailable()
>  val lp1 = query.lastProgress
>  println(lp1.eventTime)
>  writeFile(new Path(dir, "file2"), """
>  |NEW;2020-08-29 10:05:00
>  |NEW;2020-08-29 10:10:00
>  |NEW;2020-08-29 10:15:00""".stripMargin)
>  query.processAllAvailable()
>  val lp2 = query.lastProgress
>  println(lp2.eventTime)
>  writeFile(new Path(dir, "file4"), """
>  |OLD;2017-08-10 10:05:00
>  |OLD;2017-08-10 10:10:00
>  |OLD;2017-08-10 10:15:00""".stripMargin)
>  writeFile(new Path(dir, "file3"), "")
>  query.processAllAvailable()
>  val lp3 = query.lastProgress
>  println(lp3.eventTime)
>  query.awaitTermination()
>  fs.delete(dir, true)
> }
> {code}
> OUTPUT:
>  
> {code:java}
> ---
> Batch: 0
> ---
> +--+-+
> |window |count|
> +--+-+
> |[2019-08-09 10:05:00, 2019-08-09 10:05:10]|1 |
> |[2019-08-09 10:15:00, 2019-08-09 10:15:10]|1 |
> |[2019-08-09 10:10:00, 2019-08-09 10:10:10]|1 |
> +--+-+
> {min=2019-08-09T17:05:00.000Z, avg=2019-08-09T17:10:00.000Z, 
> watermark=1970-01-01T00:00:00.000Z, max=2019-08-09T17:15:00.000Z}
> ---
> Batch: 1
> ---
> +--+-+
> |window |count|
> +--+-+
> |[2020-08-29 10:15:00, 2020-08-29 10:15:10]|1 |
> |[2020-08-29 10:10:00, 2020-08-29 10:10:10]|1 |
> |[2019-08-09 10:05:00, 2019-08-09 10:05:10]|1 |
> |[2020-08-29 10:05:00, 2020-08-29 10:05:10]|1 |
> |[2019-08-09 10:15:00, 2019-08-09 10:15:10]|1 |
> |[2019-08-09 10:10:00, 2019-08-09 10:10:10]|1 |
> +--+-+
> {min=2020-08-29T17:05:00.000Z, avg=2020-08-29T17:10:00.000Z, 
> watermark=2019-08-09T17:14:55.000Z, max=2020-08-29T17:15:00.000Z}
> ---
> Batch: 2
> ---
> +--+-+
> |window |count|
> +--+-+
> |[2017-08-10 10:15:00, 2017-08-10 10:15:10]|1 |
> |[2020-08-29 10:15:00, 2020-08-29 10:15:10]|1 |
> |[2017-08-10 10:05:00, 2017-08-10 10:05:10]|1 |
> |[2020-08-29 10:10:00, 2020-08-29 10:10:10]|1 |
> |[2019-08-0

[jira] [Updated] (SPARK-30201) HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT

2020-10-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30201:
--
Labels: correctness  (was: )

> HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
> 
>
> Key: SPARK-30201
> URL: https://issues.apache.org/jira/browse/SPARK-30201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Critical
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Now Spark uses `ObjectInspectorCopyOption.JAVA` as the object inspector copy 
> option, which converts any string to a UTF-8 string. When writing non-UTF-8 
> data, `EFBFBD` will appear.
> We should use `ObjectInspectorCopyOption.DEFAULT` to pass the bytes through.
> Here is the way to reproduce:
> 1. make a file containing the hex bytes 'AABBCC', which are not valid UTF-8.
> 2. create table test1 (c string) location '$file_path';
> 3. select hex(c) from test1; // AABBCC
> 4. create table test2 (c string) as select c from test1;
> 5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30201) HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT

2020-10-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30201:
--
Affects Version/s: 2.4.7

> HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
> 
>
> Key: SPARK-30201
> URL: https://issues.apache.org/jira/browse/SPARK-30201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Critical
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Now Spark uses `ObjectInspectorCopyOption.JAVA` as the object inspector copy 
> option, which converts any string to a UTF-8 string. When writing non-UTF-8 
> data, `EFBFBD` will appear.
> We should use `ObjectInspectorCopyOption.DEFAULT` to pass the bytes through.
> Here is the way to reproduce:
> 1. make a file containing the hex bytes 'AABBCC', which are not valid UTF-8.
> 2. create table test1 (c string) location '$file_path';
> 3. select hex(c) from test1; // AABBCC
> 4. create table test2 (c string) as select c from test1;
> 5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30201) HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT

2020-10-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208414#comment-17208414
 ] 

Dongjoon Hyun commented on SPARK-30201:
---

I labeled this issue as `correctness` because the query result is wrong.

> HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
> 
>
> Key: SPARK-30201
> URL: https://issues.apache.org/jira/browse/SPARK-30201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Critical
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Now Spark uses `ObjectInspectorCopyOption.JAVA` as the object inspector copy 
> option, which converts any string to a UTF-8 string. When writing non-UTF-8 
> data, `EFBFBD` will appear.
> We should use `ObjectInspectorCopyOption.DEFAULT` to pass the bytes through.
> Here is the way to reproduce:
> 1. make a file containing the hex bytes 'AABBCC', which are not valid UTF-8.
> 2. create table test1 (c string) location '$file_path';
> 3. select hex(c) from test1; // AABBCC
> 4. create table test2 (c string) as select c from test1;
> 5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30201) HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT

2020-10-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30201:
--
Affects Version/s: 2.3.4

> HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
> 
>
> Key: SPARK-30201
> URL: https://issues.apache.org/jira/browse/SPARK-30201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.7, 3.0.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Critical
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Now Spark uses `ObjectInspectorCopyOption.JAVA` as the object inspector copy 
> option, which converts any string to a UTF-8 string. When writing non-UTF-8 
> data, `EFBFBD` will appear.
> We should use `ObjectInspectorCopyOption.DEFAULT` to pass the bytes through.
> Here is the way to reproduce:
> 1. make a file containing the hex bytes 'AABBCC', which are not valid UTF-8.
> 2. create table test1 (c string) location '$file_path';
> 3. select hex(c) from test1; // AABBCC
> 4. create table test2 (c string) as select c from test1;
> 5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30201) HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT

2020-10-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30201:
--
Affects Version/s: 2.2.3

> HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
> 
>
> Key: SPARK-30201
> URL: https://issues.apache.org/jira/browse/SPARK-30201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.7, 3.0.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Critical
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Now Spark uses `ObjectInspectorCopyOption.JAVA` as the object inspector copy 
> option, which converts any string to a UTF-8 string. When writing non-UTF-8 
> data, `EFBFBD` will appear.
> We should use `ObjectInspectorCopyOption.DEFAULT` to pass the bytes through.
> Here is the way to reproduce:
> 1. make a file containing the hex bytes 'AABBCC', which are not valid UTF-8.
> 2. create table test1 (c string) location '$file_path';
> 3. select hex(c) from test1; // AABBCC
> 4. create table test2 (c string) as select c from test1;
> 5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30201) HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT

2020-10-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30201:
--
Affects Version/s: 2.1.3

> HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
> 
>
> Key: SPARK-30201
> URL: https://issues.apache.org/jira/browse/SPARK-30201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Critical
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Now Spark uses `ObjectInspectorCopyOption.JAVA` as the object inspector copy 
> option, which converts any string to a UTF-8 string. When writing non-UTF-8 
> data, `EFBFBD` will appear.
> We should use `ObjectInspectorCopyOption.DEFAULT` to pass the bytes through.
> Here is the way to reproduce:
> 1. make a file containing the hex bytes 'AABBCC', which are not valid UTF-8.
> 2. create table test1 (c string) location '$file_path';
> 3. select hex(c) from test1; // AABBCC
> 4. create table test2 (c string) as select c from test1;
> 5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30201) HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT

2020-10-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30201:
--
Affects Version/s: 2.0.2

> HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
> 
>
> Key: SPARK-30201
> URL: https://issues.apache.org/jira/browse/SPARK-30201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Critical
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Now Spark uses `ObjectInspectorCopyOption.JAVA` as the object inspector copy 
> option, which converts any string to a UTF-8 string. When writing non-UTF-8 
> data, `EFBFBD` will appear.
> We should use `ObjectInspectorCopyOption.DEFAULT` to pass the bytes through.
> Here is the way to reproduce:
> 1. make a file containing the hex bytes 'AABBCC', which are not valid UTF-8.
> 2. create table test1 (c string) location '$file_path';
> 3. select hex(c) from test1; // AABBCC
> 4. create table test2 (c string) as select c from test1;
> 5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33069) Skip test result report if no JUnit XML files are found

2020-10-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33069:


Assignee: Hyukjin Kwon

> Skip test result report if no JUnit XML files are found
> ---
>
> Key: SPARK-33069
> URL: https://issues.apache.org/jira/browse/SPARK-33069
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, if no JUnit XML files are found, the test result report fails.
> See also https://github.com/apache/spark/pull/29906#issuecomment-702525542



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33069) Skip test result report if no JUnit XML files are found

2020-10-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33069.
--
Fix Version/s: 2.4.8
   3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 29946
[https://github.com/apache/spark/pull/29946]

> Skip test result report if no JUnit XML files are found
> ---
>
> Key: SPARK-33069
> URL: https://issues.apache.org/jira/browse/SPARK-33069
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0, 3.0.2, 2.4.8
>
>
> Currently, if no JUnit XML files are found, the test result report fails.
> See also https://github.com/apache/spark/pull/29906#issuecomment-702525542



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33057) Cannot use filter with window operations

2020-10-05 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-33057.
--
Resolution: Invalid

> Cannot use filter with window operations
> 
>
> Key: SPARK-33057
> URL: https://issues.apache.org/jira/browse/SPARK-33057
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Li Jin
>Priority: Major
>
> Currently, trying to use filter with window operations will fail:
>  
> {code:java}
> df = spark.range(100)
> win = Window.partitionBy().orderBy('id')
> df.filter(F.rank().over(win) > 10).show()
> {code}
> Error:
> {code:java}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/sql/dataframe.py",
>  line 1461, in filter
>     jdf = self._jdf.filter(condition._jc)
>   File 
> "/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
>  line 1304, in __call__
>   File 
> "/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/sql/utils.py",
>  line 134, in deco
>     raise_from(converted)
>   File "", line 3, in raise_from
> pyspark.sql.utils.AnalysisException: It is not allowed to use window 
> functions inside WHERE clause;{code}
> Although the code is same as the code below, which works:
> {code:java}
> df = spark.range(100)
> win = Window.partitionBy().orderBy('id')
> df = df.withColumn('rank', F.rank().over(win))
> df = df[df['rank'] > 10]
> df = df.drop('rank'){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33057) Cannot use filter with window operations

2020-10-05 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208431#comment-17208431
 ] 

Takeshi Yamamuro commented on SPARK-33057:
--

I'll close this now based on the comments above. If there is any problem, please 
reopen it.

> Cannot use filter with window operations
> 
>
> Key: SPARK-33057
> URL: https://issues.apache.org/jira/browse/SPARK-33057
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Li Jin
>Priority: Major
>
> Currently, trying to use filter with window operations will fail:
>  
> {code:java}
> df = spark.range(100)
> win = Window.partitionBy().orderBy('id')
> df.filter(F.rank().over(win) > 10).show()
> {code}
> Error:
> {code:java}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/sql/dataframe.py",
>  line 1461, in filter
>     jdf = self._jdf.filter(condition._jc)
>   File 
> "/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
>  line 1304, in __call__
>   File 
> "/Users/icexelloss/opt/miniconda3/envs/ibis-dev-spark-3/lib/python3.8/site-packages/pyspark/sql/utils.py",
>  line 134, in deco
>     raise_from(converted)
>   File "", line 3, in raise_from
> pyspark.sql.utils.AnalysisException: It is not allowed to use window 
> functions inside WHERE clause;{code}
> Although the code is same as the code below, which works:
> {code:java}
> df = spark.range(100)
> win = Window.partitionBy().orderBy('id')
> df = df.withColumn('rank', F.rank().over(win))
> df = df[df['rank'] > 10]
> df = df.drop('rank'){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32945) Avoid project combination when it will hurst performance

2020-10-05 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-32945:

Summary: Avoid project combination when it will hurst performance  (was: 
Avoid project combinatin when it will hurst performance)

> Avoid project combination when it will hurst performance
> 
>
> Key: SPARK-32945
> URL: https://issues.apache.org/jira/browse/SPARK-32945
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> In some cases, project combination will hurt performance. We should avoid 
> project combination for that case.
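
A hypothetical PySpark illustration of the kind of case meant here; it is not taken
from the pull request and assumes an active `spark` session:

{code:python}
import pyspark.sql.functions as F

# An expensive expression is computed once in the first Project...
expensive = F.regexp_extract(F.col("s"), r"(\d+)", 1).cast("int")
df = spark.createDataFrame([("a1",), ("b22",)], ["s"]).select(expensive.alias("x"))

# ...and referenced several times in the next one. If the two Projects are collapsed
# into a single Project, the expensive expression can end up evaluated once per
# reference instead of once per row, which can hurt performance.
out = df.select((F.col("x") + 1).alias("x1"), (F.col("x") * 2).alias("x2"))
out.explain(True)
{code}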



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32945) Avoid project combination when it will hurst performance

2020-10-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208464#comment-17208464
 ] 

Apache Spark commented on SPARK-32945:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29950

> Avoid project combination when it will hurst performance
> 
>
> Key: SPARK-32945
> URL: https://issues.apache.org/jira/browse/SPARK-32945
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> In some cases, project combination will hurt performance. We should avoid 
> project combination for that case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32945) Avoid project combination when it will hurst performance

2020-10-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208463#comment-17208463
 ] 

Apache Spark commented on SPARK-32945:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29950

> Avoid project combination when it will hurst performance
> 
>
> Key: SPARK-32945
> URL: https://issues.apache.org/jira/browse/SPARK-32945
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> In some cases, project combination will hurt performance. We should avoid 
> project combination for that case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32945) Avoid project combination when it will hurst performance

2020-10-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32945:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Avoid project combination when it will hurst performance
> 
>
> Key: SPARK-32945
> URL: https://issues.apache.org/jira/browse/SPARK-32945
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> In some cases, project combination will hurt performance. We should avoid 
> project combination for that case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32945) Avoid project combination when it will hurst performance

2020-10-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32945:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Avoid project combination when it will hurst performance
> 
>
> Key: SPARK-32945
> URL: https://issues.apache.org/jira/browse/SPARK-32945
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> In some cases, project combination will hurt performance. We should avoid 
> project combination for that case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30201) HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT

2020-10-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208468#comment-17208468
 ] 

Dongjoon Hyun commented on SPARK-30201:
---

This lands at `branch-2.4` via https://github.com/apache/spark/pull/29948 

> HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
> 
>
> Key: SPARK-30201
> URL: https://issues.apache.org/jira/browse/SPARK-30201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Critical
>  Labels: correctness
> Fix For: 2.4.8, 3.0.0
>
>
> Now spark use `ObjectInspectorCopyOption.JAVA` as oi option which will 
> convert any string to UTF-8 string. When write non UTF-8 code data, then 
> `EFBFBD` will appear.
> We should use `ObjectInspectorCopyOption.DEFAULT` to support pass the bytes.
> Here is the way to reproduce:
> 1. make a file contains 16 radix 'AABBCC' which is not the UTF-8 code.
> 2. create table test1 (c string) location '$file_path';
> 3. select hex(c) from test1; // AABBCC
> 4. craete table test2 (c string) as select c from test1;
> 5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30201) HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT

2020-10-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30201:
--
Fix Version/s: 2.4.8

> HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
> 
>
> Key: SPARK-30201
> URL: https://issues.apache.org/jira/browse/SPARK-30201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.7, 3.0.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Critical
>  Labels: correctness
> Fix For: 2.4.8, 3.0.0
>
>
> Now spark use `ObjectInspectorCopyOption.JAVA` as oi option which will 
> convert any string to UTF-8 string. When write non UTF-8 code data, then 
> `EFBFBD` will appear.
> We should use `ObjectInspectorCopyOption.DEFAULT` to support pass the bytes.
> Here is the way to reproduce:
> 1. make a file contains 16 radix 'AABBCC' which is not the UTF-8 code.
> 2. create table test1 (c string) location '$file_path';
> 3. select hex(c) from test1; // AABBCC
> 4. craete table test2 (c string) as select c from test1;
> 5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33067) Add negative checks to JDBC v2 Table Catalog tests

2020-10-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33067:


Assignee: Maxim Gekk

> Add negative checks to JDBC v2 Table Catalog tests
> --
>
> Key: SPARK-33067
> URL: https://issues.apache.org/jira/browse/SPARK-33067
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Add checks when JDBC v2 commands fail



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33067) Add negative checks to JDBC v2 Table Catalog tests

2020-10-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33067.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29945
[https://github.com/apache/spark/pull/29945]

> Add negative checks to JDBC v2 Table Catalog tests
> --
>
> Key: SPARK-33067
> URL: https://issues.apache.org/jira/browse/SPARK-33067
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> Add checks when JDBC v2 commands fail



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32067) Use unique ConfigMap name for executor pod template

2020-10-05 Thread Stijn De Haes (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208493#comment-17208493
 ] 

Stijn De Haes commented on SPARK-32067:
---

[~dongjoon] if the PR is merged, are you fine with me cherry-picking this to 
branch-3.0?

Also, I noticed affects version 2.4.7 on the Jira ticket; however, pod template 
support was only added in 3.0.0, unless I am mistaken. So it would be best to 
remove that version.

> Use unique ConfigMap name for executor pod template
> ---
>
> Key: SPARK-32067
> URL: https://issues.apache.org/jira/browse/SPARK-32067
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.4.7, 3.0.1, 3.1.0
>Reporter: James Yu
>Priority: Major
>
> THE BUG:
> The bug is reproducible by spark-submit two different apps (app1 and app2) 
> with different executor pod templates (e.g., different labels) to K8s 
> sequentially,  with app2 launching while app1 is still in the middle of 
> ramping up all its executor pods. The unwanted result is that some launched 
> executor pods of app1 end up having app2's executor pod template applied to 
> them.
> The root cause appears to be that app1's podspec-configmap got overwritten by 
> app2 during the overlapping launching periods because both apps use the same 
> ConfigMap (name). This causes some app1's executor pods being ramped up after 
> app2 is launched to be inadvertently launched with the app2's pod template. 
> The issue can be seen as follows:
> First, after submitting app1, you get these configmaps:
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   9m46s
> default  podspec-configmap  1   12m{code}
> Then submit app2 while app1 is still ramping up its executors. The 
> podspec-configmap is modified by app2.
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   11m43s
> default  app2--driver-conf-map  1   10s
> default  podspec-configmap  1   13m57s{code}
>  
> PROPOSED SOLUTION:
> Properly prefix the podspec-configmap for each submitted app, ideally the 
> same way as the driver configmap:
> {code:java}
> NAMESPACENAME   DATAAGE
> default  app1--driver-conf-map  1   11m43s
> default  app1--podspec-configmap1   13m57s
> default  app2--driver-conf-map  1   10s 
> default  app2--podspec-configmap1   3m{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33073) Improve error handling on Pandas to Arrow conversion failures

2020-10-05 Thread Bryan Cutler (Jira)
Bryan Cutler created SPARK-33073:


 Summary: Improve error handling on Pandas to Arrow conversion 
failures
 Key: SPARK-33073
 URL: https://issues.apache.org/jira/browse/SPARK-33073
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.1
Reporter: Bryan Cutler


Currently, when converting from Pandas to Arrow for Pandas UDF return values or 
from createDataFrame(), PySpark will catch all ArrowExceptions and display info 
on how to disable the safe conversion config. This is displayed with the 
original error as a tuple:

{noformat}
('Exception thrown when converting pandas.Series (object) to Arrow Array 
(int32). It can be caused by overflows or other unsafe conversions warned by 
Arrow. Arrow safe type check can be disabled by using SQL config 
`spark.sql.execution.pandas.convertToArrowArraySafely`.', ArrowInvalid('Could 
not convert a with type str: tried to convert to int'))
{noformat}

The problem is that this is meant mainly for things like float truncation or 
overflow, but it will also show if the user has an invalid schema with 
incompatible types. The extra information is confusing in that case and the real 
error is buried.

This could be improved by only printing the extra info on how to disable safe 
checking if the config is actually set. Also, any safe-conversion failure would 
be a ValueError, of which ArrowInvalid is a subclass, so the catch could be made 
narrower.
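
A hypothetical sketch of the narrower handling described above; the helper name and
variables (`series`, `arrow_type`, `safe_check`, `disable_hint`) are illustrative and
are not PySpark internals:

{code:python}
import pyarrow as pa

def create_array(series, arrow_type, safe_check, disable_hint):
    """Convert a pandas.Series to an Arrow array with narrower error handling."""
    try:
        return pa.Array.from_pandas(series, type=arrow_type, safe=safe_check)
    except ValueError as e:  # pyarrow.lib.ArrowInvalid is a subclass of ValueError
        if safe_check:
            # Only mention how to disable the safe check when it is actually enabled,
            # and chain the original exception instead of wrapping it in a tuple.
            raise ValueError(disable_hint) from e
        raise
{code}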



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33073) Improve error handling on Pandas to Arrow conversion failures

2020-10-05 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-33073:
-
Description: 
Currently, when converting from Pandas to Arrow for Pandas UDF return values or 
from createDataFrame(), PySpark will catch all ArrowExceptions and display info 
on how to disable the safe conversion config. This is displayed with the 
original error as a tuple:

{noformat}
('Exception thrown when converting pandas.Series (object) to Arrow Array 
(int32). It can be caused by overflows or other unsafe conversions warned by 
Arrow. Arrow safe type check can be disabled by using SQL config 
`spark.sql.execution.pandas.convertToArrowArraySafely`.', ArrowInvalid('Could 
not convert a with type str: tried to convert to int'))
{noformat}

The problem is that this is meant mainly for things like float truncation or 
overflow, but it will also show if the user has an invalid schema with 
incompatible types. The extra information is confusing in that case and the real 
error is buried.

This could be improved by only printing the extra info on how to disable safe 
checking if the config is actually set, and by using exception chaining to better 
show the original error. Also, any safe-conversion failure would be a ValueError, 
of which ArrowInvalid is a subclass, so the catch could be made narrower.

  was:
Currently, when converting from Pandas to Arrow for Pandas UDF return values or 
from createDataFrame(), PySpark will catch all ArrowExceptions and display info 
on how to disable the safe conversion config. This is displayed with the 
original error as a tuple:

{noformat}
('Exception thrown when converting pandas.Series (object) to Arrow Array 
(int32). It can be caused by overflows or other unsafe conversions warned by 
Arrow. Arrow safe type check can be disabled by using SQL config 
`spark.sql.execution.pandas.convertToArrowArraySafely`.', ArrowInvalid('Could 
not convert a with type str: tried to convert to int'))
{noformat}

The problem is that this is meant mainly for thing like float truncation or 
overflow, but will also show if the user has an invalid schema with types that 
are incompatible. The extra information is confusing in this case and the real 
error is buried.

This could be improved by only printing the extra info on how to disable safe 
checking if the config is actually set. Also, any safe failures would be a 
ValueError, which ArrowInvaildError is a subclass, so the catch could be made 
more narrow.


> Improve error handling on Pandas to Arrow conversion failures
> -
>
> Key: SPARK-33073
> URL: https://issues.apache.org/jira/browse/SPARK-33073
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently, when converting from Pandas to Arrow for Pandas UDF return values 
> or from createDataFrame(), PySpark will catch all ArrowExceptions and display 
> info on how to disable the safe conversion config. This is displayed with the 
> original error as a tuple:
> {noformat}
> ('Exception thrown when converting pandas.Series (object) to Arrow Array 
> (int32). It can be caused by overflows or other unsafe conversions warned by 
> Arrow. Arrow safe type check can be disabled by using SQL config 
> `spark.sql.execution.pandas.convertToArrowArraySafely`.', ArrowInvalid('Could 
> not convert a with type str: tried to convert to int'))
> {noformat}
> The problem is that this is meant mainly for things like float truncation or 
> overflow, but it will also show if the user has an invalid schema with 
> incompatible types. The extra information is confusing in that case and 
> the real error is buried.
> This could be improved by only printing the extra info on how to disable safe 
> checking if the config is actually set, and by using exception chaining to better 
> show the original error. Also, any safe-conversion failure would be a ValueError, 
> of which ArrowInvalid is a subclass, so the catch could be made narrower.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org