[jira] [Commented] (SPARK-31783) Performance test on dense and sparse datasets

2020-05-20 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112818#comment-17112818
 ] 

zhengruifeng commented on SPARK-31783:
--

The test code is in the attached 'blockify_total', and the current results are in the attached .xlsx file.

> Performance test on dense and sparse datasets
> -
>
> Key: SPARK-31783
> URL: https://issues.apache.org/jira/browse/SPARK-31783
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Attachments: blockify_perf_20200521.xlsx, blockify_total
>
>







[jira] [Updated] (SPARK-31783) Performance test on dense and sparse datasets

2020-05-20 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-31783:
-
Attachment: blockify_perf_20200521.xlsx

> Performance test on dense and sparse datasets
> -
>
> Key: SPARK-31783
> URL: https://issues.apache.org/jira/browse/SPARK-31783
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Attachments: blockify_perf_20200521.xlsx, blockify_total
>
>







[jira] [Updated] (SPARK-31783) Performance test on dense and sparse datasets

2020-05-20 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-31783:
-
Attachment: blockify_total

> Performance test on dense and sparse datasets
> -
>
> Key: SPARK-31783
> URL: https://issues.apache.org/jira/browse/SPARK-31783
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Attachments: blockify_perf_20200521.xlsx, blockify_total
>
>







[jira] [Assigned] (SPARK-31783) Performance test on dense and sparse datasets

2020-05-20 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-31783:


Assignee: zhengruifeng

> Performance test on dense and sparse datasets
> -
>
> Key: SPARK-31783
> URL: https://issues.apache.org/jira/browse/SPARK-31783
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>







[jira] [Created] (SPARK-31783) Performance test on dense and sparse datasets

2020-05-20 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-31783:


 Summary: Performance test on dense and sparse datasets
 Key: SPARK-31783
 URL: https://issues.apache.org/jira/browse/SPARK-31783
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 3.1.0
Reporter: zhengruifeng









[jira] [Created] (SPARK-31782) Performance test on dense and sparse datasets

2020-05-20 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-31782:


 Summary: Performance test on dense and sparse datasets
 Key: SPARK-31782
 URL: https://issues.apache.org/jira/browse/SPARK-31782
 Project: Spark
  Issue Type: Test
  Components: ML
Affects Versions: 3.1.0
Reporter: zhengruifeng









[jira] [Resolved] (SPARK-31782) Performance test on dense and sparse datasets

2020-05-20 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-31782.
--
Resolution: Duplicate

> Performance test on dense and sparse datasets
> -
>
> Key: SPARK-31782
> URL: https://issues.apache.org/jira/browse/SPARK-31782
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>







[jira] [Commented] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join

2020-05-20 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112816#comment-17112816
 ] 

Jungtaek Lim commented on SPARK-31754:
--

I can also take a look if the input and checkpoint are sharable (say, no 
production inputs), but I imagine it's unlikely.

> Spark Structured Streaming: NullPointerException in Stream Stream join
> --
>
> Key: SPARK-31754
> URL: https://issues.apache.org/jira/browse/SPARK-31754
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark Version : 2.4.0
> Hadoop Version : 3.0.0
>Reporter: Puviarasu
>Priority: Major
>  Labels: structured-streaming
> Attachments: CodeGen.txt, Logical-Plan.txt
>
>
> When joining 2 streams with watermarking and windowing, we get a 
> NullPointerException after running for a few minutes. 
> After the failure we analyzed the checkpoint offsets/sources and found the files 
> for which the application failed. These files do not have any null values 
> in the join columns. 
> We even restarted the job with just those files and the application ran. From this we 
> concluded that the exception is not caused by the data from the streams.
> *Code:*
>  
> {code:java}
> val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> 
> "1" )
>  val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> 
> "1" )
>  
> spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1")
>  
> spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2")
>  spark.sql("select * from source1 where eventTime1 is not null and col1 is 
> not null").withWatermark("eventTime1", "30 
> minutes").createTempView("viewNotNull1")
>  spark.sql("select * from source2 where eventTime2 is not null and col2 is 
> not null").withWatermark("eventTime2", "30 
> minutes").createTempView("viewNotNull2")
>  spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = 
> b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + 
> interval 2 hours").createTempView("join")
>  val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> 
> "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3")
>  spark.sql("select * from 
> join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 
> seconds")).format("parquet").options(optionsMap3).start()
> {code}
>  
> *Exception:*
>  
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
> Aborting TaskSet 4.0 because task 0 (partition 0)
> cannot run anywhere due to node and executor blacklist.
> Most recent failure:
> Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply$mcV$spala:338)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream)
> at 
> 

[jira] [Updated] (SPARK-31761) Sql Div operator can result in incorrect output for int_min

2020-05-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31761:
-
Target Version/s:   (was: 3.1.0)

> Sql Div operator can result in incorrect output for int_min
> ---
>
> Key: SPARK-31761
> URL: https://issues.apache.org/jira/browse/SPARK-31761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kuhu Shukla
>Priority: Major
>
> Input in CSV: -2147483648,-1  --> (_c0, _c1)
> {code}
> val res = df.selectExpr("_c0 div _c1")
> res.collect
> res1: Array[org.apache.spark.sql.Row] = Array([-2147483648])
> {code}
> The result should be 2147483648 instead.
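
For context, a minimal Scala sketch (not from the ticket) of the 32-bit integer overflow that likely produces this result: Int.MinValue has no positive counterpart in 32 bits, so dividing it by -1 wraps back to Int.MinValue when the division is evaluated on int operands.

{code:scala}
// Two's-complement overflow in 32-bit integer division.
val wrapped = Int.MinValue / -1          // -2147483648: the overflowed value reported above
val widened = Int.MinValue.toLong / -1L  // 2147483648: the mathematically expected result
{code}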






[jira] [Updated] (SPARK-31762) Fix perf regression of date/timestamp formatting in toHiveString

2020-05-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31762:

Parent: SPARK-31408
Issue Type: Sub-task  (was: Bug)

> Fix perf regression of date/timestamp formatting in toHiveString
> 
>
> Key: SPARK-31762
> URL: https://issues.apache.org/jira/browse/SPARK-31762
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> HiveResult.toHiveString has to convert incoming Java date/timestamp types to 
> days/microseconds because the existing APIs of DateFormatter/TimestampFormatter 
> don't accept java.sql.Timestamp/java.util.Date and 
> java.time.Instant/java.time.LocalDate. Internally, the formatters perform 
> conversions back to Java types again. This badly impacts performance. The 
> ticket aims to add new APIs to DateFormatter and TimestampFormatter that 
> accept Java types directly.
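
A rough sketch of the shape such APIs could take; the method names and signatures below are assumptions for illustration, not the actual Spark change.

{code:scala}
import java.time.{Instant, LocalDate}

// Hypothetical overloads that accept Java time types directly, so a caller like
// HiveResult.toHiveString would no longer convert to internal days/microseconds
// only to have the formatter convert back again.
trait DateFormatter {
  def format(days: Int): String            // existing style: internal day count
  def format(localDate: LocalDate): String // proposed style: Java type accepted directly
}

trait TimestampFormatter {
  def format(us: Long): String             // existing style: internal microseconds
  def format(instant: Instant): String     // proposed style: Java type accepted directly
}
{code}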






[jira] [Assigned] (SPARK-31762) Fix perf regression of date/timestamp formatting in toHiveString

2020-05-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31762:
---

Assignee: Maxim Gekk

> Fix perf regression of date/timestamp formatting in toHiveString
> 
>
> Key: SPARK-31762
> URL: https://issues.apache.org/jira/browse/SPARK-31762
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> HiveResult.toHiveString has to convert incoming Java date/timestamp types to 
> days/microseconds because the existing APIs of DateFormatter/TimestampFormatter 
> don't accept java.sql.Timestamp/java.util.Date and 
> java.time.Instant/java.time.LocalDate. Internally, the formatters perform 
> conversions back to Java types again. This badly impacts performance. The 
> ticket aims to add new APIs to DateFormatter and TimestampFormatter that 
> accept Java types directly.






[jira] [Resolved] (SPARK-31762) Fix perf regression of date/timestamp formatting in toHiveString

2020-05-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31762.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28582
[https://github.com/apache/spark/pull/28582]

> Fix perf regression of date/timestamp formatting in toHiveString
> 
>
> Key: SPARK-31762
> URL: https://issues.apache.org/jira/browse/SPARK-31762
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> HiveResult.toHiveString has to convert incoming Java date/timestamp types to 
> days/microseconds because the existing APIs of DateFormatter/TimestampFormatter 
> don't accept java.sql.Timestamp/java.util.Date and 
> java.time.Instant/java.time.LocalDate. Internally, the formatters perform 
> conversions back to Java types again. This badly impacts performance. The 
> ticket aims to add new APIs to DateFormatter and TimestampFormatter that 
> accept Java types directly.






[jira] [Updated] (SPARK-31760) Simplification Based on Containment

2020-05-20 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31760:

Summary: Simplification Based on Containment  (was: Consolidating Single 
Table Predicates)

> Simplification Based on Containment
> ---
>
> Key: SPARK-31760
> URL: https://issues.apache.org/jira/browse/SPARK-31760
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> https://docs.teradata.com/reader/Ws7YT1jvRK2vEr1LpVURug/V~FCwD9BL7gY4ac3WwHInw
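
Presumably (inferred from the title and the linked Teradata doc, not from a patch on this ticket) the idea is to drop a range predicate whose matching rows are contained in another predicate's, for example:

{code:scala}
// Illustrative only; the table and column names are made up (spark-shell implicits assumed).
Seq(1, 6, 12).toDF("a").createOrReplaceTempView("t")

spark.sql("SELECT * FROM t WHERE a > 10 AND a > 5")  // a > 10 implies a > 5, so this can simplify to: a > 10
spark.sql("SELECT * FROM t WHERE a > 10 OR a > 5")   // a > 5 is the weaker bound, so this can simplify to: a > 5
{code}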






[jira] [Commented] (SPARK-31781) Move param k (number of clusters) to shared params

2020-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112716#comment-17112716
 ] 

Apache Spark commented on SPARK-31781:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28595

> Move param k (number of clusters) to shared params
> --
>
> Key: SPARK-31781
> URL: https://issues.apache.org/jira/browse/SPARK-31781
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Param k (number of clusters) is used for all the clustering algorithms, so 
> move it to shared params.






[jira] [Assigned] (SPARK-31781) Move param k (number of clusters) to shared params

2020-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31781:


Assignee: Apache Spark

> Move param k (number of clusters) to shared params
> --
>
> Key: SPARK-31781
> URL: https://issues.apache.org/jira/browse/SPARK-31781
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> Param k (number of clusters) is used for all the clustering algorithms, so 
> move it to shared params.






[jira] [Commented] (SPARK-31781) Move param k (number of clusters) to shared params

2020-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112715#comment-17112715
 ] 

Apache Spark commented on SPARK-31781:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28595

> Move param k (number of clusters) to shared params
> --
>
> Key: SPARK-31781
> URL: https://issues.apache.org/jira/browse/SPARK-31781
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Param k (number of clusters) is used for all the clustering algorithms, so 
> move it to shared params.






[jira] [Assigned] (SPARK-31781) Move param k (number of clusters) to shared params

2020-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31781:


Assignee: (was: Apache Spark)

> Move param k (number of clusters) to shared params
> --
>
> Key: SPARK-31781
> URL: https://issues.apache.org/jira/browse/SPARK-31781
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Param k (number of clusters) is used for all the clustering algorithms, so 
> move it to shared params.






[jira] [Created] (SPARK-31781) Move param k (number of clusters) to shared params

2020-05-20 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31781:
--

 Summary: Move param k (number of clusters) to shared params
 Key: SPARK-31781
 URL: https://issues.apache.org/jira/browse/SPARK-31781
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.1.0
Reporter: Huaxin Gao


Param k (number of clusters) is used for all the clustering algorithms, so move 
it to shared params.
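
A minimal sketch of the shared-param pattern this likely follows; the trait name below is an assumption, not the actual change.

{code:scala}
import org.apache.spark.ml.param.{IntParam, ParamValidators, Params}

// Hypothetical shared trait that clustering estimators (KMeans, BisectingKMeans,
// GaussianMixture, ...) could mix in instead of each declaring its own copy of k.
trait HasK extends Params {
  final val k: IntParam =
    new IntParam(this, "k", "The number of clusters to create. Must be > 1.", ParamValidators.gt(1))

  final def getK: Int = $(k)
}
{code}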






[jira] [Assigned] (SPARK-31780) Add R test tag to exclude R K8s image building and test

2020-05-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31780:
-

Assignee: Dongjoon Hyun

> Add R test tag to exclude R K8s image building and test
> ---
>
> Key: SPARK-31780
> URL: https://issues.apache.org/jira/browse/SPARK-31780
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>







[jira] [Resolved] (SPARK-31780) Add R test tag to exclude R K8s image building and test

2020-05-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31780.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28594
[https://github.com/apache/spark/pull/28594]

> Add R test tag to exclude R K8s image building and test
> ---
>
> Key: SPARK-31780
> URL: https://issues.apache.org/jira/browse/SPARK-31780
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Updated] (SPARK-31780) Add R test tag to exclude R K8s image building and test

2020-05-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31780:
--
Priority: Minor  (was: Major)

> Add R test tag to exclude R K8s image building and test
> ---
>
> Key: SPARK-31780
> URL: https://issues.apache.org/jira/browse/SPARK-31780
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>







[jira] [Updated] (SPARK-31780) Add R test tag to exclude R K8s image building and test

2020-05-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31780:
--
Affects Version/s: (was: 3.1.0)
   3.0.0

> Add R test tag to exclude R K8s image building and test
> ---
>
> Key: SPARK-31780
> URL: https://issues.apache.org/jira/browse/SPARK-31780
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-31780) Add R test tag to exclude R K8s image building and test

2020-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31780:


Assignee: Apache Spark

> Add R test tag to exclude R K8s image building and test
> ---
>
> Key: SPARK-31780
> URL: https://issues.apache.org/jira/browse/SPARK-31780
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-31780) Add R test tag to exclude R K8s image building and test

2020-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112662#comment-17112662
 ] 

Apache Spark commented on SPARK-31780:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/28594

> Add R test tag to exclude R K8s image building and test
> ---
>
> Key: SPARK-31780
> URL: https://issues.apache.org/jira/browse/SPARK-31780
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-31780) Add R test tag to exclude R K8s image building and test

2020-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31780:


Assignee: (was: Apache Spark)

> Add R test tag to exclude R K8s image building and test
> ---
>
> Key: SPARK-31780
> URL: https://issues.apache.org/jira/browse/SPARK-31780
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Created] (SPARK-31780) Add R test tag to exclude R K8s image building and test

2020-05-20 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-31780:
-

 Summary: Add R test tag to exclude R K8s image building and test
 Key: SPARK-31780
 URL: https://issues.apache.org/jira/browse/SPARK-31780
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Tests
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun









[jira] [Created] (SPARK-31779) Redefining struct inside array incorrectly wraps child fields in array

2020-05-20 Thread Jeff Evans (Jira)
Jeff Evans created SPARK-31779:
--

 Summary: Redefining struct inside array incorrectly wraps child 
fields in array
 Key: SPARK-31779
 URL: https://issues.apache.org/jira/browse/SPARK-31779
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5
Reporter: Jeff Evans


It seems that redefining a {{struct}} for the purpose of removing a sub-field, 
when that {{struct}} is itself inside an {{array}}, results in the remaining 
(non-removed) {{struct}} fields themselves being incorrectly wrapped in an 
array.

For more context, see [this|https://stackoverflow.com/a/46084983/375670] 
StackOverflow answer and discussion thread.  I have debugged this code and 
distilled it down to what I believe represents a bug in Spark itself.

Consider the following {{spark-shell}} session (version 2.4.5):

{code}
// use a nested JSON structure that contains a struct inside an array
val jsonData = """{
  "foo": "bar",
  "top": {
"child1": 5,
"child2": [
  {
"child2First": "one",
"child2Second": 2
  }
]
  }
}"""

// read into a DataFrame
val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())

// create a new definition for "top", which will remove the
// "top.child2.child2First" column
val newTop = struct(df("top").getField("child1").alias("child1"),
  array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"))).alias("child2"))

// show the schema before and after swapping out the struct definition
df.schema.toDDL
// `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: ARRAY<STRUCT<`child2First`: STRING, `child2Second`: BIGINT>>>
df.withColumn("top", newTop).schema.toDDL
// `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: ARRAY<STRUCT<`child2Second`: ARRAY<BIGINT>>>>
{code}

Notice that in this case the new type of {{top.child2.child2Second}} is 
{{ARRAY<BIGINT>}}.  This is incorrect; it should simply be {{BIGINT}}.  
There is nothing in the definition of the {{newTop}} {{struct}} that should 
have caused the type to become wrapped in an array like this.
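
For what it's worth, a possible workaround sketch (not from this ticket, and assuming Spark 2.4's SQL higher-order functions are available): rebuild child2 element by element instead of calling getField on the array column, since getField on an array of structs returns an array of the field values, which is where the extra array wrapping comes from.

{code}
// Workaround sketch, reusing the df defined above.
import org.apache.spark.sql.functions.{expr, struct}

val newTop2 = struct(df("top").getField("child1").alias("child1"),
  expr("transform(top.child2, x -> named_struct('child2Second', x.child2Second))").alias("child2"))

df.withColumn("top", newTop2).schema.toDDL
// expected: `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: ARRAY<STRUCT<`child2Second`: BIGINT>>>
{code}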









[jira] [Created] (SPARK-31778) Support cross-building docker images

2020-05-20 Thread Holden Karau (Jira)
Holden Karau created SPARK-31778:


 Summary: Support cross-building docker images
 Key: SPARK-31778
 URL: https://issues.apache.org/jira/browse/SPARK-31778
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0, 3.0.1, 3.1.0
Reporter: Holden Karau


We have a CI for Spark on ARM; we should support building images for both ARM and 
AMD64 now that docker buildx can cross-compile.






[jira] [Created] (SPARK-31777) CrossValidator supports user-supplied folds

2020-05-20 Thread Xiangrui Meng (Jira)
Xiangrui Meng created SPARK-31777:
-

 Summary: CrossValidator supports user-supplied folds 
 Key: SPARK-31777
 URL: https://issues.apache.org/jira/browse/SPARK-31777
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 3.1.0
Reporter: Xiangrui Meng


As a user, I can specify how CrossValidator should create folds by specifying a 
foldCol, which should be of integer type with values in the range [0, numFolds). 
If foldCol is specified, Spark won't do a random k-fold split. This is useful if 
there is custom logic to create folds, e.g., random splits by user instead of 
random splits of events.

This is similar to SPARK-16206, which is for the RDD-based APIs.
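
A hedged sketch of the kind of user-defined fold assignment this enables; the DataFrame and column names are made up, and the exact parameter name on CrossValidator may differ.

{code:scala}
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

// Put every event of the same user into the same fold, instead of a purely
// row-level random split. The resulting integer column has values in
// [0, numFolds), which is what a user-supplied foldCol would point at.
val events = Seq(("u1", 10), ("u1", 11), ("u2", 7)).toDF("userId", "value")  // spark-shell implicits assumed
val numFolds = 3
val withFold = events.withColumn("fold", pmod(hash(col("userId")), lit(numFolds)))
{code}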






[jira] [Updated] (SPARK-31776) Literal lit() supports lists and numpy arrays

2020-05-20 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-31776:
--
Issue Type: Improvement  (was: New Feature)

> Literal lit() supports lists and numpy arrays
> -
>
> Key: SPARK-31776
> URL: https://issues.apache.org/jira/browse/SPARK-31776
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In ML workloads, it is common to replace null feature vectors with some 
> default value. However, lit() does not accept Python lists and numpy arrays 
> as input. Users cannot simply use fillna() to get the job done.






[jira] [Created] (SPARK-31776) Literal lit() supports lists and numpy arrays

2020-05-20 Thread Xiangrui Meng (Jira)
Xiangrui Meng created SPARK-31776:
-

 Summary: Literal lit() supports lists and numpy arrays
 Key: SPARK-31776
 URL: https://issues.apache.org/jira/browse/SPARK-31776
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.0
Reporter: Xiangrui Meng


In ML workloads, it is common to replace null feature vectors with some default 
value. However, lit() does not accept Python lists and numpy arrays as input. 
Users cannot simply use fillna() to get the job done.






[jira] [Created] (SPARK-31775) Support tensor type (TensorType) in Spark SQL/DataFrame

2020-05-20 Thread Xiangrui Meng (Jira)
Xiangrui Meng created SPARK-31775:
-

 Summary: Support tensor type (TensorType) in Spark SQL/DataFrame
 Key: SPARK-31775
 URL: https://issues.apache.org/jira/browse/SPARK-31775
 Project: Spark
  Issue Type: New Feature
  Components: ML, SQL
Affects Versions: 3.1.0
Reporter: Xiangrui Meng


More and more DS/ML workloads are dealing with tensors. For example, a decoded 
color image can be represented by a 3D tensor. It would be nice to natively 
support a tensor type. A local tensor is essentially an array with shape info, 
stored together. Native support is better than a user-defined type (UDT) because 
PyArrow does not support UDTs but already supports tensors.
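
One way to picture "an array with shape info, stored together"; this is only a sketch under that description, not the actual TensorType design.

{code:scala}
import org.apache.spark.sql.types._

// Hypothetical struct-based stand-in for a tensor value: the shape plus the
// flattened values in row-major order. A decoded color image of height x width x 3
// would carry shape = [height, width, 3].
val tensorLikeSchema = StructType(Seq(
  StructField("shape", ArrayType(IntegerType, containsNull = false), nullable = false),
  StructField("values", ArrayType(FloatType, containsNull = false), nullable = false)
))
{code}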






[jira] [Assigned] (SPARK-31387) HiveThriftServer2Listener update methods fail with unknown operation/session id

2020-05-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31387:
-

Assignee: Ali Smesseim

> HiveThriftServer2Listener update methods fail with unknown operation/session 
> id
> ---
>
> Key: SPARK-31387
> URL: https://issues.apache.org/jira/browse/SPARK-31387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5, 3.0.0
>Reporter: Ali Smesseim
>Assignee: Ali Smesseim
>Priority: Major
>
> HiveThriftServer2Listener update methods, such as onSessionClosed and 
> onOperationError, throw a NullPointerException (in Spark 3) or a 
> NoSuchElementException (in Spark 2) when the input session/operation id is 
> unknown. In Spark 2, this can cause control-flow issues for the caller of 
> the listener. In Spark 3, the listener is called by a ListenerBus which 
> catches the exception, but it would still be nicer if an invalid update were 
> logged rather than throwing an exception.
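
A minimal, self-contained sketch of the log-and-skip behavior being suggested; the names below are made up and are not the actual listener internals.

{code:scala}
import scala.collection.mutable

// Hypothetical stand-ins for the listener's bookkeeping.
case class SessionInfo(id: String, var finishTimestamp: Long = 0L)
val sessionData = mutable.Map.empty[String, SessionInfo]

def onSessionClosed(sessionId: String, time: Long): Unit =
  sessionData.get(sessionId) match {
    case Some(info) => info.finishTimestamp = time
    case None =>
      // Log and ignore an unknown id instead of throwing NPE/NoSuchElementException.
      println(s"onSessionClosed called with unknown session id: $sessionId")
  }
{code}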






[jira] [Resolved] (SPARK-31387) HiveThriftServer2Listener update methods fail with unknown operation/session id

2020-05-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31387.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28544
[https://github.com/apache/spark/pull/28544]

> HiveThriftServer2Listener update methods fail with unknown operation/session 
> id
> ---
>
> Key: SPARK-31387
> URL: https://issues.apache.org/jira/browse/SPARK-31387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5, 3.0.0
>Reporter: Ali Smesseim
>Assignee: Ali Smesseim
>Priority: Major
> Fix For: 3.0.0
>
>
> HiveThriftServer2Listener update methods, such as onSessionClosed and 
> onOperationError, throw a NullPointerException (in Spark 3) or a 
> NoSuchElementException (in Spark 2) when the input session/operation id is 
> unknown. In Spark 2, this can cause control-flow issues for the caller of 
> the listener. In Spark 3, the listener is called by a ListenerBus which 
> catches the exception, but it would still be nicer if an invalid update were 
> logged rather than throwing an exception.






[jira] [Created] (SPARK-31773) getting the Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, at org.apache.spark.sql.catalyst.errors.package$.attachTree(pac

2020-05-20 Thread Pankaj Tiwari (Jira)
Pankaj Tiwari created SPARK-31773:
-

 Summary: getting the Caused by: 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
 Key: SPARK-31773
 URL: https://issues.apache.org/jira/browse/SPARK-31773
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
 Environment: spark 2.2
Reporter: Pankaj Tiwari


I am loading an Excel file that has about 90 columns, and some of the column 
names contain special characters such as @ % -> . While running a use case like:

sourceDataSet.select(columnSeq).except(targetDataset.select(columnSeq))

this works fine, but as soon as I run

sourceDataSet.select(columnSeq).except(targetDataset.select(columnSeq)).count()

it fails with an error like:

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:

Exchange SinglePartition

+- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#26596L])

   +- *HashAggregate(keys=columns name 

 

 

Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
Binding attribute, tree:column namet#14050

        at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)

        at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)

        at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)

        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)

        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)

        at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)

        at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)

        at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)

        at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)

        at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$40.apply(HashAggregateExec.scala:703)

        at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$40.apply(HashAggregateExec.scala:703)

        at 
scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)

        at 
scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)

        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)

        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)

        at 
scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)

        at 
scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)

        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)

        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)

        at scala.collection.immutable.Stream.foreach(Stream.scala:595)

        at 
scala.collection.TraversableOnce$class.count(TraversableOnce.scala:115)

        at scala.collection.AbstractTraversable.count(Traversable.scala:104)

        at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:312)

        at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:702)

        at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsume(HashAggregateExec.scala:156)

        at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)

        at 
org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36)

 

 

 

 

Caused by: java.lang.RuntimeException: Couldn't find here one name of column 
following with

  at scala.sys.package$.error(package.scala:27)

        at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94)

        at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88)

        at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)

 

 

 

 






[jira] [Created] (SPARK-31774) getting the Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, at org.apache.spark.sql.catalyst.errors.package$.attachTree(pac

2020-05-20 Thread Pankaj Tiwari (Jira)
Pankaj Tiwari created SPARK-31774:
-

 Summary: getting the Caused by: 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
 Key: SPARK-31774
 URL: https://issues.apache.org/jira/browse/SPARK-31774
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
 Environment: spark 2.2
Reporter: Pankaj Tiwari


I am loading an Excel file that has about 90 columns, and some of the column 
names contain special characters such as @ % -> . While running a use case like:

sourceDataSet.select(columnSeq).except(targetDataset.select(columnSeq))

this works fine, but as soon as I run

sourceDataSet.select(columnSeq).except(targetDataset.select(columnSeq)).count()

it fails with an error like:

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:

Exchange SinglePartition

+- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#26596L])

   +- *HashAggregate(keys=columns name 

 

 

Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
Binding attribute, tree:column namet#14050

        at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)

        at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)

        at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)

        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)

        at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)

        at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)

        at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)

        at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)

        at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)

        at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$40.apply(HashAggregateExec.scala:703)

        at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$40.apply(HashAggregateExec.scala:703)

        at 
scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)

        at 
scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)

        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)

        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)

        at 
scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)

        at 
scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)

        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)

        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)

        at scala.collection.immutable.Stream.foreach(Stream.scala:595)

        at 
scala.collection.TraversableOnce$class.count(TraversableOnce.scala:115)

        at scala.collection.AbstractTraversable.count(Traversable.scala:104)

        at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:312)

        at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:702)

        at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsume(HashAggregateExec.scala:156)

        at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)

        at 
org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36)

 

 

 

 

Caused by: java.lang.RuntimeException: Couldn't find here one name of column 
following with

  at scala.sys.package$.error(package.scala:27)

        at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94)

        at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88)

        at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)

 

 

 

 






[jira] [Created] (SPARK-31772) Json schema reading is not consistent between int and string types

2020-05-20 Thread yaniv oren (Jira)
yaniv oren created SPARK-31772:
--

 Summary: Json schema reading is not consistent between int and 
string types
 Key: SPARK-31772
 URL: https://issues.apache.org/jira/browse/SPARK-31772
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.4
Reporter: yaniv oren


When reading a JSON file using a schema, an int value is converted to string if 
the schema field is string, but a string value is not converted to int if the schema field is int.

Sample Code:

read_schema = StructType([StructField("a", IntegerType()),
                          StructField("b", StringType())])
df = self.spark_session.read.schema(read_schema).json("input/json/temp_test")
df.show()

 

Contents of input/json/temp_test:

{"a": 1,"b": "b1"}
{"a": 2,"b": "b2"}
{"a": 3,"b": 3}
{"a": "4","b": 4}

 

actual:

+----+----+
|   a|   b|
+----+----+
|   1|  b1|
|   2|  b2|
|   3|   3|
|null|null|
+----+----+

 

expected:

The third line should be nulled like the fourth line, since its b value is an int 
while the schema declares b as a string.






[jira] [Comment Edited] (SPARK-31761) Sql Div operator can result in incorrect output for int_min

2020-05-20 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112368#comment-17112368
 ] 

Sandeep Katta edited comment on SPARK-31761 at 5/20/20, 3:46 PM:
-

[~sowen] [~hyukjin.kwon]  [~dongjoon]

 

I have executed the same query in spark-2.4.4 it works as per expectation.

 

As you can see from 2.4.4 plan, columns are casted to double,  so there won't 
be *Integer overflow.*

== Parsed Logical Plan ==
 'Project [cast(('col0 / 'col1) as bigint) AS CAST((col0 / col1) AS 
BIGINT)#4|#4]
 +- Relation[col0#0,col1#1|#0,col1#1] csv

== Analyzed Logical Plan ==
 CAST((col0 / col1) AS BIGINT): bigint
 Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS 
CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as 
bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
 +- Relation[col0#0,col1#1|#0,col1#1] csv

== Optimized Logical Plan ==
 Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS 
CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as 
bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
 +- Relation[col0#0,col1#1|#0,col1#1] csv

== Physical Plan ==
 *(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as 
bigint) AS CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as 
double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
 +- *(1) FileScan csv [col0#0,col1#1|#0,col1#1] Batched: false, Format: CSV, 
Location: 
InMemoryFileIndex[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv],
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct
 *(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as 
bigint) AS CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as 
double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
 +- *(1) FileScan csv [col0#0,col1#1|#0,col1#1] Batched: false, Format: CSV, 
Location: 
InMemoryFileIndex[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv],
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct

*Spark-3.0 Plan*

== Parsed Logical Plan ==
 'Project [('col0 div 'col1) AS (col0 div col1)#4|#4]
 +- RelationV2[col0#0, col1#1|#0, col1#1] csv 
[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv]

== Analyzed Logical Plan ==
 (col0 div col1): int
 Project [(col0#0 div col1#1) AS (col0 div col1)#4|#0 div col1#1) AS (col0 div 
col1)#4]
 +- RelationV2[col0#0, col1#1|#0, col1#1] csv 
[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv]

== Optimized Logical Plan ==
 Project [(col0#0 div col1#1) AS (col0 div col1)#4|#0 div col1#1) AS (col0 div 
col1)#4]
 +- RelationV2[col0#0, col1#1|#0, col1#1] csv 
[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv]

== Physical Plan ==
 *(1) Project [(col0#0 div col1#1) AS (col0 div col1)#4|#0 div col1#1) AS (col0 
div col1)#4]
 +- BatchScan[col0#0, col1#1|#0, col1#1] CSVScan Location: 
InMemoryFileIndex[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv],
 ReadSchema: struct

 

In Spark3 do I need to cast the columns as in spark-2.4,  or user should 
manually add cast to their query as per below example

 

val schema = "col0 int,col1 int";
 val df = spark.read.schema(schema).csv("file:/opt/fordebug/divTest.csv");
 val res = df.selectExpr("col0 div col1")
 val res = df.selectExpr("Cast(col0 as Decimal) div col1 ")
 res.collect

 

please let us know your opinion 


was (Author: sandeep.katta2007):
[~sowen] [~hyukjin.kwon] 

 

I have executed the same query in spark-2.4.4 it works as per expectation.

 

As you can see from 2.4.4 plan, columns are casted to double,  so there won't 
be *Integer overflow.*

== Parsed Logical Plan ==
 'Project [cast(('col0 / 'col1) as bigint) AS CAST((col0 / col1) AS 
BIGINT)#4|#4]
 +- Relation[col0#0,col1#1|#0,col1#1] csv

== Analyzed Logical Plan ==
 CAST((col0 / col1) AS BIGINT): bigint
 Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS 
CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as 
bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
 +- Relation[col0#0,col1#1|#0,col1#1] csv

== Optimized Logical Plan ==
 Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS 
CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as 
bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
 +- Relation[col0#0,col1#1|#0,col1#1] csv

== Physical Plan ==
 *(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as 
bigint) AS CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as 
double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
 +- *(1) FileScan csv [col0#0,col1#1|#0,col1#1] Batched: false, Format: CSV, 
Location: 
InMemoryFileIndex[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv],
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct
 *(1) Project 

[jira] [Comment Edited] (SPARK-31761) Sql Div operator can result in incorrect output for int_min

2020-05-20 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112368#comment-17112368
 ] 

Sandeep Katta edited comment on SPARK-31761 at 5/20/20, 3:44 PM:
-

[~sowen] [~hyukjin.kwon] 

 

I have executed the same query in spark-2.4.4 it works as per expectation.

 

As you can see from 2.4.4 plan, columns are casted to double,  so there won't 
be *Integer overflow.*

== Parsed Logical Plan ==
 'Project [cast(('col0 / 'col1) as bigint) AS CAST((col0 / col1) AS 
BIGINT)#4|#4]
 +- Relation[col0#0,col1#1|#0,col1#1] csv

== Analyzed Logical Plan ==
 CAST((col0 / col1) AS BIGINT): bigint
 Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS 
CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as 
bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
 +- Relation[col0#0,col1#1|#0,col1#1] csv

== Optimized Logical Plan ==
 Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS 
CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as 
bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
 +- Relation[col0#0,col1#1|#0,col1#1] csv

== Physical Plan ==
 *(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as 
bigint) AS CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as 
double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
 +- *(1) FileScan csv [col0#0,col1#1|#0,col1#1] Batched: false, Format: CSV, 
Location: 
InMemoryFileIndex[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv],
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct
 *(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as 
bigint) AS CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as 
double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L]
 +- *(1) FileScan csv [col0#0,col1#1|#0,col1#1] Batched: false, Format: CSV, 
Location: 
InMemoryFileIndex[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv],
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct

*Spark-3.0 Plan*

== Parsed Logical Plan ==
 'Project [('col0 div 'col1) AS (col0 div col1)#4|#4]
 +- RelationV2[col0#0, col1#1|#0, col1#1] csv 
[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv]

== Analyzed Logical Plan ==
 (col0 div col1): int
 Project [(col0#0 div col1#1) AS (col0 div col1)#4|#0 div col1#1) AS (col0 div 
col1)#4]
 +- RelationV2[col0#0, col1#1|#0, col1#1] csv 
[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv]

== Optimized Logical Plan ==
 Project [(col0#0 div col1#1) AS (col0 div col1)#4|#0 div col1#1) AS (col0 div 
col1)#4]
 +- RelationV2[col0#0, col1#1|#0, col1#1] csv 
[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv]

== Physical Plan ==
 *(1) Project [(col0#0 div col1#1) AS (col0 div col1)#4|#0 div col1#1) AS (col0 
div col1)#4]
 +- BatchScan[col0#0, col1#1|#0, col1#1] CSVScan Location: 
InMemoryFileIndex[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv],
 ReadSchema: struct

 

In Spark3 do I need to cast the columns as in spark-2.4,  or user should 
manually add cast to their query as per below example

 

val schema = "col0 int,col1 int";
 val df = spark.read.schema(schema).csv("file:/opt/fordebug/divTest.csv");
 val res = df.selectExpr("col0 div col1")
 val res = df.selectExpr("Cast(col0 as Decimal) div col1 ")
 res.collect

 

please let us know your opinion 


was (Author: sandeep.katta2007):
[~sowen] [~hyukjin.kwon] 

 

I have executed the same query in spark-2.4.4 it works as per expectation.

 

As you can see from 2.4.4 plan, columns are casted to double,  so there won't 
be *Integer overflow.*

== Parsed Logical Plan ==
'Project [cast(('col0 / 'col1) as bigint) AS CAST((col0 / col1) AS BIGINT)#4]
+- Relation[col0#0,col1#1] csv

== Analyzed Logical Plan ==
CAST((col0 / col1) AS BIGINT): bigint
Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS 
CAST((col0 / col1) AS BIGINT)#4L]
+- Relation[col0#0,col1#1] csv

== Optimized Logical Plan ==
Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS 
CAST((col0 / col1) AS BIGINT)#4L]
+- Relation[col0#0,col1#1] csv

== Physical Plan ==
*(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) 
AS CAST((col0 / col1) AS BIGINT)#4L]
+- *(1) FileScan csv [col0#0,col1#1] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:/opt/fordebug/divTest.csv], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct
*(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) 
AS CAST((col0 / col1) AS BIGINT)#4L]
+- *(1) FileScan csv [col0#0,col1#1] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:/opt/fordebug/divTest.csv], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct


Spark-3.0


== Parsed Logical Plan ==
'Project [('col0 div 'col1) AS (col0 

[jira] [Commented] (SPARK-31761) Sql Div operator can result in incorrect output for int_min

2020-05-20 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112368#comment-17112368
 ] 

Sandeep Katta commented on SPARK-31761:
---

[~sowen] [~hyukjin.kwon] 

 

I have executed the same query in Spark 2.4.4, and it works as expected.

 

As you can see from the 2.4.4 plan, the columns are cast to double, so there won't
be an *Integer overflow*.

== Parsed Logical Plan ==
'Project [cast(('col0 / 'col1) as bigint) AS CAST((col0 / col1) AS BIGINT)#4]
+- Relation[col0#0,col1#1] csv

== Analyzed Logical Plan ==
CAST((col0 / col1) AS BIGINT): bigint
Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS 
CAST((col0 / col1) AS BIGINT)#4L]
+- Relation[col0#0,col1#1] csv

== Optimized Logical Plan ==
Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS 
CAST((col0 / col1) AS BIGINT)#4L]
+- Relation[col0#0,col1#1] csv

== Physical Plan ==
*(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) 
AS CAST((col0 / col1) AS BIGINT)#4L]
+- *(1) FileScan csv [col0#0,col1#1] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:/opt/fordebug/divTest.csv], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct
*(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) 
AS CAST((col0 / col1) AS BIGINT)#4L]
+- *(1) FileScan csv [col0#0,col1#1] Batched: false, Format: CSV, Location: 
InMemoryFileIndex[file:/opt/fordebug/divTest.csv], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct


Spark-3.0


== Parsed Logical Plan ==
'Project [('col0 div 'col1) AS (col0 div col1)#4]
+- RelationV2[col0#0, col1#1] csv file:/opt/fordebug/divTest.csv

== Analyzed Logical Plan ==
(col0 div col1): int
Project [(col0#0 div col1#1) AS (col0 div col1)#4]
+- RelationV2[col0#0, col1#1] csv file:/opt/fordebug/divTest.csv

== Optimized Logical Plan ==
Project [(col0#0 div col1#1) AS (col0 div col1)#4]
+- RelationV2[col0#0, col1#1] csv file:/opt/fordebug/divTest.csv

== Physical Plan ==
*(1) Project [(col0#0 div col1#1) AS (col0 div col1)#4]
+- BatchScan[col0#0, col1#1] CSVScan Location: 
InMemoryFileIndex[file:/opt/fordebug/divTest.csv], ReadSchema: 
struct

 

In Spark 3, should the columns be cast automatically as they are in Spark 2.4, or
should the user manually add a cast to the query, as in the example below?

 

val schema = "col0 int,col1 int";
val df = spark.read.schema(schema).csv("file:/opt/fordebug/divTest.csv");
val res = df.selectExpr("col0 div col1")
val res = df.selectExpr("Cast(col0 as Decimal) div col1 ")
res.collect

 

Please let us know your opinion.

> Sql Div operator can result in incorrect output for int_min
> ---
>
> Key: SPARK-31761
> URL: https://issues.apache.org/jira/browse/SPARK-31761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kuhu Shukla
>Priority: Major
>
> Input  in csv : -2147483648,-1  --> (_c0, _c1)
> {code}
> val res = df.selectExpr("_c0 div _c1")
> res.collect
> res1: Array[org.apache.spark.sql.Row] = Array([-2147483648])
> {code}
> The result should be 2147483648 instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join

2020-05-20 Thread Puviarasu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112359#comment-17112359
 ] 

Puviarasu edited comment on SPARK-31754 at 5/20/20, 3:36 PM:
-

Hello [~kabhwan] , 

Please find below the comments in *bold*.
 # Is it always reproducible under the same input & checkpoint (start from 
initial checkpoint or specific checkpoint)?
 *Same Checkpoint: With the same checkpoint and input, the application fails with
the same exception at the same offset.* 
*New Checkpoint: We also tested clearing the checkpoint with the same input. In
this case [after clearing the checkpoint] the exception didn't happen for that
particular input.* 
 # Could you share the query plan (logical/physical)? Query plan from previous 
batch would be OK.
 *Sure. Please find the attachment [^Logical-Plan.txt]*
 # Could you try it out with recent version like 2.4.5 or 3.0.0-preview2 so 
that we can avoid investigating issue which might be already resolved?
 *For this we might need some more time, as we need some changes to be made in
our cluster settings. Kindly bear with us on the delay.* 

Thank you. 
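
As a side note, a minimal sketch of one way to capture the plan of the running streaming query for an attachment like the one above; the console sink here is purely illustrative (the original job writes parquet):

{code:scala}
// Assuming the temp views from the issue description have been registered
// on an active SparkSession `spark`.
val query = spark.sql("select * from join")
  .writeStream
  .outputMode("append")
  .format("console")   // illustrative sink, just to get the query running
  .start()

// Prints the parsed/analyzed/optimized/physical plans of the most recent batch.
query.explain(true)
{code}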


was (Author: puviarasu):
Hello [~kabhwan] , 

Please find below the comments in *bold*.
 # Is it always reproducible under the same input & checkpoint (start from 
initial checkpoint or specific checkpoint)? 
*With the same checkpoint and input, the application fails with the same
exception at the same offset. We also tested clearing the checkpoint with the
same input. In this case the exception didn't happen for that particular input.* 
 # Could you share the query plan (logical/physical)? Query plan from previous 
batch would be OK. 
*Sure. Please find the attachment [^Logical-Plan.txt]*
 # Could you try it out with recent version like 2.4.5 or 3.0.0-preview2 so 
that we can avoid investigating issue which might be already resolved? 
*For this we might need some more time, as we need some changes to be made in
our cluster settings. Kindly bear with us on the delay.* 

Thank you. 

> Spark Structured Streaming: NullPointerException in Stream Stream join
> --
>
> Key: SPARK-31754
> URL: https://issues.apache.org/jira/browse/SPARK-31754
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark Version : 2.4.0
> Hadoop Version : 3.0.0
>Reporter: Puviarasu
>Priority: Major
>  Labels: structured-streaming
> Attachments: CodeGen.txt, Logical-Plan.txt
>
>
> When joining 2 streams with watermarking and windowing, we are getting a 
> NullPointerException after running for a few minutes. 
> After the failure we analyzed the checkpoint offsets/sources and found the files 
> for which the application failed. These files do not have any null values 
> in the join columns. 
> We even started the job with the files and the application ran. From this we 
> concluded that the exception is not because of the data from the streams.
> *Code:*
>  
> {code:java}
> val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> 
> "1" )
>  val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> 
> "1" )
>  
> spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1")
>  
> spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2")
>  spark.sql("select * from source1 where eventTime1 is not null and col1 is 
> not null").withWatermark("eventTime1", "30 
> minutes").createTempView("viewNotNull1")
>  spark.sql("select * from source2 where eventTime2 is not null and col2 is 
> not null").withWatermark("eventTime2", "30 
> minutes").createTempView("viewNotNull2")
>  spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = 
> b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + 
> interval 2 hours").createTempView("join")
>  val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> 
> "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3")
>  spark.sql("select * from 
> join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 
> seconds")).format("parquet").options(optionsMap3).start()
> {code}
>  
> *Exception:*
>  
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
> Aborting TaskSet 4.0 because task 0 (partition 0)
> cannot run anywhere due to node and executor blacklist.
> Most recent failure:
> Lost 

[jira] [Commented] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join

2020-05-20 Thread Puviarasu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112359#comment-17112359
 ] 

Puviarasu commented on SPARK-31754:
---

Hello [~kabhwan] , 

Please find below the comments in *bold*.
 # Is it always reproducible under the same input & checkpoint (start from 
initial checkpoint or specific checkpoint)? 
*With the same checkpoint and input, the application fails with the same
exception at the same offset. We also tested clearing the checkpoint with the
same input. In this case the exception didn't happen for that particular input.* 
 # Could you share the query plan (logical/physical)? Query plan from previous 
batch would be OK. 
*Sure. Please find the attachment [^Logical-Plan.txt]*
 # Could you try it out with recent version like 2.4.5 or 3.0.0-preview2 so 
that we can avoid investigating issue which might be already resolved? 
*For this we might need some more time, as we need some changes to be made in
our cluster settings. Kindly bear with us on the delay.* 

Thank you. 

> Spark Structured Streaming: NullPointerException in Stream Stream join
> --
>
> Key: SPARK-31754
> URL: https://issues.apache.org/jira/browse/SPARK-31754
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark Version : 2.4.0
> Hadoop Version : 3.0.0
>Reporter: Puviarasu
>Priority: Major
>  Labels: structured-streaming
> Attachments: CodeGen.txt, Logical-Plan.txt
>
>
> When joining 2 streams with watermarking and windowing, we are getting a 
> NullPointerException after running for a few minutes. 
> After the failure we analyzed the checkpoint offsets/sources and found the files 
> for which the application failed. These files do not have any null values 
> in the join columns. 
> We even started the job with the files and the application ran. From this we 
> concluded that the exception is not because of the data from the streams.
> *Code:*
>  
> {code:java}
> val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> 
> "1" )
>  val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> 
> "1" )
>  
> spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1")
>  
> spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2")
>  spark.sql("select * from source1 where eventTime1 is not null and col1 is 
> not null").withWatermark("eventTime1", "30 
> minutes").createTempView("viewNotNull1")
>  spark.sql("select * from source2 where eventTime2 is not null and col2 is 
> not null").withWatermark("eventTime2", "30 
> minutes").createTempView("viewNotNull2")
>  spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = 
> b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + 
> interval 2 hours").createTempView("join")
>  val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> 
> "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3")
>  spark.sql("select * from 
> join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 
> seconds")).format("parquet").options(optionsMap3).start()
> {code}
>  
> *Exception:*
>  
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
> Aborting TaskSet 4.0 because task 0 (partition 0)
> cannot run anywhere due to node and executor blacklist.
> Most recent failure:
> Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157)
> at 

[jira] [Updated] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join

2020-05-20 Thread Puviarasu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Puviarasu updated SPARK-31754:
--
Attachment: Logical-Plan.txt

> Spark Structured Streaming: NullPointerException in Stream Stream join
> --
>
> Key: SPARK-31754
> URL: https://issues.apache.org/jira/browse/SPARK-31754
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark Version : 2.4.0
> Hadoop Version : 3.0.0
>Reporter: Puviarasu
>Priority: Major
>  Labels: structured-streaming
> Attachments: CodeGen.txt, Logical-Plan.txt
>
>
> When joining 2 streams with watermarking and windowing, we are getting a 
> NullPointerException after running for a few minutes. 
> After the failure we analyzed the checkpoint offsets/sources and found the files 
> for which the application failed. These files do not have any null values 
> in the join columns. 
> We even started the job with the files and the application ran. From this we 
> concluded that the exception is not because of the data from the streams.
> *Code:*
>  
> {code:java}
> val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> 
> "1" )
>  val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> 
> "1" )
>  
> spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1")
>  
> spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2")
>  spark.sql("select * from source1 where eventTime1 is not null and col1 is 
> not null").withWatermark("eventTime1", "30 
> minutes").createTempView("viewNotNull1")
>  spark.sql("select * from source2 where eventTime2 is not null and col2 is 
> not null").withWatermark("eventTime2", "30 
> minutes").createTempView("viewNotNull2")
>  spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = 
> b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + 
> interval 2 hours").createTempView("join")
>  val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> 
> "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3")
>  spark.sql("select * from 
> join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 
> seconds")).format("parquet").options(optionsMap3).start()
> {code}
>  
> *Exception:*
>  
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
> Aborting TaskSet 4.0 because task 0 (partition 0)
> cannot run anywhere due to node and executor blacklist.
> Most recent failure:
> Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply$mcV$spala:338)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream)
> at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:583)
> at 
> 

[jira] [Updated] (SPARK-31766) Add Spark version prefix to K8s UUID test image tag

2020-05-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31766:
--
Fix Version/s: (was: 3.1.0)
   3.0.0

> Add Spark version prefix to K8s UUID test image tag
> ---
>
> Key: SPARK-31766
> URL: https://issues.apache.org/jira/browse/SPARK-31766
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31766) Add Spark version prefix to K8s UUID test image tag

2020-05-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31766:
--
Affects Version/s: (was: 3.1.0)
   3.0.0

> Add Spark version prefix to K8s UUID test image tag
> ---
>
> Key: SPARK-31766
> URL: https://issues.apache.org/jira/browse/SPARK-31766
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2020-05-20 Thread Agrim Bansal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112339#comment-17112339
 ] 

Agrim Bansal commented on SPARK-12312:
--

Are we not solving this for the Hive database?

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from a JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to the lack of a Kerberos ticket or the ability to generate one. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in an enterprise environment where exposing simple 
> authentication access is not an option due to IT policy issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31437) Try assigning tasks to existing executors by which required resources in ResourceProfile are satisfied

2020-05-20 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112245#comment-17112245
 ] 

Thomas Graves commented on SPARK-31437:
---

Originally, when I thought about this briefly, I was thinking of adding something
to ResourceProfile similar to what you suggest, which would be an option as to
whether to always create new executors using the ExecutorResourceRequests
specified in the ResourceProfile or to try to reuse existing executors that fit.
I think what you are proposing is along those lines.

Now there are a bunch of edge cases here, like what if there aren't any existing
executors that it would fit in. That is a good reason to keep the
ExecutorResourceRequirements in the profile and make the user specify them even
if it might fit another one.

Another option: I was wanting to give resource profiles names. You could
potentially use that, or perhaps give the ExecutorResourceRequirements names as
well and let the user optionally specify the names to use if they already exist.

>> We don't have to change any existing behaviour for cases like this. Just 
>> boot up new executors.

Note it looks like I had a typo: the 4 cpus should be 4 gpus.

I guess that is OK if you are using an exact-match option for the
ExecutorResourceRequests.

Otherwise, if it's just a "fits" option, then it would be harder to control.
Let's say you have 2 active profiles, one with 8 cores and 4 gpus and one with
8 cores, and then you have a third where you want to start something that uses
2 cores. You tell the third to use existing executors, but that means it could
use the ones with 8 cores and 4 gpus, and it would waste the gpus. If you tell
it not to reuse executors, then it would boot up more and not do what you really
want. This might be where something like the names could come in as
include/exclude lists, but then that gets more complicated for the user as well.

So overall, if we just do exact matches for the ExecutorResourceRequests, that
could be fairly straightforward. The tracking in the allocation manager could
just change from per ResourceProfile to per ExecutorResourceRequests, and I
think the scheduler could similarly do something like that.

I would just want to make sure whatever end-user API we come up with would be
extensible to the other cases where it will fit.
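
For reference, a minimal sketch of the Spark 3.1 stage-level scheduling API this discussion builds on; the 8-core / 4-GPU numbers mirror the example above, and the discovery-script path and RDD are purely illustrative:

{code:scala}
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Assuming an active SparkContext `sc` (e.g. spark.sparkContext).

// ML profile: 8 cores and 4 GPUs per executor, 1 GPU per task.
val mlExecReqs = new ExecutorResourceRequests()
  .cores(8)
  .resource("gpu", 4, "/opt/spark/bin/getGpus.sh")  // discovery script path is illustrative
val mlTaskReqs = new TaskResourceRequests().cpus(1).resource("gpu", 1)
val mlProfile = new ResourceProfileBuilder().require(mlExecReqs).require(mlTaskReqs).build()

// ETL profile: same 8 cores but no GPUs; today this gets its own executors
// even though the tasks could fit on the ML executors.
val etlExecReqs = new ExecutorResourceRequests().cores(8)
val etlTaskReqs = new TaskResourceRequests().cpus(1)
val etlProfile = new ResourceProfileBuilder().require(etlExecReqs).require(etlTaskReqs).build()

// Profiles are attached per RDD; scheduling currently requires an exact profile-ID match.
val rdd = sc.parallelize(1 to 100, 4).withResources(etlProfile)
{code}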

> Try assigning tasks to existing executors by which required resources in 
> ResourceProfile are satisfied
> --
>
> Key: SPARK-31437
> URL: https://issues.apache.org/jira/browse/SPARK-31437
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.1.0
>Reporter: Hongze Zhang
>Priority: Major
>
> By the change in [PR|https://github.com/apache/spark/pull/27773] of 
> SPARK-29154, submitted tasks are scheduled onto executors only if resource 
> profile IDs strictly match. As a result Spark always starts new executors for 
> customized ResourceProfiles.
> This limitation makes working with process-local jobs unfriendly. E.g. if task 
> cores have been increased from 1 to 4 in a new stage and the executor has 8 
> slots, it is expected that 2 new tasks can be run on the existing executor, 
> but Spark starts new executors for the new ResourceProfile. This behavior is 
> unnecessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31759) Support configurable max number of rotate logs for spark daemons

2020-05-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31759.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28580
[https://github.com/apache/spark/pull/28580]

> Support configurable max number of rotate logs for spark daemons
> 
>
> Key: SPARK-31759
> URL: https://issues.apache.org/jira/browse/SPARK-31759
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core, Spark Submit
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.1.0
>
>
> In `spark-daemon.sh`, `spark_rotate_log` accepts $2 as a custom setting for 
> the maximum number of rotated log files, but this part of the code is actually 
> never used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31759) Support configurable max number of rotate logs for spark daemons

2020-05-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31759:


Assignee: Kent Yao

> Support configurable max number of rotate logs for spark daemons
> 
>
> Key: SPARK-31759
> URL: https://issues.apache.org/jira/browse/SPARK-31759
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core, Spark Submit
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
>
> In `spark-daemon.sh`, `spark_rotate_log` accepts $2 as a custom setting for 
> the maximum number of rotated log files, but this part of the code is actually 
> never used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31710) result is the not the same when query and execute jobs

2020-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112021#comment-17112021
 ] 

Apache Spark commented on SPARK-31710:
--

User 'GuoPhilipse' has created a pull request for this issue:
https://github.com/apache/spark/pull/28593

> result is the not the same when query and execute jobs
> --
>
> Key: SPARK-31710
> URL: https://issues.apache.org/jira/browse/SPARK-31710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: hdp:2.7.7
> spark:2.4.5
>Reporter: philipse
>Priority: Major
>
> Hi Team
> Steps to reproduce.
> {code:java}
> create table test(id bigint);
> insert into test select 1586318188000;
> create table test1(id bigint) partitioned by (year string);
> insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) 
> from test;
> {code}
> let's check the result. 
> Case 1:
> *select * from test1;*
> 234 | 52238-06-04 13:06:400.0
> --the result is wrong
> Case 2:
> *select 234,cast(id as TIMESTAMP) from test;*
>  
> java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd 
> hh:mm:ss[.fffffffff]
>  at java.sql.Timestamp.valueOf(Timestamp.java:237)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530)
>  at org.apache.hive.beeline.Rows$Row.(Rows.java:166)
>  at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:43)
>  at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756)
>  at org.apache.hive.beeline.Commands.execute(Commands.java:826)
>  at org.apache.hive.beeline.Commands.sql(Commands.java:670)
>  at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974)
>  at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810)
>  at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767)
>  at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480)
>  at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
>  Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0)
>  
> I tried Hive; it works well, and the conversion is fine and correct
> {code:java}
> select 234,cast(id as TIMESTAMP) from test;
>  234   2020-04-08 11:56:28
> {code}
> Two questions:
> q1:
> if we forbid this conversion, should we keep all cases the same?
> q2:
> if we allow the conversion in some cases, should we decide on the length of the long value? 
> The code seems to force the conversion to ns with times*100 no matter how long 
> the data is; if it converts to a timestamp with an incorrect length, we can raise 
> an error.
> {code:java}
> // converting seconds to us
> private[this] def longToTimestamp(t: Long): Long = t * 100L{code}
>  
> Thanks!
>  
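
As background, Spark's cast from an integral value to TIMESTAMP treats the value as seconds since the epoch, which is why a millisecond epoch like 1586318188000 lands in the year 52238. A minimal sketch of a workaround, assuming the stored value is really epoch milliseconds:

{code:scala}
// Rescale milliseconds to (fractional) seconds before casting.
spark.sql("select 234, cast(id / 1000 as timestamp) from test").show(false)
// Expected to show a 2020-04-08 timestamp; the exact time depends on the session timezone.
{code}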



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31771) Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'

2020-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31771:


Assignee: Apache Spark

> Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'
> -
>
> Key: SPARK-31771
> URL: https://issues.apache.org/jira/browse/SPARK-31771
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> Five consecutive pattern characters from 'G/M/L/E/u/Q/q' mean Narrow-Text 
> Style in java.time.DateTimeFormatterBuilder, which outputs only the leading single 
> letter of the value, e.g. `December` would be `D`, while in Spark 2.4 they 
> mean Full-Text Style.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31771) Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'

2020-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111985#comment-17111985
 ] 

Apache Spark commented on SPARK-31771:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/28592

> Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'
> -
>
> Key: SPARK-31771
> URL: https://issues.apache.org/jira/browse/SPARK-31771
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Five consecutive pattern characters from 'G/M/L/E/u/Q/q' mean Narrow-Text 
> Style in java.time.DateTimeFormatterBuilder, which outputs only the leading single 
> letter of the value, e.g. `December` would be `D`, while in Spark 2.4 they 
> mean Full-Text Style.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31771) Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'

2020-05-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31771:


Assignee: (was: Apache Spark)

> Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'
> -
>
> Key: SPARK-31771
> URL: https://issues.apache.org/jira/browse/SPARK-31771
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Five consecutive pattern characters from 'G/M/L/E/u/Q/q' mean Narrow-Text 
> Style in java.time.DateTimeFormatterBuilder, which outputs only the leading single 
> letter of the value, e.g. `December` would be `D`, while in Spark 2.4 they 
> mean Full-Text Style.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31437) Try assigning tasks to existing executors by which required resources in ResourceProfile are satisfied

2020-05-20 Thread Hongze Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084552#comment-17084552
 ] 

Hongze Zhang edited comment on SPARK-31437 at 5/20/20, 9:14 AM:


(edited)

Thanks [~tgraves]. I got your point of making them tied. 

Actually I was thinking of something like this:

1. to break ResourceProfile up into ExecutorResourceProfile and ResourceProfile;
2. ResourceProfile still contains both resource requirement of executor and 
task;
3. ExecutorResourceProfile only includes executor's resource req;
4. ExecutorResourceProfile is required to allocate new executor instances from 
scheduler backend; 
5. Similar to current solution, user specifies ResourceProfile for RDD, then 
tasks are scheduled onto executors that are allocated using 
ExecutorResourceProfile;
6. Each time ResourceProfile comes, ExecutorResourceProfile is created/selected 
within one of several strategies;

Strategy types:

s1. Always creates new ExecutorResourceProfile;
s2. If executor resource requirement in ResourceProfile meets existing 
ExecutorResourceProfile ("meets" may mean exactly matches/equals), use the 
existing one;
s3. ...

bq. My etl tasks uses 8 cores, my ml tasks use 8 cores and 4 cpus.  How do I 
keep my etl tasks from running on the ML executors without wasting resources?

We don't have to change any existing behaviour for cases like this. Just boot 
up new executors.

For now the major problem is that even if ResourceProfile.executorResources is not 
changed in a new ResourceProfile (e.g. task.cpu changed from 1 to 2), we still 
have to shut down the old executors and then start new ones, right? This is the 
thing to be optimized.




was (Author: zhztheplayer):
Thanks [~tgraves]. I got your point of making them tied. 

Actually I was thinking of something like this:

1. to break ResourceProfile up to ExecutorResourceProfile and ResourceProfile;
2. ResourceProfile still contains both resource requirement of executor and 
task;
3. ExecutorResourceProfile only includes executor's resource req;
4. ExecutorResourceProfile is required to allocate new executor instances from 
scheduler backend; 
5. Similar to current solution, user specifies ResourceProfile for RDD, then 
tasks are scheduled onto executors that are allocated using 
ExecutorResourceProfile;
6. Each time ResourceProfile comes, ExecutorResourceProfile is created/selected 
within one of several strategies;

Strategies types:

s1. Always creates new ExecutorResourceProfile;
s2. If executor resource requirement in ResourceProfile meets existing 
ExecutorResourceProfile, use the existing one;
s3. ...

bq. My etl tasks uses 8 cores, my ml tasks use 8 cores and 4 cpus.  How do I 
keep my etl tasks from running on the ML executors without wasting resources?

By just using strategy s1, everything should work as current implementation.



> Try assigning tasks to existing executors by which required resources in 
> ResourceProfile are satisfied
> --
>
> Key: SPARK-31437
> URL: https://issues.apache.org/jira/browse/SPARK-31437
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.1.0
>Reporter: Hongze Zhang
>Priority: Major
>
> By the change in [PR|https://github.com/apache/spark/pull/27773] of 
> SPARK-29154, submitted tasks are scheduled onto executors only if resource 
> profile IDs strictly match. As a result Spark always starts new executors for 
> customized ResourceProfiles.
> This limitation makes working with process-local jobs unfriendly. E.g. if task 
> cores have been increased from 1 to 4 in a new stage and the executor has 8 
> slots, it is expected that 2 new tasks can be run on the existing executor, 
> but Spark starts new executors for the new ResourceProfile. This behavior is 
> unnecessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31771) Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'

2020-05-20 Thread Kent Yao (Jira)
Kent Yao created SPARK-31771:


 Summary: Disable Narrow TextStyle for datetime pattern 
'G/M/L/E/u/Q/q'
 Key: SPARK-31771
 URL: https://issues.apache.org/jira/browse/SPARK-31771
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao


Five consecutive pattern characters from 'G/M/L/E/u/Q/q' mean Narrow-Text Style 
in java.time.DateTimeFormatterBuilder, which outputs only the leading single letter of 
the value, e.g. `December` would be `D`, while in Spark 2.4 they mean 
Full-Text Style.
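
For reference, a minimal JVM-only sketch of the difference between the two styles described above (no Spark involved):

{code:scala}
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

val dec = LocalDate.of(2020, 12, 1)
// Four pattern letters -> full text style.
DateTimeFormatter.ofPattern("MMMM", Locale.US).format(dec)   // "December"
// Five pattern letters -> narrow text style, only the leading letter.
DateTimeFormatter.ofPattern("MMMMM", Locale.US).format(dec)  // "D"
{code}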



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31770) spark double count issue when table is recreated from hive

2020-05-20 Thread Nithin (Jira)
Nithin created SPARK-31770:
--

 Summary: spark double count issue when table is recreated from hive
 Key: SPARK-31770
 URL: https://issues.apache.org/jira/browse/SPARK-31770
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.2
Reporter: Nithin


Spark sets a serde property named path when a table is created. Along with it, 
it adds a property called sql.source.provider that ensures consistency. 
But if we recreate the same table using Hive, then Hive will not respect 
the serde property set by Spark named sql.source.provider. This will lead to 
a conflict if we read the newly created table using Spark.

 

It can be argued that the issue is with Hive because it does not properly ensure 
that the table properties are copied. But these are properties set by 
Spark on a Hive table, and Spark should be the one adhering to the standards of 
Hive. From a corporate standpoint, we cannot tell users to avoid creating or 
altering tables in Hive because Spark will end up giving wrong results. *Worst 
of all, if it simply failed that would be okay, but it gives a double count 
without even a failure message, and that is very dangerous for critical 
applications*. Either give an error message if the serde properties do not 
include path as well as sql.source.provider, or do not even try to create the 
path property in the first place. A few related issues: 
 # SPARK-31751
 # SPARK-28266 was resolved as non-reproducible which is not entirely true

 

*steps to reproduce :*

 
{code:java}
df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
df.createOrReplaceTempView("test1")
spark.sql("CREATE TABLE test1_using_orc USING ORC AS (SELECT * from test1)")
{code}
 

Run the below from Hive (preferably Tez):

 
{code:java}
create table test2_like_test1 like test1_using_orc;
insert into test2_like_test1 select * from test1_using_orc;{code}

--- From Spark again: now test2_like_test1 will have the serde property called 
path, but will not have sql.source.provider. 

 

 
{code:java}
>>> spark.sql("set spark.sql.hive.convertMetastoreOrc=false").show()
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.hive.co...|false|
+--------------------+-----+
>>> spark.sql("select count(*) from test2_like_test1").show()
+--------+
|count(1)|
+--------+
|       1|
+--------+
>>> spark.sql("set spark.sql.hive.convertMetastoreOrc=true").show()
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.hive.co...| true|
+--------------------+-----+
>>> spark.sql("select count(*) from test2_like_test1").show()
+--------+
|count(1)|
+--------+
|       2|
+--------+
{code}
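
As a side note, a quick way to compare the metadata Spark sees for the two tables is to dump it from Spark SQL; a minimal sketch using the table names from the reproduction above:

{code:scala}
// Compare the metadata of the Spark-created table and the Hive-recreated one.
spark.sql("describe formatted test1_using_orc").show(200, false)
spark.sql("describe formatted test2_like_test1").show(200, false)
// Alternatively, dump the raw table properties to check which ones survived.
spark.sql("show tblproperties test2_like_test1").show(200, false)
{code}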
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31769) Add support of MDC in spark driver logs.

2020-05-20 Thread Izek Greenfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111955#comment-17111955
 ] 

Izek Greenfield commented on SPARK-31769:
-

Hi [~cloud_fan] Following previous discussions on 
[PR-26624|https://github.com/apache/spark/pull/26624], I'd like us to agree on 
the necessity of this feature, after which I'll create a dedicated PR.

What are your thoughts?

> Add support of MDC in spark driver logs.
> 
>
> Key: SPARK-31769
> URL: https://issues.apache.org/jira/browse/SPARK-31769
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Izek Greenfield
>Priority: Major
>
> In order to align driver logs with Spark executor logs, add support for logging 
> MDC state for properties that are shared between the driver and the 
> executors (e.g. application name, task id). 
>  The use case is applicable in 2 cases:
>  # streaming
>  # running spark under job server (in this case you have a long-running spark 
> context that handles many tasks from different sources).
> See also [SPARK-8981|https://issues.apache.org/jira/browse/SPARK-8981] that 
> handles the MDC in executor context.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31769) Add support of MDC in spark driver logs.

2020-05-20 Thread Izek Greenfield (Jira)
Izek Greenfield created SPARK-31769:
---

 Summary: Add support of MDC in spark driver logs.
 Key: SPARK-31769
 URL: https://issues.apache.org/jira/browse/SPARK-31769
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Izek Greenfield


In order to align driver logs with Spark executor logs, add support for logging MDC 
state for properties that are shared between the driver and the executors 
(e.g. application name, task id). 
 The use case is applicable in 2 cases:
 # streaming
 # running spark under job server (in this case you have a long-running spark 
context that handles many tasks from different sources).

See also [SPARK-8981|https://issues.apache.org/jira/browse/SPARK-8981] that 
handles the MDC in executor context.
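
For context, a minimal sketch of what putting shared keys into the SLF4J MDC could look like on the driver side; the key names and the %X{...} pattern-layout reference are illustrative, not necessarily what the eventual change would use:

{code:scala}
import org.slf4j.{LoggerFactory, MDC}

// Assuming an active SparkSession `spark`.
val logger = LoggerFactory.getLogger("driver")

// Put shared context into the MDC so a log4j pattern layout containing
// %X{appName} / %X{appId} renders it on every driver log line.
MDC.put("appName", spark.sparkContext.appName)
MDC.put("appId", spark.sparkContext.applicationId)

logger.info("driver-side log line carrying MDC context")
{code}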

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31437) Try assigning tasks to existing executors by which required resources in ResourceProfile are satisfied

2020-05-20 Thread Hongze Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084552#comment-17084552
 ] 

Hongze Zhang edited comment on SPARK-31437 at 5/20/20, 8:43 AM:


Thanks [~tgraves]. I got your point of making them tied. 

Actually I was thinking of something like this:

1. to break ResourceProfile up to ExecutorResourceProfile and ResourceProfile;
2. ResourceProfile still contains both resource requirement of executor and 
task;
3. ExecutorResourceProfile only includes executor's resource req;
4. ExecutorResourceProfile is required to allocate new executor instances from 
scheduler backend; 
5. Similar to current solution, user specifies ResourceProfile for RDD, then 
tasks are scheduled onto executors that are allocated using 
ExecutorResourceProfile;
6. Each time ResourceProfile comes, ExecutorResourceProfile is created/selected 
within one of several strategies;

Strategies types:

s1. Always creates new ExecutorSpec;
s2. If executor resource requirement in ResourceProfile meets existing 
ExecutorResourceProfile, use the existing one;
s3. ...

bq. My etl tasks uses 8 cores, my ml tasks use 8 cores and 4 cpus.  How do I 
keep my etl tasks from running on the ML executors without wasting resources?

By just using strategy s1, everything should work as current implementation.




was (Author: zhztheplayer):
Thanks [~tgraves]. I got your point of making them tied. 

Actually I was thinking of something like this:

1. to break ResourceProfile up to ExecutorResourceProfile and ResourceProfile;
2. ResourceProfile still contains both resource requirement of executor and 
task;
3. ExecutorResourceProfile only includes executor's resource req;
4. ExecutorResourceProfile is required to allocate new executor instances from 
scheduler backend; 
5. Similar to current solution, user specifies ResourceProfile for RDD, then 
tasks are scheduled onto executors that are allocated using 
ExecutorResourceProfile;
6. Each time ResourceProfile comes, ExecutorResourceProfile is created/selected 
within one of several strategies;

Strategies types:

s1. Always creates new ExecutorSpec;
s2. If executor resource requirement in ResourceProfile meets existing 
ExecutorSpec, use the existing one;
s3. ...

bq. My etl tasks uses 8 cores, my ml tasks use 8 cores and 4 cpus.  How do I 
keep my etl tasks from running on the ML executors without wasting resources?

By just using strategy s1, everything should work as current implementation.



> Try assigning tasks to existing executors by which required resources in 
> ResourceProfile are satisfied
> --
>
> Key: SPARK-31437
> URL: https://issues.apache.org/jira/browse/SPARK-31437
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.1.0
>Reporter: Hongze Zhang
>Priority: Major
>
> By the change in [PR|https://github.com/apache/spark/pull/27773] of 
> SPARK-29154, submitted tasks are scheduled onto executors only if resource 
> profile IDs strictly match. As a result Spark always starts new executors for 
> customized ResourceProfiles.
> This limitation makes working with process-local jobs unfriendly. E.g. if task 
> cores have been increased from 1 to 4 in a new stage and the executor has 8 
> slots, it is expected that 2 new tasks can be run on the existing executor, 
> but Spark starts new executors for the new ResourceProfile. This behavior is 
> unnecessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31437) Try assigning tasks to existing executors by which required resources in ResourceProfile are satisfied

2020-05-20 Thread Hongze Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084552#comment-17084552
 ] 

Hongze Zhang edited comment on SPARK-31437 at 5/20/20, 8:43 AM:


Thanks [~tgraves]. I got your point of making them tied. 

Actually I was thinking of something like this:

1. to break ResourceProfile up to ExecutorResourceProfile and ResourceProfile;
2. ResourceProfile still contains both resource requirement of executor and 
task;
3. ExecutorResourceProfile only includes executor's resource req;
4. ExecutorResourceProfile is required to allocate new executor instances from 
scheduler backend; 
5. Similar to current solution, user specifies ResourceProfile for RDD, then 
tasks are scheduled onto executors that are allocated using 
ExecutorResourceProfile;
6. Each time ResourceProfile comes, ExecutorResourceProfile is created/selected 
within one of several strategies;

Strategies types:

s1. Always creates new ExecutorSpec;
s2. If executor resource requirement in ResourceProfile meets existing 
ExecutorSpec, use the existing one;
s3. ...

bq. My etl tasks uses 8 cores, my ml tasks use 8 cores and 4 cpus.  How do I 
keep my etl tasks from running on the ML executors without wasting resources?

By just using strategy s1, everything should work as current implementation.




was (Author: zhztheplayer):
Thanks [~tgraves]. I got your point of making them tied. 

Actually I was thinking of something like this:

1. to break ResourceProfile up to ExecutorSpec and ResourceProfile;
2. ResourceProfile's structure remains as is;
3. ExecutorSpec only includes resource requirements for executor, task resource 
requirement is removed;
4. ExecutorSpec is required to allocate new executor instances from scheduler 
backend; 
5. Similar to current solution, user specifies ResourceProfile for RDD, then 
tasks are scheduled onto executors that are allocated using ExecutorResource;
6. Each time ResourceProfile comes, ExecutorSpec is created/selected within one 
of several strategies;

Strategies types:

s1. Always creates new ExecutorSpec;
s2. If executor resource requirement in ResourceProfile meets existing 
ExecutorSpec, use the existing one;
s3. ...

bq. My etl tasks uses 8 cores, my ml tasks use 8 cores and 4 cpus.  How do I 
keep my etl tasks from running on the ML executors without wasting resources?

By just using strategy s1, everything should work as current implementation.



> Try assigning tasks to existing executors by which required resources in 
> ResourceProfile are satisfied
> --
>
> Key: SPARK-31437
> URL: https://issues.apache.org/jira/browse/SPARK-31437
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.1.0
>Reporter: Hongze Zhang
>Priority: Major
>
> By the change in [PR|https://github.com/apache/spark/pull/27773] of 
> SPARK-29154, submitted tasks are scheduled onto executors only if resource 
> profile IDs strictly match. As a result Spark always starts new executors for 
> customized ResourceProfiles.
> This limitation makes working with process-local jobs unfriendly. E.g. if task 
> cores have been increased from 1 to 4 in a new stage and the executor has 8 
> slots, it is expected that 2 new tasks can be run on the existing executor, 
> but Spark starts new executors for the new ResourceProfile. This behavior is 
> unnecessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31437) Try assigning tasks to existing executors by which required resources in ResourceProfile are satisfied

2020-05-20 Thread Hongze Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084552#comment-17084552
 ] 

Hongze Zhang edited comment on SPARK-31437 at 5/20/20, 8:43 AM:


Thanks [~tgraves]. I got your point of making them tied. 

Actually I was thinking of something like this:

1. to break ResourceProfile up to ExecutorResourceProfile and ResourceProfile;
2. ResourceProfile still contains both resource requirement of executor and 
task;
3. ExecutorResourceProfile only includes executor's resource req;
4. ExecutorResourceProfile is required to allocate new executor instances from 
scheduler backend; 
5. Similar to current solution, user specifies ResourceProfile for RDD, then 
tasks are scheduled onto executors that are allocated using 
ExecutorResourceProfile;
6. Each time ResourceProfile comes, ExecutorResourceProfile is created/selected 
within one of several strategies;

Strategies types:

s1. Always creates new ExecutorResourceProfile;
s2. If executor resource requirement in ResourceProfile meets existing 
ExecutorResourceProfile, use the existing one;
s3. ...

bq. My etl tasks uses 8 cores, my ml tasks use 8 cores and 4 cpus.  How do I 
keep my etl tasks from running on the ML executors without wasting resources?

By just using strategy s1, everything should work as current implementation.




was (Author: zhztheplayer):
Thanks [~tgraves]. I got your point of making them tied. 

Actually I was thinking of something like this:

1. to break ResourceProfile up to ExecutorResourceProfile and ResourceProfile;
2. ResourceProfile still contains both resource requirement of executor and 
task;
3. ExecutorResourceProfile only includes executor's resource req;
4. ExecutorResourceProfile is required to allocate new executor instances from 
scheduler backend; 
5. Similar to current solution, user specifies ResourceProfile for RDD, then 
tasks are scheduled onto executors that are allocated using 
ExecutorResourceProfile;
6. Each time ResourceProfile comes, ExecutorResourceProfile is created/selected 
within one of several strategies;

Strategies types:

s1. Always creates new ExecutorSpec;
s2. If executor resource requirement in ResourceProfile meets existing 
ExecutorResourceProfile, use the existing one;
s3. ...

bq. My etl tasks uses 8 cores, my ml tasks use 8 cores and 4 cpus.  How do I 
keep my etl tasks from running on the ML executors without wasting resources?

By just using strategy s1, everything should work as current implementation.



> Try assigning tasks to existing executors by which required resources in 
> ResourceProfile are satisfied
> --
>
> Key: SPARK-31437
> URL: https://issues.apache.org/jira/browse/SPARK-31437
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.1.0
>Reporter: Hongze Zhang
>Priority: Major
>
> By the change in [PR|https://github.com/apache/spark/pull/27773] of 
> SPARK-29154, submitted tasks are scheduled onto executors only if resource 
> profile IDs strictly match. As a result Spark always starts new executors for 
> customized ResourceProfiles.
> This limitation makes working with process-local jobs unfriendly. E.g. if task 
> cores have been increased from 1 to 4 in a new stage and the executor has 8 
> slots, it is expected that 2 new tasks can be run on the existing executor, 
> but Spark starts new executors for the new ResourceProfile. This behavior is 
> unnecessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8981) Set applicationId and appName in log4j MDC

2020-05-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-8981:
--

Assignee: Izek Greenfield

> Set applicationId and appName in log4j MDC
> --
>
> Key: SPARK-8981
> URL: https://issues.apache.org/jira/browse/SPARK-8981
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Paweł Kopiczko
>Assignee: Izek Greenfield
>Priority: Minor
> Fix For: 3.1.0
>
>
> It would be nice to have, because it's good to have logs in one file when 
> using log agents (like Logentries) in standalone mode. It also allows 
> configuring a rolling file appender without a mess when multiple applications 
> are running.
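>
> For illustration, a minimal sketch of what this could look like; the MDC key names 
> and the log4j pattern below are assumptions for discussion, not the keys Spark 
> actually sets:
> {code:scala}
> import org.slf4j.{LoggerFactory, MDC}
>
> val logger = LoggerFactory.getLogger("example")
>
> // Hypothetical keys: values such as sc.applicationId / sc.appName would be put
> // into the MDC once per JVM, on the driver and on each executor.
> MDC.put("appId", "app-20200520123456-0001")
> MDC.put("appName", "my-streaming-job")
> logger.info("this line now carries the application context")
>
> // A log4j PatternLayout could then include the MDC fields, e.g.:
> //   log4j.appender.file.layout.ConversionPattern=%d %p [%X{appId}] [%X{appName}] %c: %m%n
> {code}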



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8981) Set applicationId and appName in log4j MDC

2020-05-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-8981.

Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 26624
[https://github.com/apache/spark/pull/26624]

> Set applicationId and appName in log4j MDC
> --
>
> Key: SPARK-8981
> URL: https://issues.apache.org/jira/browse/SPARK-8981
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Paweł Kopiczko
>Priority: Minor
> Fix For: 3.1.0
>
>
> It would be nice to have, because it's good to have logs in one file when 
> using log agents (like Logentries) in standalone mode. It also allows 
> configuring a rolling file appender without a mess when multiple applications 
> are running.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21784) Add ALTER TABLE ADD CONSTRAINT DDL to support defining primary key and foreign keys

2020-05-20 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111866#comment-17111866
 ] 

Takeshi Yamamuro commented on SPARK-21784:
--

Not yet. IIUC there is currently no active work on supporting constraints in Spark.

> Add ALTER TABLE ADD CONSTRAINT DDL to support defining primary key and foreign 
> keys
> --
>
> Key: SPARK-21784
> URL: https://issues.apache.org/jira/browse/SPARK-21784
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Suresh Thalamati
>Priority: Major
>
> Currently Spark SQL does not have DDL support to define primary key and 
> foreign key constraints. This Jira is to add DDL support for defining primary 
> key and foreign key informational constraints using ALTER TABLE syntax. These 
> constraints will be used in query optimization; you can find more details 
> about this in the spec in SPARK-19842.
> *Syntax :*
> {code}
> ALTER TABLE [db_name.]table_name ADD [CONSTRAINT constraintName]
>   (PRIMARY KEY (col_names) |
>   FOREIGN KEY (col_names) REFERENCES [db_name.]table_name [(col_names)])
>   [VALIDATE | NOVALIDATE] [RELY | NORELY]
> {code}
> Examples:
> {code:sql}
> ALTER TABLE employee ADD CONSTRAINT pk PRIMARY KEY(empno) VALIDATE RELY
> ALTER TABLE department ADD CONSTRAINT emp_fk FOREIGN KEY (mgrno) REFERENCES 
> employee(empno) NOVALIDATE NORELY
> {code}
> *Constraint name generated by the system:*
> {code:sql}
> ALTER TABLE department ADD PRIMARY KEY(deptno) VALIDATE RELY
> ALTER TABLE employee ADD FOREIGN KEY (workdept) REFERENCES department(deptno) 
> VALIDATE RELY;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30689) Allow custom resource scheduling to work with YARN versions that don't support custom resource scheduling

2020-05-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111814#comment-17111814
 ] 

Apache Spark commented on SPARK-30689:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/28591

> Allow custom resource scheduling to work with YARN versions that don't 
> support custom resource scheduling
> -
>
> Key: SPARK-30689
> URL: https://issues.apache.org/jira/browse/SPARK-30689
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.0
>
>
> Many people/companies will not be moving soon to Hadoop 3.1 or greater, which 
> supports custom resource scheduling for things like GPUs, and they have 
> requested support for it on older Hadoop 2.x versions. This also means that 
> they may not have isolation enabled, which is what the default behavior relies 
> on.
> Right now the option is to write a custom discovery script and handle it on 
> their own. This is OK but has some limitations because the script runs as a 
> separate process, and it is also just a shell script.
> I think we can make this a lot more flexible by making the entire resource 
> discovery class pluggable. The default one would stay as is and call the 
> discovery script, but if advanced users wanted to replace the entire thing, 
> they could implement a pluggable class in which they write custom code for 
> how to discover resource addresses.
> This will also help users who are running Hadoop 3.1.x or greater but 
> don't have the resources configured or aren't running in an isolated 
> environment.
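>
> Purely as an illustration of the pluggable-class idea, a minimal Scala sketch; the 
> trait name and method signature here are assumptions for discussion, not an 
> existing Spark interface:
> {code:scala}
> import org.apache.spark.SparkConf
>
> case class DiscoveredResource(name: String, addresses: Seq[String])
>
> // Hypothetical plugin point: anything that can report the addresses of a
> // resource (e.g. "gpu") visible on the current host.
> trait ResourceDiscovery {
>   def discover(resourceName: String, conf: SparkConf): DiscoveredResource
> }
>
> // The default could keep today's behavior: shell out to the configured
> // discovery script, which prints JSON like {"name": "gpu", "addresses": ["0", "1"]}.
> class ScriptBasedDiscovery extends ResourceDiscovery {
>   override def discover(resourceName: String, conf: SparkConf): DiscoveredResource = {
>     val script = conf.get(s"spark.executor.resource.$resourceName.discoveryScript")
>     val output = scala.sys.process.Process(Seq(script)).!!
>     // JSON parsing of `output` elided for brevity in this sketch.
>     DiscoveredResource(resourceName, Seq.empty)
>   }
> }
> {code}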



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31693) Investigate AmpLab Jenkins server network issue

2020-05-20 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111807#comment-17111807
 ] 

Hyukjin Kwon commented on SPARK-31693:
--

Awesome!!

> Investigate AmpLab Jenkins server network issue
> ---
>
> Key: SPARK-31693
> URL: https://issues.apache.org/jira/browse/SPARK-31693
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Critical
>
> Given the series of failures in the Spark packaging Jenkins job, it seems that 
> there is a network issue in the AmpLab Jenkins cluster.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay.
> - The node failed to download from the Maven mirror. (SPARK-31691) -> The primary 
> host is okay.
> - The node failed to communicate with repository.apache.org. (Current master 
> branch Jenkins job failure)
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) 
> on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve 
> remote metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could 
> not transfer metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to 
> apache.snapshots.https 
> (https://repository.apache.org/content/repositories/snapshots): Transfer 
> failed for 
> https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml:
>  Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] 
> failed: Connection timed out (Connection timed out) -> [Help 1]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org