[jira] [Created] (SPARK-32518) CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-02 Thread wuyi (Jira)
wuyi created SPARK-32518:


 Summary: CoarseGrainedSchedulerBackend.maxNumConcurrentTasks 
should consider all kinds of resources
 Key: SPARK-32518
 URL: https://issues.apache.org/jira/browse/SPARK-32518
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: wuyi


Currently, CoarseGrainedSchedulerBackend.maxNumConcurrentTasks only considers 
CPUs when computing the maximum number of concurrent tasks. This can cause the 
application to hang when a barrier stage requires extra custom resources but the 
cluster doesn't have enough of them: because maxNumConcurrentTasks does not check 
the other custom resources, the barrier stage can be submitted to 
TaskSchedulerImpl, but TaskSchedulerImpl cannot launch tasks for the barrier 
stage due to the insufficient task slots computed by calculateAvailableSlots 
(which does check all kinds of resources). 
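For illustration, a minimal sketch (names and signatures assumed, not Spark's actual code) of a slot count that honours every resource kind rather than CPUs alone:

{code:java}
// Hypothetical sketch: the number of task slots a worker offers is bounded by
// its scarcest resource, not just by its CPU cores.
case class Offer(cores: Int, resources: Map[String, Long])

def maxConcurrentTasks(
    offers: Seq[Offer],
    cpusPerTask: Int,
    resourcesPerTask: Map[String, Long]): Int = {
  offers.map { offer =>
    val cpuSlots = offer.cores / cpusPerTask
    // Every custom resource (e.g. "gpu") also caps the slots on this worker.
    val resourceSlots = resourcesPerTask.map { case (name, perTask) =>
      (offer.resources.getOrElse(name, 0L) / perTask).toInt
    }
    (cpuSlots +: resourceSlots.toSeq).min
  }.sum
}
{code}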






[jira] [Updated] (SPARK-32495) Update jackson-databind versions to fix various vulnerabilities.

2020-08-02 Thread Prashant Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-32495:

Summary: Update jackson-databind versions to fix various vulnerabilities.  
(was: Update jackson versions from 2.4.6 and so on(2.4.x))

> Update jackson-databind versions to fix various vulnerabilities.
> 
>
> Key: SPARK-32495
> URL: https://issues.apache.org/jira/browse/SPARK-32495
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.4.6
>Reporter: SHOBHIT SHUKLA
>Priority: Major
>
> FasterXML jackson-databind version 2.6.7.3 is affected by CVE-2017-15095 and 
> CVE-2018-5968 ([https://nvd.nist.gov/vuln/detail/CVE-2018-5968]). Would it be 
> possible to upgrade the Jackson version for spark-2.4.6 and later 2.4.x releases?
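
As a hedged illustration only (the patched version shown is an assumption, not necessarily the one the Spark project will choose), an application built against Spark 2.4.x could pin a newer jackson-databind in its own build in the meantime:

{code:java}
// build.sbt (hypothetical stop-gap in a user application, not a Spark change):
// force a patched jackson-databind onto the classpath ahead of the one that
// Spark 2.4.x ships. The version shown is an assumption.
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.6.7.4"
{code}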






[jira] [Commented] (SPARK-32517) Add StorageLevel.DISK_ONLY_3

2020-08-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169739#comment-17169739
 ] 

Apache Spark commented on SPARK-32517:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/29331

> Add StorageLevel.DISK_ONLY_3
> 
>
> Key: SPARK-32517
> URL: https://issues.apache.org/jira/browse/SPARK-32517
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to add `StorageLevel.DISK_ONLY_3` as a built-in StorageLevel.






[jira] [Assigned] (SPARK-32517) Add StorageLevel.DISK_ONLY_3

2020-08-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32517:


Assignee: (was: Apache Spark)

> Add StorageLevel.DISK_ONLY_3
> 
>
> Key: SPARK-32517
> URL: https://issues.apache.org/jira/browse/SPARK-32517
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to add `StorageLevel.DISK_ONLY_3` as a built-in StorageLevel.






[jira] [Assigned] (SPARK-32517) Add StorageLevel.DISK_ONLY_3

2020-08-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32517:


Assignee: Apache Spark

> Add StorageLevel.DISK_ONLY_3
> 
>
> Key: SPARK-32517
> URL: https://issues.apache.org/jira/browse/SPARK-32517
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> This issue aims to add `StorageLevel.DISK_ONLY_3` as a built-in StorageLevel.






[jira] [Commented] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join

2020-08-02 Thread fritz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169736#comment-17169736
 ] 

fritz commented on SPARK-31754:
---

Thanks for your response [~puviarasu].

Yes, batch processing does not hit the NPE and works fine.

Totally agreed, batch has higher latency. We already have a batch pipeline, so 
changing the streaming pipeline to batch is not an option for us.

What we are doing right now is simply re-running the job, and it works again, 
but the issue reappears whenever an NPE occurs and the job fails and gets 
terminated.

In case it is useful: we are running Spark 2.4.5 on EMR.

> Spark Structured Streaming: NullPointerException in Stream Stream join
> --
>
> Key: SPARK-31754
> URL: https://issues.apache.org/jira/browse/SPARK-31754
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark Version : 2.4.0
> Hadoop Version : 3.0.0
>Reporter: Puviarasu
>Priority: Major
>  Labels: structured-streaming
> Attachments: CodeGen.txt, Excpetion-3.0.0Preview2.txt, 
> Logical-Plan.txt
>
>
> When joining 2 streams with watermarking and windowing, we get a 
> NullPointerException after running for a few minutes. 
> After the failure we analyzed the checkpoint offsets/sources and found the files 
> for which the application failed. These files do not have any null values 
> in the join columns. 
> We even restarted the job with those files and the application ran. From this we 
> concluded that the exception is not caused by the data from the streams.
> *Code:*
>  
> {code:java}
> val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> 
> "1" )
>  val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> 
> "1" )
>  
> spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1")
>  
> spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2")
>  spark.sql("select * from source1 where eventTime1 is not null and col1 is 
> not null").withWatermark("eventTime1", "30 
> minutes").createTempView("viewNotNull1")
>  spark.sql("select * from source2 where eventTime2 is not null and col2 is 
> not null").withWatermark("eventTime2", "30 
> minutes").createTempView("viewNotNull2")
>  spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = 
> b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + 
> interval 2 hours").createTempView("join")
>  val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> 
> "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3")
>  spark.sql("select * from 
> join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 
> seconds")).format("parquet").options(optionsMap3).start()
> {code}
>  
> *Exception:*
>  
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
> Aborting TaskSet 4.0 because task 0 (partition 0)
> cannot run anywhere due to node and executor blacklist.
> Most recent failure:
> Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
> at 
> 

[jira] [Commented] (SPARK-32432) Add support for reading ORC/Parquet files with SymlinkTextInputFormat

2020-08-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169734#comment-17169734
 ] 

Apache Spark commented on SPARK-32432:
--

User 'moomindani' has created a pull request for this issue:
https://github.com/apache/spark/pull/29330

> Add support for reading ORC/Parquet files with SymlinkTextInputFormat
> -
>
> Key: SPARK-32432
> URL: https://issues.apache.org/jira/browse/SPARK-32432
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Noritaka Sekiyama
>Priority: Major
>
> Hive style symlink (SymlinkTextInputFormat) is commonly used in different 
> analytic engines including prestodb and prestosql.
> Currently SymlinkTextInputFormat works with JSON/CSV files but does not work 
> with ORC/Parquet files in Apache Spark (and Apache Hive).
> On the other hand, prestodb and prestosql support SymlinkTextInputFormat with 
> ORC/Parquet files.
> This issue is to add support for reading ORC/Parquet files with 
> SymlinkTextInputFormat in Apache Spark.
>  
> Related links
>  * Hive's SymlinkTextInputFormat: 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.java]
>  * prestosql's implementation to add support for reading avro files with 
> SymlinkTextInputFormat: 
> [https://github.com/vincentpoon/prestosql/blob/master/presto-hive/src/main/java/io/prestosql/plugin/hive/BackgroundHiveSplitLoader.java]
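
For illustration, a hedged sketch (paths assumed) of the manual workaround this feature would remove: resolving the symlink manifest by hand and reading the Parquet files it points to, instead of reading the symlink table directly:

{code:java}
// Hypothetical workaround today: a symlink table's location holds manifest
// files whose lines are paths to the real data files. Resolve them by hand,
// then read the Parquet files they point to.
val manifestDir = "/warehouse/events_symlink"                 // assumed path
val dataPaths = spark.read.textFile(manifestDir).collect()    // one data-file path per line
val df = spark.read.parquet(dataPaths: _*)
{code}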






[jira] [Created] (SPARK-32517) Add StorageLevel.DISK_ONLY_3

2020-08-02 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-32517:
-

 Summary: Add StorageLevel.DISK_ONLY_3
 Key: SPARK-32517
 URL: https://issues.apache.org/jira/browse/SPARK-32517
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun


This issue aims to add `StorageLevel.DISK_ONLY_3` as a built-in StorageLevel.
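
A hedged sketch of what this would look like to a user, assuming it follows the existing DISK_ONLY_2 pattern (the explicit constructor call below is something users can already build by hand today):

{code:java}
import org.apache.spark.storage.StorageLevel

// Today's closest built-in level: on disk only, replicated on 2 nodes.
val twoCopies = StorageLevel.DISK_ONLY_2

// A DISK_ONLY_3 would presumably follow the same pattern,
// StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication):
val threeCopies = StorageLevel(true, false, false, false, 3)

// rdd.persist(threeCopies)   // same behaviour while the named constant is pending
{code}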






[jira] [Commented] (SPARK-32432) Add support for reading ORC/Parquet files with SymlinkTextInputFormat

2020-08-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169733#comment-17169733
 ] 

Apache Spark commented on SPARK-32432:
--

User 'moomindani' has created a pull request for this issue:
https://github.com/apache/spark/pull/29330

> Add support for reading ORC/Parquet files with SymlinkTextInputFormat
> -
>
> Key: SPARK-32432
> URL: https://issues.apache.org/jira/browse/SPARK-32432
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Noritaka Sekiyama
>Priority: Major
>
> Hive style symlink (SymlinkTextInputFormat) is commonly used in different 
> analytic engines including prestodb and prestosql.
> Currently SymlinkTextInputFormat works with JSON/CSV files but does not work 
> with ORC/Parquet files in Apache Spark (and Apache Hive).
> On the other hand, prestodb and prestosql support SymlinkTextInputFormat with 
> ORC/Parquet files.
> This issue is to add support for reading ORC/Parquet files with 
> SymlinkTextInputFormat in Apache Spark.
>  
> Related links
>  * Hive's SymlinkTextInputFormat: 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.java]
>  * prestosql's implementation to add support for reading avro files with 
> SymlinkTextInputFormat: 
> [https://github.com/vincentpoon/prestosql/blob/master/presto-hive/src/main/java/io/prestosql/plugin/hive/BackgroundHiveSplitLoader.java]






[jira] [Assigned] (SPARK-32432) Add support for reading ORC/Parquet files with SymlinkTextInputFormat

2020-08-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32432:


Assignee: Apache Spark

> Add support for reading ORC/Parquet files with SymlinkTextInputFormat
> -
>
> Key: SPARK-32432
> URL: https://issues.apache.org/jira/browse/SPARK-32432
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Noritaka Sekiyama
>Assignee: Apache Spark
>Priority: Major
>
> Hive style symlink (SymlinkTextInputFormat) is commonly used in different 
> analytic engines including prestodb and prestosql.
> Currently SymlinkTextInputFormat works with JSON/CSV files but does not work 
> with ORC/Parquet files in Apache Spark (and Apache Hive).
> On the other hand, prestodb and prestosql support SymlinkTextInputFormat with 
> ORC/Parquet files.
> This issue is to add support for reading ORC/Parquet files with 
> SymlinkTextInputFormat in Apache Spark.
>  
> Related links
>  * Hive's SymlinkTextInputFormat: 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.java]
>  * prestosql's implementation to add support for reading avro files with 
> SymlinkTextInputFormat: 
> [https://github.com/vincentpoon/prestosql/blob/master/presto-hive/src/main/java/io/prestosql/plugin/hive/BackgroundHiveSplitLoader.java]






[jira] [Assigned] (SPARK-32432) Add support for reading ORC/Parquet files with SymlinkTextInputFormat

2020-08-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32432:


Assignee: (was: Apache Spark)

> Add support for reading ORC/Parquet files with SymlinkTextInputFormat
> -
>
> Key: SPARK-32432
> URL: https://issues.apache.org/jira/browse/SPARK-32432
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Noritaka Sekiyama
>Priority: Major
>
> Hive style symlink (SymlinkTextInputFormat) is commonly used in different 
> analytic engines including prestodb and prestosql.
> Currently SymlinkTextInputFormat works with JSON/CSV files but does not work 
> with ORC/Parquet files in Apache Spark (and Apache Hive).
> On the other hand, prestodb and prestosql support SymlinkTextInputFormat with 
> ORC/Parquet files.
> This issue is to add support for reading ORC/Parquet files with 
> SymlinkTextInputFormat in Apache Spark.
>  
> Related links
>  * Hive's SymlinkTextInputFormat: 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.java]
>  * prestosql's implementation to add support for reading avro files with 
> SymlinkTextInputFormat: 
> [https://github.com/vincentpoon/prestosql/blob/master/presto-hive/src/main/java/io/prestosql/plugin/hive/BackgroundHiveSplitLoader.java]






[jira] [Commented] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join

2020-08-02 Thread Puviarasu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169727#comment-17169727
 ] 

Puviarasu commented on SPARK-31754:
---

Hello [~fritzwijaya], 

We still have the issue as of today [2020-08-03] with the Stream-Stream join 
in Spark Structured Streaming. 

*Our Workaround:* Batch processing. We have moved our business logic from the 
problematic Stream-Stream join in Spark Structured Streaming to an equivalent 
Spark batch job. The batch workaround has higher latency than Spark Structured 
Streaming, but it runs stably. 

Once the issue is fixed by the Spark community, we will replace the batch 
workaround with the desired Spark Structured Streaming Stream-Stream join.

Thank you. 

CC: [~kabhwan]
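
A minimal sketch of that batch equivalent, assuming the paths, columns, and join condition from the streaming code quoted below:

{code:java}
import org.apache.spark.sql.functions.expr

// Hypothetical batch rewrite of the stream-stream join (paths assumed).
val s1 = spark.read.parquet("/path/to/source1")
  .where("eventTime1 is not null and col1 is not null")
val s2 = spark.read.parquet("/path/to/source2")
  .where("eventTime2 is not null and col2 is not null")

val joined = s1.as("a").join(s2.as("b"),
  expr("a.col1 = b.col2 and a.eventTime1 >= b.eventTime2 " +
       "and a.eventTime1 <= b.eventTime2 + interval 2 hours"))

joined.write.mode("overwrite").option("compression", "snappy").parquet("/path/to/sink")
{code}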

> Spark Structured Streaming: NullPointerException in Stream Stream join
> --
>
> Key: SPARK-31754
> URL: https://issues.apache.org/jira/browse/SPARK-31754
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark Version : 2.4.0
> Hadoop Version : 3.0.0
>Reporter: Puviarasu
>Priority: Major
>  Labels: structured-streaming
> Attachments: CodeGen.txt, Excpetion-3.0.0Preview2.txt, 
> Logical-Plan.txt
>
>
> When joining 2 streams with watermarking and windowing, we get a 
> NullPointerException after running for a few minutes. 
> After the failure we analyzed the checkpoint offsets/sources and found the files 
> for which the application failed. These files do not have any null values 
> in the join columns. 
> We even restarted the job with those files and the application ran. From this we 
> concluded that the exception is not caused by the data from the streams.
> *Code:*
>  
> {code:java}
> val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> 
> "1" )
>  val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> 
> "1" )
>  
> spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1")
>  
> spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2")
>  spark.sql("select * from source1 where eventTime1 is not null and col1 is 
> not null").withWatermark("eventTime1", "30 
> minutes").createTempView("viewNotNull1")
>  spark.sql("select * from source2 where eventTime2 is not null and col2 is 
> not null").withWatermark("eventTime2", "30 
> minutes").createTempView("viewNotNull2")
>  spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = 
> b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + 
> interval 2 hours").createTempView("join")
>  val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> 
> "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3")
>  spark.sql("select * from 
> join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 
> seconds")).format("parquet").options(optionsMap3).start()
> {code}
>  
> *Exception:*
>  
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
> Aborting TaskSet 4.0 because task 0 (partition 0)
> cannot run anywhere due to node and executor blacklist.
> Most recent failure:
> Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
> at 
> 

[jira] [Resolved] (SPARK-32509) Unused DPP Filter causes issue in canonicalization and prevents reuse exchange

2020-08-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32509.
-
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29318
[https://github.com/apache/spark/pull/29318]

> Unused DPP Filter causes issue in canonicalization and prevents reuse exchange
> --
>
> Key: SPARK-32509
> URL: https://issues.apache.org/jira/browse/SPARK-32509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Prakhar Jain
>Assignee: Prakhar Jain
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> As part of the PlanDynamicPruningFilter rule, unused DPP filters are simply 
> replaced by `DynamicPruningExpression(TrueLiteral)` so that they can be 
> avoided. But these unnecessary `DynamicPruningExpression(TrueLiteral)` 
> partition filters inside the FileSourceScanExec affect the canonicalization 
> of the node, and so in many cases this can prevent ReuseExchange from 
> happening.
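
A minimal sketch of the remedy implied here (a hypothetical helper, not necessarily the merged change): ignore the trivially-true pruning expressions before the scan's partition filters take part in canonicalization, so two otherwise-identical scans compare equal and the exchange above them can be reused.

{code:java}
import org.apache.spark.sql.catalyst.expressions.{DynamicPruningExpression, Expression, Literal}

// Hypothetical helper: an unused DPP filter was rewritten to a constant-true
// literal, so it carries no pruning information and can be dropped when the
// FileSourceScanExec node is canonicalized.
def withoutUnusedDppFilters(partitionFilters: Seq[Expression]): Seq[Expression] =
  partitionFilters.filterNot {
    case DynamicPruningExpression(Literal.TrueLiteral) => true
    case _ => false
  }
{code}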






[jira] [Assigned] (SPARK-32509) Unused DPP Filter causes issue in canonicalization and prevents reuse exchange

2020-08-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32509:
---

Assignee: Prakhar Jain

> Unused DPP Filter causes issue in canonicalization and prevents reuse exchange
> --
>
> Key: SPARK-32509
> URL: https://issues.apache.org/jira/browse/SPARK-32509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Prakhar Jain
>Assignee: Prakhar Jain
>Priority: Major
>
> As part of the PlanDynamicPruningFilter rule, unused DPP filters are simply 
> replaced by `DynamicPruningExpression(TrueLiteral)` so that they can be 
> avoided. But these unnecessary `DynamicPruningExpression(TrueLiteral)` 
> partition filters inside the FileSourceScanExec affect the canonicalization 
> of the node, and so in many cases this can prevent ReuseExchange from 
> happening.






[jira] [Resolved] (SPARK-32510) JDBC doesn't check duplicate column names in nested structures

2020-08-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32510.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29317
[https://github.com/apache/spark/pull/29317]

> JDBC doesn't check duplicate column names in nested structures 
> ---
>
> Key: SPARK-32510
> URL: https://issues.apache.org/jira/browse/SPARK-32510
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> JdbcUtils.getCustomSchema calls checkColumnNameDuplication(), which checks for 
> duplicates at the top level but not in nested structures, as other built-in 
> data sources do; see
> [https://github.com/apache/spark/blob/8bc799f92005c903868ef209f5aec8deb6ccce5a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L822-L823]
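
A hedged repro sketch of the gap (connection details assumed): a user-supplied customSchema whose nested struct repeats a field name would slip past the top-level-only check.

{code:java}
// Hypothetical example: duplicate field names inside the nested struct are not
// caught by the top-level checkColumnNameDuplication() call described above.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:h2:mem:testdb")   // assumed test database
  .option("dbtable", "people")
  .option("customSchema", "id INT, info STRUCT<name: STRING, name: STRING>")
  .load()
{code}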






[jira] [Assigned] (SPARK-32510) JDBC doesn't check duplicate column names in nested structures

2020-08-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32510:
---

Assignee: Maxim Gekk

> JDBC doesn't check duplicate column names in nested structures 
> ---
>
> Key: SPARK-32510
> URL: https://issues.apache.org/jira/browse/SPARK-32510
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> JdbcUtils.getCustomSchema calls checkColumnNameDuplication(), which checks for 
> duplicates at the top level but not in nested structures, as other built-in 
> data sources do; see
> [https://github.com/apache/spark/blob/8bc799f92005c903868ef209f5aec8deb6ccce5a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L822-L823]






[jira] [Resolved] (SPARK-32274) Add in the ability for a user to replace the serialization format of the cache

2020-08-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32274.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29067
[https://github.com/apache/spark/pull/29067]

> Add in the ability for a user to replace the serialization format of the cache
> --
>
> Key: SPARK-32274
> URL: https://issues.apache.org/jira/browse/SPARK-32274
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
>Priority: Major
> Fix For: 3.1.0
>
>
> Caching a dataset or dataframe can be a very expensive operation, but it has a 
> huge benefit for later queries that use it.  There are many use cases that 
> could benefit from caching the data, but not enough to justify the current 
> scheme.  I would like to propose that we make the serialization of the 
> cache pluggable.  That way users can explore other formats and compression 
> codecs.
>  
> As an example I took the lineitem table from TPC-H at a scale factor of 10 
> and converted it to Parquet.  This resulted in 2.1 GB of data on disk. With 
> the current caching it can take nearly 8 GB to store that same data in 
> memory, and about 5 GB to store it on disk.
>  
> If I want to read all of that data and write it out again:
> {code:java}
> scala> val a = spark.read.parquet("../data/tpch/SF10_parquet/lineitem.tbl/")
> a: org.apache.spark.sql.DataFrame = [l_orderkey: bigint, l_partkey: bigint 
> ... 14 more fields]
> scala> spark.time(a.write.mode("overwrite").parquet("./target/tmp"))
> Time taken: 25832 ms {code}
> But a query that reads that data directly from the cache, after it is built, 
> only takes 21531 ms. For some queries, being able to fit much more data in 
> the cache might be worth the extra query time.
>  
> It also takes a lot less time to do the Parquet compression than it 
> does to do the cache compression.
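
Presumably the cached-read figure quoted above was measured along these lines (a sketch; the exact query used is an assumption):

{code:java}
// Hypothetical continuation of the example above: materialize the cache once,
// then time the same write when it is served from the cache.
val cached = a.cache()
cached.count()                       // force the cache to be built
spark.time(cached.write.mode("overwrite").parquet("./target/tmp"))
{code}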






[jira] [Assigned] (SPARK-32274) Add in the ability for a user to replace the serialization format of the cache

2020-08-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32274:
---

Assignee: Robert Joseph Evans

> Add in the ability for a user to replace the serialization format of the cache
> --
>
> Key: SPARK-32274
> URL: https://issues.apache.org/jira/browse/SPARK-32274
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
>Priority: Major
>
> Caching a dataset or dataframe can be a very expensive operation, but it has a 
> huge benefit for later queries that use it.  There are many use cases that 
> could benefit from caching the data, but not enough to justify the current 
> scheme.  I would like to propose that we make the serialization of the 
> cache pluggable.  That way users can explore other formats and compression 
> codecs.
>  
> As an example I took the lineitem table from TPC-H at a scale factor of 10 
> and converted it to Parquet.  This resulted in 2.1 GB of data on disk. With 
> the current caching it can take nearly 8 GB to store that same data in 
> memory, and about 5 GB to store it on disk.
>  
> If I want to read all of that data and write it out again:
> {code:java}
> scala> val a = spark.read.parquet("../data/tpch/SF10_parquet/lineitem.tbl/")
> a: org.apache.spark.sql.DataFrame = [l_orderkey: bigint, l_partkey: bigint 
> ... 14 more fields]
> scala> spark.time(a.write.mode("overwrite").parquet("./target/tmp"))
> Time taken: 25832 ms {code}
> But a query that reads that data directly from the cache, after it is built, 
> only takes 21531 ms. For some queries, being able to fit much more data in 
> the cache might be worth the extra query time.
>  
> It also takes a lot less time to do the Parquet compression than it 
> does to do the cache compression.






[jira] [Commented] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join

2020-08-02 Thread fritz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169677#comment-17169677
 ] 

fritz commented on SPARK-31754:
---

Hi [~puviarasu] [~kabhwan], we recently hit a similar NPE when doing a 
stream-stream join. It throws the same exception as the log that [~puviarasu] 
shared above.

The only difference in my case is that the source is Kafka. Other than that it 
is the same.

We have checked and ensured that the join keys are not null.

Any advice? Thanks

> Spark Structured Streaming: NullPointerException in Stream Stream join
> --
>
> Key: SPARK-31754
> URL: https://issues.apache.org/jira/browse/SPARK-31754
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark Version : 2.4.0
> Hadoop Version : 3.0.0
>Reporter: Puviarasu
>Priority: Major
>  Labels: structured-streaming
> Attachments: CodeGen.txt, Excpetion-3.0.0Preview2.txt, 
> Logical-Plan.txt
>
>
> When joining 2 streams with watermarking and windowing, we get a 
> NullPointerException after running for a few minutes. 
> After the failure we analyzed the checkpoint offsets/sources and found the files 
> for which the application failed. These files do not have any null values 
> in the join columns. 
> We even restarted the job with those files and the application ran. From this we 
> concluded that the exception is not caused by the data from the streams.
> *Code:*
>  
> {code:java}
> val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> 
> "1" )
>  val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", 
> "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" 
> ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> 
> "1" )
>  
> spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1")
>  
> spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2")
>  spark.sql("select * from source1 where eventTime1 is not null and col1 is 
> not null").withWatermark("eventTime1", "30 
> minutes").createTempView("viewNotNull1")
>  spark.sql("select * from source2 where eventTime2 is not null and col2 is 
> not null").withWatermark("eventTime2", "30 
> minutes").createTempView("viewNotNull2")
>  spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = 
> b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + 
> interval 2 hours").createTempView("join")
>  val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> 
> "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3")
>  spark.sql("select * from 
> join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 
> seconds")).format("parquet").options(optionsMap3).start()
> {code}
>  
> *Exception:*
>  
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
> Aborting TaskSet 4.0 because task 0 (partition 0)
> cannot run anywhere due to node and executor blacklist.
> Most recent failure:
> Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221)
> at 
> org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
> at 
> org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply$mcV$sp(StreamingSymmetricHashJoinExec.scala:338)
> at 
> 

[jira] [Commented] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169673#comment-17169673
 ] 

Takeshi Yamamuro commented on SPARK-32515:
--

Thanks for the report. I updated some fields in this ticket (e.g., removed the 
`Blocker` priority) because we still need to look into the root cause of this 
issue.

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Major
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> image-2020-08-03-07-03-55-716.png, unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and try to check all the distinct values of a column in a dataframe, everything 
> I try in Spark gives a wrong answer. But if I convert my Spark dataframe into a 
> pandas dataframe, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. 
> Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!






[jira] [Updated] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32515:
-
Component/s: (was: PySpark)
 SQL

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Major
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> image-2020-08-03-07-03-55-716.png, unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and try to check all the distinct values of a column in a dataframe, everything 
> I try in Spark gives a wrong answer. But if I convert my Spark dataframe into a 
> pandas dataframe, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. 
> Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!






[jira] [Updated] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32515:
-
Target Version/s:   (was: 2.4.6)

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Major
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> image-2020-08-03-07-03-55-716.png, unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and try to check all the distinct values of a column in a dataframe, everything 
> I try in Spark gives a wrong answer. But if I convert my Spark dataframe into a 
> pandas dataframe, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. 
> Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!






[jira] [Updated] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32515:
-
Labels:   (was: distinct groupby load read)

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Major
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> image-2020-08-03-07-03-55-716.png, unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and try to check all the distinct values of a column in a dataframe, everything 
> I try in Spark gives a wrong answer. But if I convert my Spark dataframe into a 
> pandas dataframe, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. 
> Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!






[jira] [Updated] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32515:
-
Priority: Major  (was: Blocker)

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Major
>  Labels: distinct, groupby, load, read
> Fix For: 2.4.6
>
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> image-2020-08-03-07-03-55-716.png, unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and try to check all the distinct values of a column in a dataframe, everything 
> I try in Spark gives a wrong answer. But if I convert my Spark dataframe into a 
> pandas dataframe, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. 
> Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!






[jira] [Updated] (SPARK-28818) FrequentItems applies an incorrect schema to the resulting dataframe when nulls are present

2020-08-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-28818:
-
Fix Version/s: 2.4.7

> FrequentItems applies an incorrect schema to the resulting dataframe when 
> nulls are present
> ---
>
> Key: SPARK-28818
> URL: https://issues.apache.org/jira/browse/SPARK-28818
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Matt Hawes
>Assignee: Matt Hawes
>Priority: Minor
> Fix For: 2.4.7, 3.0.0
>
>
> A trivially reproducible bug in the code for `FrequentItems`. The schema for 
> the resulting arrays of frequent items is [hard coded|#L122] to have 
> non-nullable array elements:
> {code:scala}
> val outputCols = colInfo.map { v =>
> StructField(v._1 + "_freqItems", ArrayType(v._2, false))
>  }
>  val schema = StructType(outputCols).toAttributes
>  Dataset.ofRows(df.sparkSession, LocalRelation.fromExternalRows(schema, 
> Seq(resultRow)))
> {code}
>  
> However, if the column contains frequent nulls, then these nulls are included 
> in the frequent-items array. This results in various errors, such as any 
> attempt to `collect()` failing with a null pointer exception:
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.Builder().getOrCreate()
> df = spark.createDataFrame([
>     (1, 'a'),
>     (2, None),
>     (3, 'b'),
> ], schema="id INTEGER, val STRING")
> rows = df.freqItems(df.columns).collect()
> {code}
>  Results in:
> {code:java}
> Traceback (most recent call last):                                            
>   
>   File "", line 1, in 
>   File 
> "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/dataframe.py", 
> line 533, in collect
>     sock_info = self._jdf.collectToPython()
>   File 
> "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/utils.py", line 
> 63, in deco
>     return f(*a, **kw)
>   File 
> "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o40.collectToPython.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:296)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows$lzycompute(LocalTableScanExec.scala:44)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows(LocalTableScanExec.scala:39)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.executeCollect(LocalTableScanExec.scala:70)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3257)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3254)
>   at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
>   at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3254)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at 
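
The presumable remedy (a sketch, not necessarily the merged patch) is to declare the array elements nullable so that a frequent null can be written:

{code:java}
import org.apache.spark.sql.types.{ArrayType, StructField}

// Mirrors the hard-coded snippet quoted above, but allows null elements:
// a frequently occurring null legitimately shows up in the frequent-items
// array, so the schema must not forbid it.
val outputCols = colInfo.map { v =>
  StructField(v._1 + "_freqItems", ArrayType(v._2, containsNull = true))
}
{code}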

[jira] [Resolved] (SPARK-32490) Upgrade netty-all to 4.1.51.Final

2020-08-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32490.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29299
[https://github.com/apache/spark/pull/29299]

> Upgrade netty-all to 4.1.51.Final
> -
>
> Key: SPARK-32490
> URL: https://issues.apache.org/jira/browse/SPARK-32490
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.1.0
>
>
> Upgrade io.netty:netty-all to 4.1.51.Final to fix some bugs.






[jira] [Assigned] (SPARK-32490) Upgrade netty-all to 4.1.51.Final

2020-08-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32490:
-

Assignee: Yang Jie

> Upgrade netty-all to 4.1.51.Final
> -
>
> Key: SPARK-32490
> URL: https://issues.apache.org/jira/browse/SPARK-32490
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> Upgrade io.netty:netty-all to 4.1.51.Final to fix some bugs.






[jira] [Commented] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread Jayce Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169658#comment-17169658
 ] 

Jayce Jiang commented on SPARK-32515:
-

https://drive.google.com/file/d/1542jrV1-qXiYYmmc-v3ov88o-mkYlS_R/view?usp=sharing

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Blocker
>  Labels: distinct, groupby, load, read
> Fix For: 2.4.6
>
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> image-2020-08-03-07-03-55-716.png, unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and try to check all the distinct values of a column in a dataframe, everything 
> I try in Spark gives a wrong answer. But if I convert my Spark dataframe into a 
> pandas dataframe, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. 
> Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!






[jira] [Commented] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread Jayce Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169657#comment-17169657
 ] 

Jayce Jiang commented on SPARK-32515:
-

Hey, here is some sample data. Please give it a try.

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Blocker
>  Labels: distinct, groupby, load, read
> Fix For: 2.4.6
>
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> image-2020-08-03-07-03-55-716.png, unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and try to check all the distinct values of a column in a dataframe, everything 
> I try in Spark gives a wrong answer. But if I convert my Spark dataframe into a 
> pandas dataframe, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. 
> Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!






[jira] [Comment Edited] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169653#comment-17169653
 ] 

JinxinTang edited comment on SPARK-32515 at 8/2/20, 11:09 PM:
--

Hi [~tigaiii123], 

Could you please try the data as follow or could you provide the sample data.
{code:java}
username,age
'xiaoming',11
"['xiaohong','huahua']",12
{code}
Seems ok in my side, same result can get from 2.4.6 and 3.0.0:

!image-2020-08-03-07-03-55-716.png!


was (Author: jinxintang):
Hi [~tigaiii123], 

Could you please try the data or could you provide the sample data.
{code:java}
username,age
'xiaoming',11
"['xiaohong','huahua']",12
{code}
Seems ok in my side, same result can get from 2.4.6 and 3.0.0:

!image-2020-08-03-07-03-55-716.png!

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Blocker
>  Labels: distinct, groupby, load, read
> Fix For: 2.4.6
>
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> image-2020-08-03-07-03-55-716.png, unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and try to check all the distinct values of a column in a dataframe, everything 
> I try in Spark gives a wrong answer. But if I convert my Spark dataframe into a 
> pandas dataframe, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. 
> Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!






[jira] [Comment Edited] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169653#comment-17169653
 ] 

JinxinTang edited comment on SPARK-32515 at 8/2/20, 11:08 PM:
--

Hi [~tigaiii123], 

Could you please try the data or could you provide the sample data.
{code:java}
username,age
'xiaoming',11
"['xiaohong','huahua']",12
{code}
Seems ok in my side, same result can get from 2.6.4 and 3.0.0:

!image-2020-08-03-07-03-55-716.png!


was (Author: jinxintang):
Hi [~tigaiii123], 

Could you please try the data or could you provide the sample data.
{code:java}
username,age
'xiaoming',11
"['xiaohong','huahua']",12
{code}
Seems ok in my side, both result can get from 2.6.4 and 3.0.0:

!image-2020-08-03-07-03-55-716.png!

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Blocker
>  Labels: distinct, groupby, load, read
> Fix For: 2.4.6
>
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> image-2020-08-03-07-03-55-716.png, unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and check all the distinct values of a column in a DataFrame, everything I try 
> in Spark gives a wrong answer. But if I convert my Spark DataFrame into a 
> pandas DataFrame, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169653#comment-17169653
 ] 

JinxinTang edited comment on SPARK-32515 at 8/2/20, 11:08 PM:
--

Hi [~tigaiii123], 

Could you please try the data or could you provide the sample data.
{code:java}
username,age
'xiaoming',11
"['xiaohong','huahua']",12
{code}
Seems ok in my side, same result can get from 2.4.6 and 3.0.0:

!image-2020-08-03-07-03-55-716.png!


was (Author: jinxintang):
Hi [~tigaiii123], 

Could you please try the data or could you provide the sample data.
{code:java}
username,age
'xiaoming',11
"['xiaohong','huahua']",12
{code}
Seems ok in my side, same result can get from 2.6.4 and 3.0.0:

!image-2020-08-03-07-03-55-716.png!

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Blocker
>  Labels: distinct, groupby, load, read
> Fix For: 2.4.6
>
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> image-2020-08-03-07-03-55-716.png, unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and check all the distinct values of a column in a DataFrame, everything I try 
> in Spark gives a wrong answer. But if I convert my Spark DataFrame into a 
> pandas DataFrame, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169653#comment-17169653
 ] 

JinxinTang commented on SPARK-32515:


Hi [~tigaiii123], 

Could you please try the data or could you provide the sample data.
{code:java}
username,age
'xiaoming',11
"['xiaohong','huahua']",12
{code}
Seems ok in my side, both result can get from 2.6.4 and 3.0.0:

!image-2020-08-03-07-03-55-716.png!

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Blocker
>  Labels: distinct, groupby, load, read
> Fix For: 2.4.6
>
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> image-2020-08-03-07-03-55-716.png, unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and check all the distinct values of a column in a DataFrame, everything I try 
> in Spark gives a wrong answer. But if I convert my Spark DataFrame into a 
> pandas DataFrame, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread JinxinTang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JinxinTang updated SPARK-32515:
---
Attachment: image-2020-08-03-07-03-55-716.png

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Blocker
>  Labels: distinct, groupby, load, read
> Fix For: 2.4.6
>
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> image-2020-08-03-07-03-55-716.png, unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and check all the distinct values of a column in a DataFrame, everything I try 
> in Spark gives a wrong answer. But if I convert my Spark DataFrame into a 
> pandas DataFrame, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32516) path option is treated differently for 'format("parquet").load(path)' vs. 'parquet(path)'

2020-08-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169639#comment-17169639
 ] 

Apache Spark commented on SPARK-32516:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/29328

> path option is treated differently for 'format("parquet").load(path)' vs. 
> 'parquet(path)'
> -
>
> Key: SPARK-32516
> URL: https://issues.apache.org/jira/browse/SPARK-32516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Terry Kim
>Priority: Minor
>
> When data is read, the "path" option is treated differently depending on how 
> the dataframe is created:
> {code:java}
> scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")
>   
>   
> scala> spark.read.option("path", 
> "/tmp/test").format("parquet").load("/tmp/test").show
> +-+
> |value|
> +-+
> |1|
> +-+
> scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
> +-+
> |value|
> +-+
> |1|
> |1|
> +-+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32516) path option is treated differently for 'format("parquet").load(path)' vs. 'parquet(path)'

2020-08-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169638#comment-17169638
 ] 

Apache Spark commented on SPARK-32516:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/29328

> path option is treated differently for 'format("parquet").load(path)' vs. 
> 'parquet(path)'
> -
>
> Key: SPARK-32516
> URL: https://issues.apache.org/jira/browse/SPARK-32516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Terry Kim
>Priority: Minor
>
> When data is read, the "path" option is treated differently depending on how 
> the dataframe is created:
> {code:java}
> scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")
>   
>   
> scala> spark.read.option("path", 
> "/tmp/test").format("parquet").load("/tmp/test").show
> +-+
> |value|
> +-+
> |1|
> +-+
> scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
> +-+
> |value|
> +-+
> |1|
> |1|
> +-+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32516) path option is treated differently for 'format("parquet").load(path)' vs. 'parquet(path)'

2020-08-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32516:


Assignee: Apache Spark

> path option is treated differently for 'format("parquet").load(path)' vs. 
> 'parquet(path)'
> -
>
> Key: SPARK-32516
> URL: https://issues.apache.org/jira/browse/SPARK-32516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Minor
>
> When data is read, the "path" option is treated differently depending on how 
> the dataframe is created:
> {code:java}
> scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")
>   
>   
> scala> spark.read.option("path", 
> "/tmp/test").format("parquet").load("/tmp/test").show
> +-+
> |value|
> +-+
> |1|
> +-+
> scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
> +-+
> |value|
> +-+
> |1|
> |1|
> +-+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32516) path option is treated differently for 'format("parquet").load(path)' vs. 'parquet(path)'

2020-08-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32516:


Assignee: (was: Apache Spark)

> path option is treated differently for 'format("parquet").load(path)' vs. 
> 'parquet(path)'
> -
>
> Key: SPARK-32516
> URL: https://issues.apache.org/jira/browse/SPARK-32516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Terry Kim
>Priority: Minor
>
> When data is read, the "path" option is treated differently depending on how 
> the dataframe is created:
> {code:java}
> scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")
>   
>   
> scala> spark.read.option("path", 
> "/tmp/test").format("parquet").load("/tmp/test").show
> +-+
> |value|
> +-+
> |1|
> +-+
> scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
> +-+
> |value|
> +-+
> |1|
> |1|
> +-+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32516) path option is treated differently for 'format("parquet").load(path)' vs. 'parquet(path)'

2020-08-02 Thread Terry Kim (Jira)
Terry Kim created SPARK-32516:
-

 Summary: path option is treated differently for 
'format("parquet").load(path)' vs. 'parquet(path)'
 Key: SPARK-32516
 URL: https://issues.apache.org/jira/browse/SPARK-32516
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 2.4.6
Reporter: Terry Kim


When data is read, the "path" option is treated differently depending on how 
the dataframe is created:
{code:java}
scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")

scala> spark.read.option("path", 
"/tmp/test").format("parquet").load("/tmp/test").show
+-+
|value|
+-+
|1|
+-+


scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
+-+
|value|
+-+
|1|
|1|
+-+
{code}
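
For completeness, a hedged PySpark sketch of the same comparison, assuming the behavior mirrors the Scala shell session above (the duplication presumably comes from the "path" option and the explicit parquet(path) argument both being used as input paths):
{code:python}
# Hedged PySpark sketch of the same comparison as the Scala session above.
# Writes a single-row parquet dataset, then reads it back both ways.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(1,)], ["value"]).write.mode("overwrite").parquet("/tmp/test")

# load(path) plus the "path" option: one row is expected.
spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show()

# parquet(path) plus the "path" option: the same file appears to be read twice.
spark.read.option("path", "/tmp/test").parquet("/tmp/test").show()
{code}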



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32427) Omit USING in CREATE TABLE via JDBC Table Catalog

2020-08-02 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169610#comment-17169610
 ] 

L. C. Hsieh commented on SPARK-32427:
-

Do you mean that "CREATE TABLE .." without USING implies a JDBC table? That seems 
to conflict with Hive's CREATE TABLE syntax, which can omit "STORED AS".

> Omit USING in CREATE TABLE via JDBC Table Catalog
> -
>
> Key: SPARK-32427
> URL: https://issues.apache.org/jira/browse/SPARK-32427
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Support creating tables in JDBC Table Catalog without USING, for instance:
> {code:sql}
> CREATE TABLE h2.test.new_table(i INT, j STRING)
> {code}
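
For context, the h2.test namespace in the example assumes a JDBC Table Catalog registered under the name h2. A hedged sketch of that setup in PySpark follows; the catalog class, URL, and option names are assumptions based on the v2 JDBC Table Catalog work, not taken from this thread:
{code:python}
# Hedged sketch only: registering a JDBC Table Catalog named "h2" and issuing the
# CREATE TABLE from the description. Class, URL, and driver values are assumed
# illustrations, not confirmed by this thread.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.h2",
            "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
    .config("spark.sql.catalog.h2.url", "jdbc:h2:mem:testdb")
    .config("spark.sql.catalog.h2.driver", "org.h2.Driver")
    .getOrCreate()
)

# The proposal is for this statement to work without a USING clause.
spark.sql("CREATE TABLE h2.test.new_table(i INT, j STRING)")
{code}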



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22231) Support of map, filter, withField, dropFields in nested list of structures

2020-08-02 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-22231:

Summary: Support of map, filter, withField, dropFields in nested list of 
structures  (was: Support of map, filter, withColumn, dropColumn in nested list 
of structures)

> Support of map, filter, withField, dropFields in nested list of structures
> --
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find great 
> content that fulfills the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> holds the contexts describing the setting in which a model is to be used (e.g. 
> profiles, country, time, device, etc.). Here is a blog post describing the 
> details: [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranking of videos for a given profile_id in a 
> given country, our data schema can look like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some 
> functions to them. Sometimes we drop or add new columns in the nested list of 
> structs. Currently, there is no easy solution in open source Apache Spark to 
> perform those operations using SQL primitives; many people just convert the 
> data into an RDD to work on the nested level of data, and then reconstruct the 
> new dataframe as a workaround. This is extremely inefficient because 
> optimizations like predicate pushdown in SQL cannot be performed, we cannot 
> leverage the columnar format, and the serialization and deserialization cost 
> becomes really large even when we just want to add a new column at the nested 
> level.
> We built a solution internally at Netflix which we're very happy with, and we 
> plan to open source it in Spark upstream. We would like to socialize the API 
> design to see if we miss any use-cases.
> The first API we added is *mapItems* on a dataframe, which takes a function 
> from *Column* to *Column* and applies it to the nested dataframe. Here is an 
> example:
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+++
> // |foo| bar|   items|
> // +---+++
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+++
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace an existing 
> column with the same name in the nested list of structs. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces an existing column:
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: struct (containsNull = true)
> // |||-- a: integer (nullable = true)
> // |||-- b: double (nullable = true)
> result.show(false)
> // +---++--+
> // |foo|bar |items |
> // +---++--+
> // |10 

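As a point of comparison for the element-wise mapItems example in the SPARK-22231 description above, the built-in SQL higher-order function transform (available since Spark 2.4) already covers the simple case of mapping over a nested array. A minimal PySpark sketch of that built-in, not of the proposed Netflix API:
{code:python}
# Minimal sketch using the built-in higher-order function transform() through a
# SQL expression; this covers simple element-wise mapping only, not the
# withColumn-on-nested-structs API proposed in SPARK-22231.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(10, 10.0, [10.1, 10.2, 10.3, 10.4]),
     (20, 20.0, [20.1, 20.2, 20.3, 20.4])],
    ["foo", "bar", "items"],
)

result = df.withColumn("items", F.expr("transform(items, x -> x * 2.0)"))
result.show(truncate=False)
{code}
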
[jira] [Commented] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread Jayce Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169574#comment-17169574
 ] 

Jayce Jiang commented on SPARK-32515:
-

I re-uploaded the screenshots. As you can see, the CSV file loads into Spark, but 
any groupBy or distinct count on the "username" column in Spark produces weird 
groupings that include values with "[]" brackets, even though the entire column 
only contains strings without any brackets. However, if I convert it to pandas 
and run the pandas groupby and distinct operations, it works. I am not sure why 
only this CSV file fails; all my other CSV files work.
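
One common cause of this symptom is a CSV with quoted fields that contain embedded commas or newlines being split on the raw delimiters. A hedged sketch of read options worth trying (the file path is only an illustration):
{code:python}
# Hedged sketch: CSV read options that often resolve stray "[...]" fragments in a
# column when quoted fields contain embedded commas or newlines.
# The file path is an illustration, not the reporter's actual file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("header", True)
    .option("multiLine", True)   # allow quoted fields to span multiple lines
    .option("quote", '"')
    .option("escape", '"')       # treat doubled quotes inside a field as literal quotes
    .csv("/path/to/problem_file.csv")
)

df.select("username").distinct().show(truncate=False)
{code}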

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Blocker
>  Labels: distinct, groupby, load, read
> Fix For: 2.4.6
>
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, unknown.png, 
> unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and check all the distinct values of a column in a DataFrame, everything I try 
> in Spark gives a wrong answer. But if I convert my Spark DataFrame into a 
> pandas DataFrame, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread Jayce Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jayce Jiang updated SPARK-32515:

Attachment: Capture1.png

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Blocker
>  Labels: distinct, groupby, load, read
> Fix For: 2.4.6
>
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, unknown.png, 
> unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and check all the distinct values of a column in a DataFrame, everything I try 
> in Spark gives a wrong answer. But if I convert my Spark DataFrame into a 
> pandas DataFrame, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread Jayce Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jayce Jiang updated SPARK-32515:

Attachment: Capture2.PNG

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Blocker
>  Labels: distinct, groupby, load, read
> Fix For: 2.4.6
>
> Attachments: Capture.PNG, Capture1.png, Capture2.PNG, unknown.png, 
> unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and check all the distinct values of a column in a DataFrame, everything I try 
> in Spark gives a wrong answer. But if I convert my Spark DataFrame into a 
> pandas DataFrame, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread Jayce Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jayce Jiang updated SPARK-32515:

Attachment: Capture.PNG

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Blocker
>  Labels: distinct, groupby, load, read
> Fix For: 2.4.6
>
> Attachments: Capture.PNG, unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and check all the distinct values of a column in a DataFrame, everything I try 
> in Spark gives a wrong answer. But if I convert my Spark DataFrame into a 
> pandas DataFrame, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32515) Distinct Function Weird Bug

2020-08-02 Thread Jayce Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jayce Jiang updated SPARK-32515:

Attachment: unknown2.png
unknown1.png
unknown.png

> Distinct Function Weird Bug
> ---
>
> Key: SPARK-32515
> URL: https://issues.apache.org/jira/browse/SPARK-32515
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.6
> Environment: Window 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>Reporter: Jayce Jiang
>Priority: Blocker
>  Labels: distinct, groupby, load, read
> Fix For: 2.4.6
>
> Attachments: unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark 
> and check all the distinct values of a column in a DataFrame, everything I try 
> in Spark gives a wrong answer. But if I convert my Spark DataFrame into a 
> pandas DataFrame, it works. Please help. This bug only happens with this one 
> CSV file; all my other CSV files work properly. Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28818) FrequentItems applies an incorrect schema to the resulting dataframe when nulls are present

2020-08-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169501#comment-17169501
 ] 

Apache Spark commented on SPARK-28818:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29327

> FrequentItems applies an incorrect schema to the resulting dataframe when 
> nulls are present
> ---
>
> Key: SPARK-28818
> URL: https://issues.apache.org/jira/browse/SPARK-28818
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Matt Hawes
>Assignee: Matt Hawes
>Priority: Minor
> Fix For: 3.0.0
>
>
> A trivially reproducible bug in the code for `FrequentItems`. The schema for 
> the resulting arrays of frequent items is [hard coded|#L122]] to have 
> non-nullable array elements:
> {code:scala}
> val outputCols = colInfo.map { v =>
> StructField(v._1 + "_freqItems", ArrayType(v._2, false))
>  }
>  val schema = StructType(outputCols).toAttributes
>  Dataset.ofRows(df.sparkSession, LocalRelation.fromExternalRows(schema, 
> Seq(resultRow)))
> {code}
>  
> However if the column contains frequent nulls then these nulls are included 
> in the frequent items array. This results in various errors such as any 
> attempt to `collect()` resulting in a null pointer exception:
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.Builder().getOrCreate()
> df = spark.createDataFrame([
>     (1, 'a'),
>     (2, None),
>     (3, 'b'),
> ], schema="id INTEGER, val STRING")
> rows = df.freqItems(df.columns).collect()
> {code}
>  Results in:
> {code:java}
> Traceback (most recent call last):                                            
>   
>   File "", line 1, in 
>   File 
> "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/dataframe.py", 
> line 533, in collect
>     sock_info = self._jdf.collectToPython()
>   File 
> "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/utils.py", line 
> 63, in deco
>     return f(*a, **kw)
>   File 
> "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o40.collectToPython.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:296)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows$lzycompute(LocalTableScanExec.scala:44)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows(LocalTableScanExec.scala:39)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.executeCollect(LocalTableScanExec.scala:70)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3257)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3254)
>   at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
>   at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3254)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at 

[jira] [Commented] (SPARK-28818) FrequentItems applies an incorrect schema to the resulting dataframe when nulls are present

2020-08-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169500#comment-17169500
 ] 

Apache Spark commented on SPARK-28818:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29327

> FrequentItems applies an incorrect schema to the resulting dataframe when 
> nulls are present
> ---
>
> Key: SPARK-28818
> URL: https://issues.apache.org/jira/browse/SPARK-28818
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Matt Hawes
>Assignee: Matt Hawes
>Priority: Minor
> Fix For: 3.0.0
>
>
> A trivially reproducible bug in the code for `FrequentItems`. The schema for 
> the resulting arrays of frequent items is [hard coded|#L122]] to have 
> non-nullable array elements:
> {code:scala}
> val outputCols = colInfo.map { v =>
> StructField(v._1 + "_freqItems", ArrayType(v._2, false))
>  }
>  val schema = StructType(outputCols).toAttributes
>  Dataset.ofRows(df.sparkSession, LocalRelation.fromExternalRows(schema, 
> Seq(resultRow)))
> {code}
>  
> However if the column contains frequent nulls then these nulls are included 
> in the frequent items array. This results in various errors such as any 
> attempt to `collect()` resulting in a null pointer exception:
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.Builder().getOrCreate()
> df = spark.createDataFrame([
>     (1, 'a'),
>     (2, None),
>     (3, 'b'),
> ], schema="id INTEGER, val STRING")
> rows = df.freqItems(df.columns).collect()
> {code}
>  Results in:
> {code:java}
> Traceback (most recent call last):                                            
>   
>   File "", line 1, in 
>   File 
> "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/dataframe.py", 
> line 533, in collect
>     sock_info = self._jdf.collectToPython()
>   File 
> "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/pyspark/sql/utils.py", line 
> 63, in deco
>     return f(*a, **kw)
>   File 
> "/usr/local/bin/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o40.collectToPython.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:296)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows$lzycompute(LocalTableScanExec.scala:44)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.unsafeRows(LocalTableScanExec.scala:39)
>   at 
> org.apache.spark.sql.execution.LocalTableScanExec.executeCollect(LocalTableScanExec.scala:70)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3257)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:3254)
>   at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
>   at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3254)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at 

[jira] [Commented] (SPARK-32502) Please fix CVE related to Guava 14.0.1

2020-08-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169466#comment-17169466
 ] 

Apache Spark commented on SPARK-32502:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29326

> Please fix CVE related to Guava 14.0.1
> --
>
> Key: SPARK-32502
> URL: https://issues.apache.org/jira/browse/SPARK-32502
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Rodney Aaron Stainback
>Priority: Major
>
> Please fix the following CVE related to Guava 14.0.1
> |cve|severity|cvss|
> |CVE-2018-10237|medium|5.9|
>  
> Our security team is trying to block us from using spark because of this issue
>  
> One thing that's very weird is I see from this [pom 
> file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/pom.xml]]
>  you reference guava but it's not clear what version.
>  
> But if I look on 
> [maven|[https://mvnrepository.com/artifact/org.apache.spark/spark-network-common_2.12/3.0.0]]
>  the guava reference is not showing up
>  
> Is this reference somehow being shaded into the network common jar?  It's not 
> clear to me.
>  
> Also, I've noticed code like [this 
> file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/src/main/java/org/apache/spark/network/util/LimitedInputStream.java]]
>  which is a copy-paste of some guava source code.
>  
> The CVE scanner we use Twistlock/Palo Alto Networks - Prisma Cloud Compute 
> Edition is very thorough and will find CVEs in copy-pasted code and shaded 
> jars.
>  
> Please fix this CVE so we can use spark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32502) Please fix CVE related to Guava 14.0.1

2020-08-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169464#comment-17169464
 ] 

Apache Spark commented on SPARK-32502:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29326

> Please fix CVE related to Guava 14.0.1
> --
>
> Key: SPARK-32502
> URL: https://issues.apache.org/jira/browse/SPARK-32502
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Rodney Aaron Stainback
>Priority: Major
>
> Please fix the following CVE related to Guava 14.0.1
> |cve|severity|cvss|
> |CVE-2018-10237|medium|5.9|
>  
> Our security team is trying to block us from using spark because of this issue
>  
> One thing that's very weird is I see from this [pom 
> file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/pom.xml]]
>  you reference guava but it's not clear what version.
>  
> But if I look on 
> [maven|[https://mvnrepository.com/artifact/org.apache.spark/spark-network-common_2.12/3.0.0]]
>  the guava reference is not showing up
>  
> Is this reference somehow being shaded into the network common jar?  It's not 
> clear to me.
>  
> Also, I've noticed code like [this 
> file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/src/main/java/org/apache/spark/network/util/LimitedInputStream.java]]
>  which is a copy-paste of some guava source code.
>  
> The CVE scanner we use Twistlock/Palo Alto Networks - Prisma Cloud Compute 
> Edition is very thorough and will find CVEs in copy-pasted code and shaded 
> jars.
>  
> Please fix this CVE so we can use spark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32502) Please fix CVE related to Guava 14.0.1

2020-08-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32502:


Assignee: (was: Apache Spark)

> Please fix CVE related to Guava 14.0.1
> --
>
> Key: SPARK-32502
> URL: https://issues.apache.org/jira/browse/SPARK-32502
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Rodney Aaron Stainback
>Priority: Major
>
> Please fix the following CVE related to Guava 14.0.1
> |cve|severity|cvss|
> |CVE-2018-10237|medium|5.9|
>  
> Our security team is trying to block us from using spark because of this issue
>  
> One thing that's very weird is I see from this [pom 
> file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/pom.xml]]
>  you reference guava but it's not clear what version.
>  
> But if I look on 
> [maven|[https://mvnrepository.com/artifact/org.apache.spark/spark-network-common_2.12/3.0.0]]
>  the guava reference is not showing up
>  
> Is this reference somehow being shaded into the network common jar?  It's not 
> clear to me.
>  
> Also, I've noticed code like [this 
> file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/src/main/java/org/apache/spark/network/util/LimitedInputStream.java]]
>  which is a copy-paste of some guava source code.
>  
> The CVE scanner we use Twistlock/Palo Alto Networks - Prisma Cloud Compute 
> Edition is very thorough and will find CVEs in copy-pasted code and shaded 
> jars.
>  
> Please fix this CVE so we can use spark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32502) Please fix CVE related to Guava 14.0.1

2020-08-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32502:


Assignee: Apache Spark

> Please fix CVE related to Guava 14.0.1
> --
>
> Key: SPARK-32502
> URL: https://issues.apache.org/jira/browse/SPARK-32502
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Rodney Aaron Stainback
>Assignee: Apache Spark
>Priority: Major
>
> Please fix the following CVE related to Guava 14.0.1
> |cve|severity|cvss|
> |CVE-2018-10237|medium|5.9|
>  
> Our security team is trying to block us from using spark because of this issue
>  
> One thing that's very weird is I see from this [pom 
> file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/pom.xml]]
>  you reference guava but it's not clear what version.
>  
> But if I look on 
> [maven|[https://mvnrepository.com/artifact/org.apache.spark/spark-network-common_2.12/3.0.0]]
>  the guava reference is not showing up
>  
> Is this reference somehow being shaded into the network common jar?  It's not 
> clear to me.
>  
> Also, I've noticed code like [this 
> file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/src/main/java/org/apache/spark/network/util/LimitedInputStream.java]]
>  which is a copy-paste of some guava source code.
>  
> The CVE scanner we use Twistlock/Palo Alto Networks - Prisma Cloud Compute 
> Edition is very thorough and will find CVEs in copy-pasted code and shaded 
> jars.
>  
> Please fix this CVE so we can use spark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32502) Please fix CVE related to Guava 14.0.1

2020-08-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169465#comment-17169465
 ] 

Apache Spark commented on SPARK-32502:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29325

> Please fix CVE related to Guava 14.0.1
> --
>
> Key: SPARK-32502
> URL: https://issues.apache.org/jira/browse/SPARK-32502
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Rodney Aaron Stainback
>Priority: Major
>
> Please fix the following CVE related to Guava 14.0.1
> |cve|severity|cvss|
> |CVE-2018-10237|medium|5.9|
>  
> Our security team is trying to block us from using spark because of this issue
>  
> One thing that's very weird is I see from this [pom 
> file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/pom.xml]]
>  you reference guava but it's not clear what version.
>  
> But if I look on 
> [maven|[https://mvnrepository.com/artifact/org.apache.spark/spark-network-common_2.12/3.0.0]]
>  the guava reference is not showing up
>  
> Is this reference somehow being shaded into the network common jar?  It's not 
> clear to me.
>  
> Also, I've noticed code like [this 
> file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/src/main/java/org/apache/spark/network/util/LimitedInputStream.java]]
>  which is a copy-paste of some guava source code.
>  
> The CVE scanner we use Twistlock/Palo Alto Networks - Prisma Cloud Compute 
> Edition is very thorough and will find CVEs in copy-pasted code and shaded 
> jars.
>  
> Please fix this CVE so we can use spark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org