[jira] [Assigned] (SPARK-28089) File source v2: support reading output of file streaming Sink

2019-06-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28089:


Assignee: Apache Spark

> File source v2: support reading output of file streaming Sink
> -
>
> Key: SPARK-28089
> URL: https://issues.apache.org/jira/browse/SPARK-28089
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> File source V1 supports reading the output of FileStreamSink as a batch. 
> https://github.com/apache/spark/pull/11897
> We should support this in file source V2 as well. When reading with paths, we 
> first check whether there is a metadata log of FileStreamSink. If yes, we use 
> `MetadataLogFileIndex` for listing files; otherwise, we use 
> `InMemoryFileIndex`.
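
A minimal end-to-end sketch of the behaviour being ported (paths and the rate source are placeholders; this is how the V1 path already behaves, not the new V2 implementation itself):

{code:java}
import org.apache.spark.sql.streaming.Trigger

// Write with the file streaming sink; it maintains a _spark_metadata log under the output path.
val out = "/tmp/file-sink-out"        // placeholder path
val ckpt = "/tmp/file-sink-ckpt"      // placeholder path
val query = spark.readStream.format("rate").load()
  .writeStream
  .format("parquet")
  .option("path", out)
  .option("checkpointLocation", ckpt)
  .trigger(Trigger.Once())
  .start()
query.awaitTermination()

// Batch read of the sink output; the reader should consult the metadata log
// (MetadataLogFileIndex) instead of a plain directory listing (InMemoryFileIndex).
spark.read.parquet(out).show()
{code}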



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28089) File source v2: support reading output of file streaming Sink

2019-06-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28089:


Assignee: (was: Apache Spark)

> File source v2: support reading output of file streaming Sink
> -
>
> Key: SPARK-28089
> URL: https://issues.apache.org/jira/browse/SPARK-28089
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> File source V1 supports reading the output of FileStreamSink as a batch. 
> https://github.com/apache/spark/pull/11897
> We should support this in file source V2 as well. When reading with paths, we 
> first check whether there is a metadata log of FileStreamSink. If yes, we use 
> `MetadataLogFileIndex` for listing files; otherwise, we use 
> `InMemoryFileIndex`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28089) File source v2: support reading output of file streaming Sink

2019-06-17 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-28089:
--

 Summary: File source v2: support reading output of file streaming 
Sink
 Key: SPARK-28089
 URL: https://issues.apache.org/jira/browse/SPARK-28089
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


File source V1 supports reading the output of FileStreamSink as a batch. 
https://github.com/apache/spark/pull/11897
We should support this in file source V2 as well. When reading with paths, we 
first check whether there is a metadata log of FileStreamSink. If yes, we use 
`MetadataLogFileIndex` for listing files; otherwise, we use `InMemoryFileIndex`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28088) String Functions: Enhance LPAD/RPAD function

2019-06-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28088:


Assignee: Apache Spark

> String Functions: Enhance LPAD/RPAD function
> 
>
> Key: SPARK-28088
> URL: https://issues.apache.org/jira/browse/SPARK-28088
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28088) String Functions: Enhance LPAD/RPAD function

2019-06-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28088:


Assignee: (was: Apache Spark)

> String Functions: Enhance LPAD/RPAD function
> 
>
> Key: SPARK-28088
> URL: https://issues.apache.org/jira/browse/SPARK-28088
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28088) String Functions: Enhance LPAD/RPAD function

2019-06-17 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28088:
---

 Summary: String Functions: Enhance LPAD/RPAD function
 Key: SPARK-28088
 URL: https://issues.apache.org/jira/browse/SPARK-28088
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28058.
--
   Resolution: Fixed
Fix Version/s: 2.4.4
   3.0.0

Issue resolved by pull request 24894
[https://github.com/apache/spark/pull/24894]

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Assignee: Liang-Chi Hsieh
>Priority: Minor
>  Labels: CSV, csv, csvparser
> Fix For: 3.0.0, 2.4.4
>
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +------+
> |fruit |
> +------+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +------+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +------+------+
> |fruit |color |
> +------+------+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +------+------+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +------+------+-----+
> |fruit |color |price|
> +------+------+-----+
> |apple |red   |1    |
> |banana|yellow|2    |
> |orange|orange|3    |
> |xxx   |null  |null |
> +------+------+-----+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28058:


Assignee: Liang-Chi Hsieh

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Assignee: Liang-Chi Hsieh
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +------+
> |fruit |
> +------+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +------+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +------+------+
> |fruit |color |
> +------+------+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +------+------+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +------+------+-----+
> |fruit |color |price|
> +------+------+-----+
> |apple |red   |1    |
> |banana|yellow|2    |
> |orange|orange|3    |
> |xxx   |null  |null |
> +------+------+-----+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28072) Fix IncompatibleClassChangeError in `FromUnixTime` codegen on JDK9+

2019-06-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28072:
--
Summary: Fix IncompatibleClassChangeError in `FromUnixTime` codegen on 
JDK9+  (was: Use `Iso8601TimestampFormatter` in `FromUnixTime` codegen to fix 
`IncompatibleClassChangeError` in JDK9+)

> Fix IncompatibleClassChangeError in `FromUnixTime` codegen on JDK9+
> ---
>
> Key: SPARK-28072
> URL: https://issues.apache.org/jira/browse/SPARK-28072
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> With JDK9+, the generated bytecode of `FromUnixTime` raises 
> `java.lang.IncompatibleClassChangeError` due to 
> [JDK-8145148|https://bugs.openjdk.java.net/browse/JDK-8145148].
> This is a blocker in our JDK11 Jenkins job.
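
A rough repro sketch (assuming a spark-shell on JDK 9+ with whole-stage codegen left enabled); from_unixtime is backed by the FromUnixTime expression, so evaluating it should exercise the generated code path described above:

{code:java}
spark.conf.set("spark.sql.codegen.wholeStage", true)
spark.sql("SELECT from_unixtime(0, 'yyyy-MM-dd HH:mm:ss')").show()
{code}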



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28056) Document SCALAR_ITER Pandas UDF

2019-06-17 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-28056.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24897
[https://github.com/apache/spark/pull/24897]

> Document SCALAR_ITER Pandas UDF
> ---
>
> Key: SPARK-28056
> URL: https://issues.apache.org/jira/browse/SPARK-28056
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
> Fix For: 3.0.0
>
>
> After SPARK-26412, we should document the new SCALAR_ITER Pandas UDF so user 
> can discover the feature and learn how to use it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28087) String Functions: Add support split_part

2019-06-17 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28087:
---

 Summary: String Functions: Add support split_part
 Key: SPARK-28087
 URL: https://issues.apache.org/jira/browse/SPARK-28087
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


||Function||Return Type||Description||Example||Result||
|{{split_part(string text, delimiter text, field int)}}|{{text}}|Split _string_ on _delimiter_ and return the given field (counting from one)|{{split_part('abc~@~def~@~ghi', '~@~', 2)}}|{{def}}|


https://www.postgresql.org/docs/11/functions-string.html
http://prestodb.github.io/docs/current/functions/string.html
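
Not the proposed built-in, just a sketch (in spark-shell) of getting the same result today with the existing split and element_at functions, for comparison:

{code:java}
import org.apache.spark.sql.functions.{split, element_at}

val df = Seq("abc~@~def~@~ghi").toDF("s")
// split on the delimiter, then take the 1-based field, mirroring split_part(s, '~@~', 2)
df.select(element_at(split($"s", "~@~"), 2).as("part")).show()   // -> def
{code}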



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27666) Do not release lock while TaskContext already completed

2019-06-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27666.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24699
[https://github.com/apache/spark/pull/24699]

> Do not release lock while TaskContext already completed
> ---
>
> Key: SPARK-27666
> URL: https://issues.apache.org/jira/browse/SPARK-27666
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:java}
> Exception in thread "Thread-14" java.lang.AssertionError: assertion failed: 
> Block rdd_0_0 is not locked for reading
>   at scala.Predef$.assert(Predef.scala:223)
>   at 
> org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
>   at 
> org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:1000)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$getLocalValues$5(BlockManager.scala:746)
>   at 
> org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:47)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:36)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at org.apache.spark.rdd.RDDSuite.$anonfun$new$265(RDDSuite.scala:1185)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> We're facing an issue reported by SPARK-18406 and SPARK-25139. And 
> [https://github.com/apache/spark/pull/24542] bypassed the issue by capturing 
> the assertion error to avoid failing the executor. However, when not using 
> PySpark, the issue still exists when a user implements a custom 
> RDD(https://issues.apache.org/jira/browse/SPARK-18406?focusedCommentId=15969384&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15969384)
>  or task (see the demo below), which spawns a separate thread to consume an iterator 
> from a cached parent RDD.
> {code:java}
> val rdd0 = sc.parallelize(Range(0, 10), 1).cache()
> rdd0.collect()
> rdd0.mapPartitions { iter =>
>   val t = new Thread(new Runnable {
>     override def run(): Unit = {
>       while (iter.hasNext) {
>         println(iter.next())
>         Thread.sleep(1000)
>       }
>     }
>   })
>   t.setDaemon(false)
>   t.start()
>   Iterator(0)
> }.collect()
> {code}
> We can easily reproduce the issue using the demo above.
> If we prevent the separate thread from releasing the lock on the block when 
> the TaskContext has already completed, 
> then we won't hit this issue again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27666) Do not release lock while TaskContext already completed

2019-06-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27666:
---

Assignee: wuyi

> Do not release lock while TaskContext already completed
> ---
>
> Key: SPARK-27666
> URL: https://issues.apache.org/jira/browse/SPARK-27666
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:java}
> Exception in thread "Thread-14" java.lang.AssertionError: assertion failed: 
> Block rdd_0_0 is not locked for reading
>   at scala.Predef$.assert(Predef.scala:223)
>   at 
> org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
>   at 
> org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:1000)
>   at 
> org.apache.spark.storage.BlockManager.$anonfun$getLocalValues$5(BlockManager.scala:746)
>   at 
> org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:47)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:36)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at org.apache.spark.rdd.RDDSuite.$anonfun$new$265(RDDSuite.scala:1185)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> We're facing an issue reported by SPARK-18406 and SPARK-25139. And 
> [https://github.com/apache/spark/pull/24542] bypassed the issue by capturing 
> the assertion error to avoid failing the executor. However, when not using 
> PySpark, the issue still exists when a user implements a custom 
> RDD(https://issues.apache.org/jira/browse/SPARK-18406?focusedCommentId=15969384&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15969384)
>  or task (see the demo below), which spawns a separate thread to consume an iterator 
> from a cached parent RDD.
> {code:java}
> val rdd0 = sc.parallelize(Range(0, 10), 1).cache()
> rdd0.collect()
> rdd0.mapPartitions { iter =>
>   val t = new Thread(new Runnable {
>     override def run(): Unit = {
>       while (iter.hasNext) {
>         println(iter.next())
>         Thread.sleep(1000)
>       }
>     }
>   })
>   t.setDaemon(false)
>   t.start()
>   Iterator(0)
> }.collect()
> {code}
> We can easily reproduce the issue using the demo above.
> If we prevent the separate thread from releasing the lock on the block when 
> the TaskContext has already completed, 
> then we won't hit this issue again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27930) List all built-in UDFs have different names

2019-06-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27930:

Description: 
This ticket lists all built-in UDFs that have different names: 
||PostgreSQL||Spark SQL||Note||
|random|rand| |
|format|format_string|Spark's {{format_string}} is based on the implementation 
of {{java.util.Formatter}},
 which means some PostgreSQL formats cannot be supported, such as: 
{{format_string('>>%-s<<', 'Hello')}}|
|to_hex|hex| |
|strpos|locate/position| |

  was:
This ticket list all built-in UDFs have different names: 
||PostgreSQL||Spark SQL||Note||
|random|rand| |
|format|format_string|Spark's {{format_string}} is based on the implementation 
of {{java.util.Formatter}}.
 Which makes some formats of PostgreSQL can not supported, such as: 
{{format_string('>>%-s<<', 'Hello')}}|
|to_hex|hex| |


> List all built-in UDFs have different names
> ---
>
> Key: SPARK-27930
> URL: https://issues.apache.org/jira/browse/SPARK-27930
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> This ticket lists all built-in UDFs that have different names: 
> ||PostgreSQL||Spark SQL||Note||
> |random|rand| |
> |format|format_string|Spark's {{format_string}} is based on the 
> implementation of {{java.util.Formatter}},
>  which means some PostgreSQL formats cannot be supported, such as: 
> {{format_string('>>%-s<<', 'Hello')}}|
> |to_hex|hex| |
> |strpos|locate/position| |
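
For reference, a quick spark-shell sketch of the Spark SQL spellings listed above (illustrative only; the results in comments are the expected ones):

{code:java}
spark.sql("SELECT rand()").show()                               // PostgreSQL: random()
spark.sql("SELECT format_string('%s world', 'hello')").show()   // PostgreSQL: format()
spark.sql("SELECT hex(255)").show()                             // PostgreSQL: to_hex(), returns FF
spark.sql("SELECT locate('def', 'abcdef'), position('def', 'abcdef')").show()  // PostgreSQL: strpos(), both return 4
{code}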



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27969) Non-deterministic expressions in filters or projects can unnecessarily prevent all scan-time column pruning, harming performance

2019-06-17 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866151#comment-16866151
 ] 

Wenchen Fan commented on SPARK-27969:
-

We should think about what non-deterministic means in Spark, and what are the 
guarantees. I do agree it's too conservative in this case, but we should think 
of the big picture before fixing a specific case.

AFAIK, expression in Hive has 2 flags: deterministic and stateful. They have 
different guarantees. I haven't looked into Presto/Impala though.

> Non-deterministic expressions in filters or projects can unnecessarily 
> prevent all scan-time column pruning, harming performance
> 
>
> Key: SPARK-27969
> URL: https://issues.apache.org/jira/browse/SPARK-27969
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> If a scan operator is followed by a projection or filter and those operators 
> contain _any_ non-deterministic expressions then scan column pruning 
> optimizations are completely skipped, harming query performance.
> Here's an example of the problem:
> {code:java}
> import org.apache.spark.sql.functions._
> val df = spark.createDataset(Seq(
>   (1, 2, 3, 4, 5),
>   (1, 2, 3, 4, 5)
> ))
> val tmpPath = 
> java.nio.file.Files.createTempDirectory("column-pruning-bug").toString()
> df.write.parquet(tmpPath)
> val fromParquet = spark.read.parquet(tmpPath){code}
> If all expressions are deterministic then, as expected, column pruning is 
> pushed into the scan
> {code:java}
> fromParquet.select("_1").explain
> == Physical Plan == *(1) FileScan parquet [_1#68] Batched: true, DataFilters: 
> [], Format: Parquet, Location: 
> InMemoryFileIndex[dbfs:/local_disk0/tmp/column-pruning-bug7865798834978814548],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:int>{code}
> However, if we add a non-deterministic filter then no column pruning is 
> performed (even though pruning would be safe!):
> {code:java}
> fromParquet.select("_1").filter(rand() =!= 0).explain
> == Physical Plan ==
> *(1) Project [_1#127]
> +- *(1) Filter NOT (rand(-1515289268025792238) = 0.0)
> +- *(1) FileScan parquet [_1#127,_2#128,_3#129,_4#130,_5#131] Batched: true, 
> DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[dbfs:/local_disk0/tmp/column-pruning-bug4043817424882943496],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<_1:int,_2:int,_3:int,_4:int,_5:int>{code}
> Similarly, a non-deterministic expression in a second projection can end up 
> being collapsed such that it prevents column pruning:
> {code:java}
> fromParquet.select("_1").select($"_1", rand()).explain
> == Physical Plan ==
> *(1) Project [_1#127, rand(1267140591146561563) AS 
> rand(1267140591146561563)#141]
> +- *(1) FileScan parquet [_1#127,_2#128,_3#129,_4#130,_5#131] Batched: true, 
> DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[dbfs:/local_disk0/tmp/column-pruning-bug4043817424882943496],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<_1:int,_2:int,_3:int,_4:int,_5:int>
> {code}
> I believe that this is caused by SPARK-10316: the Parquet column pruning code 
> relies on the [{{PhysicalOperation}} unapply 
> method|https://github.com/apache/spark/blob/v2.4.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L43]
>  for extracting projects and filters and this helper purposely fails to match 
> if _any_ projection or filter is non-deterministic.
> It looks like this conservative behavior may have originally been added to 
> avoid pushdown / re-ordering of non-deterministic filter expressions. 
> However, in this case I feel that it's _too_ conservative: even though we 
> can't push down non-deterministic filters we should still be able to perform 
> column pruning. 
> /cc [~cloud_fan] and [~marmbrus] (it looks like you [discussed collapsing of 
> non-deterministic 
> projects|https://github.com/apache/spark/pull/8486#issuecomment-136036533] in 
> the SPARK-10316 PR, which is related to why the third example above did not 
> prune).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24175) improve the Spark 2.4 migration guide

2019-06-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24175.
-
Resolution: Won't Fix

> improve the Spark 2.4 migration guide
> -
>
> Key: SPARK-24175
> URL: https://issues.apache.org/jira/browse/SPARK-24175
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The current Spark 2.4 migration guide is not well phrased. We should
> 1. State the before behavior
> 2. State the after behavior
> 3. Add a concrete example with code to illustrate.
> For example:
> Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after 
> promotes both sides to TIMESTAMP. To set `false` to 
> `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous 
> behavior. This option will be removed in Spark 3.0.
> -->
> In version 2.3 and earlier, Spark implicitly casts a timestamp column to date 
> type when comparing with a date column. In version 2.4 and later, Spark casts 
> the date column to timestamp type instead. As an example, "xxx" would result 
> in ".." in Spark 2.3, and in Spark 2.4, the result would be "..."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled

2019-06-17 Thread Mark Sirek (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Sirek updated SPARK-28067:
---
Description: 
The following test case involving a join followed by a sum aggregation returns 
the wrong answer for the sum:

 
{code:java}
val df = Seq(
 (BigDecimal("1000"), 1),
 (BigDecimal("1000"), 1),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
"intNum").agg(sum("decNum"))
scala> df2.show(40,false)
+-----------+
|sum(decNum)|
+-----------+
|4000.00    |
+-----------+
{code}
 

The result should be 104000..

It appears a partial sum is computed for each join key, as the result returned 
would be the answer for all rows matching intNum === 1.

If only the rows with intNum === 2 are included, the answer given is null:

 
{code:java}
scala> val df3 = df.filter($"intNum" === lit(2))
 df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: 
decimal(38,18), intNum: int]
scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, 
"intNum").agg(sum("decNum"))
 df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
scala> df4.show(40,false)
+-----------+
|sum(decNum)|
+-----------+
|null       |
+-----------+
{code}
 

The correct answer, 10., doesn't fit in the 
DataType picked for the result, decimal(38,18), so an overflow occurs, which 
Spark then converts to null.

The first example, which doesn't filter out the intNum === 1 values should also 
return null, indicating overflow, but it doesn't.  This may mislead the user to 
think a valid sum was computed.

If whole-stage code gen is turned off:

spark.conf.set("spark.sql.codegen.wholeStage", false)

... incorrect results are not returned because the overflow is caught as an 
exception:

java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 
exceeds max precision 38

  was:
The following test case involving a join followed by a sum aggregation returns 
the wrong answer for the sum:

 
{code:java}
val df = Seq(
 (BigDecimal("1000"), 1),
 (BigDecimal("1000"), 1),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2),
 (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
"intNum").agg(sum("decNum"))
scala> df2.show(40,false)
+-----------+
|sum(decNum)|
+-----------+
|4000.00    |
+-----------+
{code}
 

The result should be 104000..

It appears a partial sum is computed for each join key, as the result returned 
would be the answer for all rows matching intNum === 1.

If only the rows with intNum === 2 are included, the answer given is null:

 
{code:java}
scala> val df3 = df.filter($"intNum" === lit(2))
 df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: 
decimal(38,18), intNum: int]
scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, 
"intNum").agg(sum("decNum"))
 df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
scala> df4.show(40,false)
+-----------+
|sum(decNum)|
+-----------+
|null       |
+-----------+
{code}
 

The correct answer, 10., doesn't fit in the 
DataType picked for the result, decimal(38,18), so the overflow is converted to 
null.

The first example, which doesn't filter out the intNum === 1 values should also 
return null, indicating overflow, but it doesn't.  This may mislead the user to 
think a valid sum was computed.

If whole-stage code gen is turned off:

spark.conf.set("spark.sql.codegen.wholeStage", false)

... incorrect results are not returned because the overflow is caught as an 
exception:

java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 
exceeds max precision 38

> Incorrect results in decimal aggregation with whole-stage 

[jira] [Resolved] (SPARK-28082) Add a note to DROPMALFORMED mode of CSV for column pruning

2019-06-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28082.
--
Resolution: Duplicate

> Add a note to DROPMALFORMED mode of CSV for column pruning
> --
>
> Key: SPARK-28082
> URL: https://issues.apache.org/jira/browse/SPARK-28082
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Trivial
>
> This is inspired by SPARK-28058.
> When using {{DROPMALFORMED}} mode, corrupted records aren't dropped if 
> malformed columns aren't read. This behavior is due to CSV parser column 
> pruning. Current doc of {{DROPMALFORMED}} doesn't mention the effect of 
> column pruning. Users will be confused by the fact that {{DROPMALFORMED}} 
> mode doesn't work as expected.
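
A short sketch of the interaction that the note should describe (the config name below is the CSV column-pruning flag as of Spark 2.4; fruit.csv is the file from SPARK-28058):

{code:java}
// With CSV parser column pruning enabled (the default), DROPMALFORMED can keep
// malformed rows when only a subset of the columns is selected.
// Disabling pruning restores the expected dropping behaviour.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)
spark.read.option("header", "true").option("mode", "DROPMALFORMED")
  .csv("fruit.csv").select("fruit").show()
{code}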



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28082) Add a note to DROPMALFORMED mode of CSV for column pruning

2019-06-17 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866098#comment-16866098
 ] 

Hyukjin Kwon commented on SPARK-28082:
--

Sorry, [~viirya] for back and forth. I think your first judgement was correct.

> Add a note to DROPMALFORMED mode of CSV for column pruning
> --
>
> Key: SPARK-28082
> URL: https://issues.apache.org/jira/browse/SPARK-28082
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Trivial
>
> This is inspired by SPARK-28058.
> When using {{DROPMALFORMED}} mode, corrupted records aren't dropped if 
> malformed columns aren't read. This behavior is due to CSV parser column 
> pruning. Current doc of {{DROPMALFORMED}} doesn't mention the effect of 
> column pruning. Users will be confused by the fact that {{DROPMALFORMED}} 
> mode doesn't work as expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866097#comment-16866097
 ] 

Hyukjin Kwon commented on SPARK-28058:
--

Yes, that was what I saw, and I thought it was a bug in Univocity. On second 
thought, yes, it looks like intended behaviour in Univocity (I guess). Thanks, 
guys, for the clarification.

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +------+
> |fruit |
> +------+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +------+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +------+------+
> |fruit |color |
> +------+------+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +------+------+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +------+------+-----+
> |fruit |color |price|
> +------+------+-----+
> |apple |red   |1    |
> |banana|yellow|2    |
> |orange|orange|3    |
> |xxx   |null  |null |
> +------+------+-----+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27930) List all built-in UDFs have different names

2019-06-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27930:

Description: 
This ticket lists all built-in UDFs that have different names: 
||PostgreSQL||Spark SQL||Note||
|random|rand| |
|format|format_string|Spark's {{format_string}} is based on the implementation 
of {{java.util.Formatter}},
 which means some PostgreSQL formats cannot be supported, such as: 
{{format_string('>>%-s<<', 'Hello')}}|
|to_hex|hex| |

  was:
This ticket list all built-in UDFs have different names: 
|PostgreSQL|Spark SQL|Note|
|random|rand| |
|format|format_string|Spark's {{format_string}} is based on the implementation 
of {{java.util.Formatter}}.
 Which makes some formats of PostgreSQL can not supported, such as: 
{{format_string('>>%-s<<', 'Hello')}}|


> List all built-in UDFs have different names
> ---
>
> Key: SPARK-27930
> URL: https://issues.apache.org/jira/browse/SPARK-27930
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> This ticket lists all built-in UDFs that have different names: 
> ||PostgreSQL||Spark SQL||Note||
> |random|rand| |
> |format|format_string|Spark's {{format_string}} is based on the 
> implementation of {{java.util.Formatter}},
>  which means some PostgreSQL formats cannot be supported, such as: 
> {{format_string('>>%-s<<', 'Hello')}}|
> |to_hex|hex| |



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28041) Increase the minimum pandas version to 0.23.2

2019-06-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28041.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24867
[https://github.com/apache/spark/pull/24867]

> Increase the minimum pandas version to 0.23.2
> -
>
> Key: SPARK-28041
> URL: https://issues.apache.org/jira/browse/SPARK-28041
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the minimum supported Pandas version is 0.19.2. We bumped up 
> testing in the Jenkins env to 0.23.2 and since 0.19.2 was released nearly 3 
> years ago, it is not always compatible with other Python libraries. 
> Increasing the version to 0.23.2 will also allow some workarounds to be 
> removed and make maintenance easier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28041) Increase the minimum pandas version to 0.23.2

2019-06-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28041:


Assignee: Hyukjin Kwon

> Increase the minimum pandas version to 0.23.2
> -
>
> Key: SPARK-28041
> URL: https://issues.apache.org/jira/browse/SPARK-28041
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, the minimum supported Pandas version is 0.19.2. We bumped up 
> testing in the Jenkins env to 0.23.2 and since 0.19.2 was released nearly 3 
> years ago, it is not always compatible with other Python libraries. 
> Increasing the version to 0.23.2 will also allow some workarounds to be 
> removed and make maintenance easier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28041) Increase the minimum pandas version to 0.23.2

2019-06-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28041:


Assignee: Bryan Cutler  (was: Hyukjin Kwon)

> Increase the minimum pandas version to 0.23.2
> -
>
> Key: SPARK-28041
> URL: https://issues.apache.org/jira/browse/SPARK-28041
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the minimum supported Pandas version is 0.19.2. We bumped up 
> testing in the Jenkins env to 0.23.2 and since 0.19.2 was released nearly 3 
> years ago, it is not always compatible with other Python libraries. 
> Increasing the version to 0.23.2 will also allow some workarounds to be 
> removed and make maintenance easier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28056) Document SCALAR_ITER Pandas UDF

2019-06-17 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-28056:
-

Assignee: Xiangrui Meng

> Document SCALAR_ITER Pandas UDF
> ---
>
> Key: SPARK-28056
> URL: https://issues.apache.org/jira/browse/SPARK-28056
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> After SPARK-26412, we should document the new SCALAR_ITER Pandas UDF so user 
> can discover the feature and learn how to use it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28006) User-defined grouped transform pandas_udf for window operations

2019-06-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28006:


Assignee: Apache Spark

> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Assignee: Apache Spark
>Priority: Major
>
> Currently, pandas_udf supports a "grouped aggregate" type that can be used with 
> bounded and unbounded windows. There is another set of use cases that can 
> benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as a N -> N mapping over a group. For example, 
> "compute zscore for values in the group using the grouped mean and grouped 
> stdev", or "rank the values in the group".
> Currently, in order to do this, user needs to use "grouped apply", for 
> example:
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def subtract_mean(pdf):
>     v = pdf['v']
>     pdf['v'] = v - v.mean()
>     return pdf
> df.groupby('id').apply(subtract_mean)
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> This approach has a few downsides:
>  * Specifying the full return schema is complicated for the user although the 
> function only changes one column.
>  * The column name 'v' hard-coded inside the udf makes the udf less reusable.
>  * The entire dataframe is serialized to pass to Python although only one 
> column is needed.
> Here we propose a new type of pandas_udf to work with these types of use 
> cases:
> {code:java}
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v"))
> @pandas_udf('double', GROUPED_XFORM)
> def subtract_mean(v):
>     return v - v.mean()
> w = Window.partitionBy('id')
> df = df.withColumn('v', subtract_mean(df['v']).over(w))
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> Which addresses the above downsides.
>  * The user only needs to specify the output type of a single column.
>  * The column being zscored is decoupled from the udf implementation
>  * We only need to send one column to Python worker and concat the result 
> with the original dataframe (this is what grouped aggregate is doing already)
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28006) User-defined grouped transform pandas_udf for window operations

2019-06-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28006:


Assignee: (was: Apache Spark)

> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
> Currently, pandas_udf supports a "grouped aggregate" type that can be used with 
> bounded and unbounded windows. There is another set of use cases that can 
> benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as a N -> N mapping over a group. For example, 
> "compute zscore for values in the group using the grouped mean and grouped 
> stdev", or "rank the values in the group".
> Currently, in order to do this, user needs to use "grouped apply", for 
> example:
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def subtract_mean(pdf):
>     v = pdf['v']
>     pdf['v'] = v - v.mean()
>     return pdf
> df.groupby('id').apply(subtract_mean)
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> This approach has a few downsides:
>  * Specifying the full return schema is complicated for the user although the 
> function only changes one column.
>  * The column name 'v' hard-coded inside the udf makes the udf less reusable.
>  * The entire dataframe is serialized to pass to Python although only one 
> column is needed.
> Here we propose a new type of pandas_udf to work with these types of use 
> cases:
> {code:java}
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v"))
> @pandas_udf('double', GROUPED_XFORM)
> def subtract_mean(v):
>     return v - v.mean()
> w = Window.partitionBy('id')
> df = df.withColumn('v', subtract_mean(df['v']).over(w))
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> Which addresses the above downsides.
>  * The user only needs to specify the output type of a single column.
>  * The column being zscored is decoupled from the udf implementation
>  * We only need to send one column to Python worker and concat the result 
> with the original dataframe (this is what grouped aggregate is doing already)
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27767) Built-in function: generate_series

2019-06-17 Thread Dylan Guedes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865912#comment-16865912
 ] 

Dylan Guedes edited comment on SPARK-27767 at 6/17/19 8:02 PM:
---

[~smilegator] by the way, I just checked and there is a (minor) difference: 
when you use `range()` and define it as a sub-query called `x`, for instance, 
the default name for the column becomes `x.id` instead of just `x`, which is the 
behaviour in Postgres. For instance:

{code:sql}
from range(-32766, -32764) x;
{code}
In Spark, it looks like you should reference these values as `x.id`. Meanwhile, 
in Postgres you can refer to them as just `x`. 

EDIT: Btw, this call also does not work:

{code:sql}
SELECT range(1, 100) OVER () FROM empsalary
{code}



was (Author: dylanguedes):
[~smilegator] by the way, I just checked and there is a (minor) difference: 
when you use `range()` and define it as a sub-query called `x`, for instance, 
the default name for the column became `x.id`, instead of just `x`, that is the 
behaviour in Postgres. For instance:

{code:sql}
from range(-32766, -32764) x;
{code}
In Spark, looks like you should reference to these values as `x.id`. Meanwhile, 
in Postgres you can call them through just `x`. 

> Built-in function: generate_series
> --
>
> Key: SPARK-27767
> URL: https://issues.apache.org/jira/browse/SPARK-27767
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> [https://www.postgresql.org/docs/9.1/functions-srf.html]
> generate_series(start, stop): Generate a series of values, from start to stop 
> with a step size of one
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27767) Built-in function: generate_series

2019-06-17 Thread Dylan Guedes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865912#comment-16865912
 ] 

Dylan Guedes commented on SPARK-27767:
--

[~smilegator] by the way, I just checked and there is a (minor) difference: 
when you use `range()` and define it as a sub-query called `x`, for instance, 
the default name for the column becomes `x.id` instead of just `x`, which is the 
behaviour in Postgres. For instance:

{code:sql}
from range(-32766, -32764) x;
{code}
In Spark, it looks like you should reference these values as `x.id`. Meanwhile, 
in Postgres you can refer to them as just `x`. 
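
A quick spark-shell sketch of the naming difference described above (the range() output column is named "id", so through the alias it has to be referenced as x.id):

{code:java}
spark.sql("SELECT x.id FROM range(-32766, -32764) x").show()
// in PostgreSQL the aliased set-returning value can be referenced as just x
{code}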

> Built-in function: generate_series
> --
>
> Key: SPARK-27767
> URL: https://issues.apache.org/jira/browse/SPARK-27767
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> [https://www.postgresql.org/docs/9.1/functions-srf.html]
> generate_series(start, stop): Generate a series of values, from start to stop 
> with a step size of one
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28086) Adds `random()` sql function

2019-06-17 Thread Dylan Guedes (JIRA)
Dylan Guedes created SPARK-28086:


 Summary: Adds `random()` sql function
 Key: SPARK-28086
 URL: https://issues.apache.org/jira/browse/SPARK-28086
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dylan Guedes


Currently, Spark does not have a `random()` function. Postgres, however, does.

For instance, this one is not valid:

{code:sql}
SELECT rank() OVER (ORDER BY rank() OVER (ORDER BY random()))
{code}

It fails because of the `random()` call. On the other hand, [Postgres has 
it.|https://www.postgresql.org/docs/8.2/functions-math.html]
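
A small sketch of the current state (the Spark-named equivalent exists, while the PostgreSQL spelling is not resolved as a built-in yet):

{code:java}
spark.sql("SELECT rand() AS r").show()       // works today
// spark.sql("SELECT random() AS r").show()  // currently fails to resolve without this change
{code}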



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28085) Spark Scala API documentation URLs not working properly in Chrome

2019-06-17 Thread Andrew Leverentz (JIRA)
Andrew Leverentz created SPARK-28085:


 Summary: Spark Scala API documentation URLs not working properly 
in Chrome
 Key: SPARK-28085
 URL: https://issues.apache.org/jira/browse/SPARK-28085
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.4.3
Reporter: Andrew Leverentz


In Chrome version 75, URLs in the Scala API documentation are not working 
properly, which makes them difficult to bookmark.

For example, URLs like the following get redirected to a generic "root" package 
page:

[https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html]

[https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset]

Here's the URL that I get:

[https://spark.apache.org/docs/latest/api/scala/index.html#package]

This issue seems to have appeared between versions 74 and 75 of Chrome, but the 
documentation URLs still work in Safari.  I suspect that this has something to 
do with security-related changes to how Chrome 75 handles frames and/or 
redirects.  I've reported this issue to the Chrome team via the in-browser help 
menu, but I don't have any visibility into their response, so it's not clear 
whether they'll consider this a bug or "working as intended".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27989) Add retries on the connection to the driver

2019-06-17 Thread Jose Luis Pedrosa (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Luis Pedrosa updated SPARK-27989:
--
Description: 
 

Any failure in the executor when trying to connect to the driver will make 
a connection from that process impossible, which will trigger the scheduling of 
another executor.

 

  was:
Due to Java caching negative DNS resolutions, failed lookups are never retried.

Any DNS failure when trying to connect to the driver makes a connection from 
that process impossible.

This happens especially in Kubernetes, where network setup of pods can take some 
time.

 


> Add retries on the connection to the driver
> ---
>
> Key: SPARK-27989
> URL: https://issues.apache.org/jira/browse/SPARK-27989
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Jose Luis Pedrosa
>Priority: Minor
>
>  
> Any failure in the executor when trying to connect to the driver makes a 
> connection from that process impossible, which triggers the scheduling of a 
> replacement executor.
>  
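As a rough illustration of the requested behavior, here is a minimal, hypothetical retry sketch; {{connectToDriver}}, the attempt count, and the backoff values are placeholders and not part of Spark's actual executor bootstrap code.

{code:scala}
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Retry a connection attempt a few times with exponential backoff instead of
// failing the executor on the first transient DNS/network error.
def connectWithRetries[T](maxAttempts: Int, initialDelayMs: Long)(connect: => T): T = {
  @tailrec
  def loop(attempt: Int, delayMs: Long): T = Try(connect) match {
    case Success(conn) => conn
    case Failure(_) if attempt < maxAttempts =>
      Thread.sleep(delayMs)            // wait before the next attempt
      loop(attempt + 1, delayMs * 2)   // exponential backoff
    case Failure(e) => throw e         // out of attempts: fail as before
  }
  loop(1, initialDelayMs)
}

// Hypothetical usage:
// val driverRef = connectWithRetries(maxAttempts = 5, initialDelayMs = 1000)(connectToDriver())
{code}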



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28084) LOAD DATA command resolving the partition column name considering case sensitive manner

2019-06-17 Thread Sujith Chacko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865758#comment-16865758
 ] 

Sujith Chacko commented on SPARK-28084:
---

In the INSERT command, the partition column is resolved using the resolver, which 
resolves the names based on the spark.sql.caseSensitive property. The same logic can 
be applied for resolving the partition column names in the LOAD DATA command.

I am working on this and will raise a PR soon.
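A minimal, hypothetical sketch of the approach described above (the helper name and surrounding code are illustrative, not the actual LOAD DATA implementation): match the user-supplied partition spec keys against the table's partition columns with a comparison that respects spark.sql.caseSensitive.

{code:scala}
import org.apache.spark.sql.SparkSession

// Resolve partition spec keys against the table's partition columns, respecting
// spark.sql.caseSensitive, the same way INSERT resolves them via the resolver.
def resolvePartitionSpec(
    spark: SparkSession,
    spec: Map[String, String],          // e.g. Map("COUNTRY" -> "US") from LOAD DATA ... PARTITION
    partitionColumns: Seq[String]       // e.g. Seq("country") from the table metadata
  ): Map[String, String] = {
  val caseSensitive = spark.conf.get("spark.sql.caseSensitive", "false").toBoolean
  val resolver: (String, String) => Boolean =
    if (caseSensitive) _ == _ else _.equalsIgnoreCase(_)
  spec.map { case (key, value) =>
    val resolved = partitionColumns
      .find(col => resolver(col, key))
      .getOrElse(throw new IllegalArgumentException(s"$key is not a partition column"))
    resolved -> value
  }
}
{code}

With the default (case-insensitive) setting, {{resolvePartitionSpec(spark, Map("COUNTRY" -> "US"), Seq("country"))}} would yield {{Map("country" -> "US")}}.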

> LOAD DATA command resolving the partition column name considering case 
> sensitive manner 
> ---
>
> Key: SPARK-28084
> URL: https://issues.apache.org/jira/browse/SPARK-28084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Sujith Chacko
>Priority: Major
> Attachments: parition_casesensitive.PNG
>
>
> The LOAD DATA command resolves the partition column name in a case-sensitive 
> manner, whereas the INSERT command resolves it case-insensitively.
> Refer to the snapshot for more details.
> !image-2019-06-18-00-04-22-475.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24898) Adding spark.checkpoint.compress to the docs

2019-06-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-24898:
-

Assignee: Sandeep

> Adding spark.checkpoint.compress to the docs
> 
>
> Key: SPARK-24898
> URL: https://issues.apache.org/jira/browse/SPARK-24898
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Riccardo Corbella
>Assignee: Sandeep
>Priority: Trivial
> Fix For: 2.3.4, 2.4.4, 3.0.0
>
>
> Parameter *spark.checkpoint.compress* is not listed under configuration 
> properties.
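For reference, a minimal sketch of the setting in use; only the property name comes from this ticket, the application name and checkpoint directory are illustrative.

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Enable compression of RDD checkpoint data (the property this ticket wants documented).
val conf = new SparkConf()
  .setAppName("checkpoint-compress-example")
  .set("spark.checkpoint.compress", "true")

val spark = SparkSession.builder().config(conf).getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")   // illustrative path
{code}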



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24898) Adding spark.checkpoint.compress to the docs

2019-06-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24898:
--
Issue Type: Task  (was: Bug)

> Adding spark.checkpoint.compress to the docs
> 
>
> Key: SPARK-24898
> URL: https://issues.apache.org/jira/browse/SPARK-24898
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Riccardo Corbella
>Priority: Trivial
> Fix For: 2.3.4, 2.4.4, 3.0.0
>
>
> Parameter *spark.checkpoint.compress* is not listed under configuration 
> properties.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28084) LOAD DATA command resolving the partition column name considering case sensitive manner

2019-06-17 Thread Sujith Chacko (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujith Chacko updated SPARK-28084:
--
Attachment: parition_casesensitive.PNG

> LOAD DATA command resolving the partition column name considering case 
> sensitive manner 
> ---
>
> Key: SPARK-28084
> URL: https://issues.apache.org/jira/browse/SPARK-28084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Sujith Chacko
>Priority: Major
> Attachments: parition_casesensitive.PNG
>
>
> The LOAD DATA command resolves the partition column name in a case-sensitive 
> manner, whereas the INSERT command resolves it case-insensitively.
> Refer to the snapshot for more details.
> !image-2019-06-18-00-04-22-475.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28084) LOAD DATA command resolving the partition column name considering case sensitive manner

2019-06-17 Thread Sujith Chacko (JIRA)
Sujith Chacko created SPARK-28084:
-

 Summary: LOAD DATA command resolving the partition column name 
considering case sensitive manner 
 Key: SPARK-28084
 URL: https://issues.apache.org/jira/browse/SPARK-28084
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: Sujith Chacko
 Attachments: parition_casesensitive.PNG

The LOAD DATA command resolves the partition column name in a case-sensitive 
manner, whereas the INSERT command resolves it case-insensitively.

Refer to the snapshot for more details.

!image-2019-06-18-00-04-22-475.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865714#comment-16865714
 ] 

Liang-Chi Hsieh edited comment on SPARK-28058 at 6/17/19 3:59 PM:
--

[~hyukjin.kwon] Do you mean this is suspected to be a bug:

{code}
scala> spark.read.option("header", "true").option("mode", 
"DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
+--+--+
|fruit |color |
+--+--+
|apple |red   |
|banana|yellow|
|orange|orange|
|xxx   |null  |
+--+--+
{code}

In this case, the reader should read two columns. But the corrupted record has 
only one column. Reasonably, it should be dropped as a malformed one. But we 
see the missing column is filled with null.

This seems to be inherited from the Univocity parser, since we use 
{{CsvParserSettings.selectIndexes}} to do field selection. In the above case, the 
parser returns two tokens where the second token is just null. I'm not sure whether 
this is known behavior of the Univocity parser or a bug in it.
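To make the suspicion above concrete, here is a minimal standalone sketch against the Univocity parser (which Spark bundles); the expected output is taken from the observation in this comment, not independently verified.

{code:scala}
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings()
// Field selection, as Spark does when only a subset of the columns is read.
settings.selectIndexes(Integer.valueOf(0), Integer.valueOf(1))

val parser = new CsvParser(settings)
val tokens = parser.parseLine("xxx")   // malformed row: a single field instead of two

// Reportedly tokens is Array("xxx", null): the missing field comes back as null,
// so the caller cannot tell this row apart from one with a legitimately null column.
println(tokens.mkString("[", ", ", "]"))
{code}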


was (Author: viirya):
[~hyukjin.kwon] Do you mean this is suspected to be a bug:

{code}
scala> spark.read.option("header", "true").option("mode", 
"DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
+--+--+
|fruit |color |
+--+--+
|apple |red   |
|banana|yellow|
|orange|orange|
|xxx   |null  |
+--+--+
{code}

In this case, the reader should read two columns. But the corrupted record has 
only one column. Reasonably, it should be dropped as a malformed one. But we 
see the missing column is filled with null.

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +--+--+-++
>   
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +--+
> |fruit |
> +--+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +--+--+
> |fruit |color |
> +--+--+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +--+--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +--+--+-+
> |fruit |color |price|
> +--+--+-+
> |apple |red   |1|
> |banana|yellow|2|
> |orange|orange|3|
> |xxx   |null  |null |
> +--+--+-+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +--+--+-++
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Updated] (SPARK-28082) Add a note to DROPMALFORMED mode of CSV for column pruning

2019-06-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28082:
--
Component/s: Documentation

> Add a note to DROPMALFORMED mode of CSV for column pruning
> --
>
> Key: SPARK-28082
> URL: https://issues.apache.org/jira/browse/SPARK-28082
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Trivial
>
> This is inspired by SPARK-28058.
> When using {{DROPMALFORMED}} mode, corrupted records aren't dropped if 
> malformed columns aren't read. This behavior is due to CSV parser column 
> pruning. Current doc of {{DROPMALFORMED}} doesn't mention the effect of 
> column pruning. Users will be confused by the fact that {{DROPMALFORMED}} 
> mode doesn't work as expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25341) Support rolling back a shuffle map stage and re-generate the shuffle files

2019-06-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25341:
--
Affects Version/s: (was: 2.4.0)
   3.0.0

> Support rolling back a shuffle map stage and re-generate the shuffle files
> --
>
> Key: SPARK-25341
> URL: https://issues.apache.org/jira/browse/SPARK-25341
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> This is a follow up of https://issues.apache.org/jira/browse/SPARK-23243
> To completely fix that problem, Spark needs to be able to rollback a shuffle 
> map stage and rerun all the map tasks.
> According to https://github.com/apache/spark/pull/9214 , Spark doesn't 
> support it currently, as in shuffle writing "first write wins".
> Since overwriting shuffle files is hard, we can extend the shuffle id to 
> include a "shuffle generation number". Then the reduce task can specify which 
> generation of shuffle it wants to read. 
> https://github.com/apache/spark/pull/6648 seems in the right direction.
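A purely illustrative sketch of the generation-number idea (none of these types exist in Spark; they only name the concept):

{code:scala}
// Identify shuffle output by (shuffleId, generation) so that a rolled-back map stage
// can re-generate its files under a new generation instead of overwriting old ones.
final case class ShuffleGenerationId(shuffleId: Int, generation: Int)

// A reducer would then name the generation it wants to read.
final case class ReduceFetchRequest(shuffle: ShuffleGenerationId, reduceId: Int)

def nextGeneration(current: ShuffleGenerationId): ShuffleGenerationId =
  current.copy(generation = current.generation + 1)
{code}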



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28074) [SS] Document caveats on using multiple stateful operations in single query

2019-06-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28074:
--
Component/s: Documentation

> [SS] Document caveats on using multiple stateful operations in single query
> ---
>
> Key: SPARK-28074
> URL: https://issues.apache.org/jira/browse/SPARK-28074
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Please refer to 
> [https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a@%3Cdev.spark.apache.org%3E]
>  for the rationale behind this issue.
> tl;dr. Using multiple stateful operators without fully understanding how the 
> global watermark affects the query brings correctness issues.
> While we may not want to fully block such operations, since some users could 
> leverage them with an understanding of the global watermark, end users should be 
> warned about this correctness-breaking possibility and offered alternative(s).
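For illustration, a minimal spark-shell sketch of a query with more than one stateful operator, the kind of query the proposed documentation should warn about; the rate source and the chosen thresholds are arbitrary, and whether the results are acceptable depends on how the shared global watermark interacts with both operators.

{code:scala}
import org.apache.spark.sql.functions._

val events = spark.readStream
  .format("rate").option("rowsPerSecond", "10").load()
  .withWatermark("timestamp", "10 minutes")

// Stateful operator #1: streaming deduplication. Stateful operator #2: windowed aggregation.
// Both share the single global watermark, which is the source of the caveats above.
val counts = events
  .dropDuplicates("value", "timestamp")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()

val query = counts.writeStream
  .outputMode("update")
  .format("console")
  .start()
{code}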



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28066) Optimize UTF8String.trim() for common case of no whitespace

2019-06-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28066.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24884

> Optimize UTF8String.trim() for common case of no whitespace
> ---
>
> Key: SPARK-28066
> URL: https://issues.apache.org/jira/browse/SPARK-28066
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 3.0.0
>
>
> {{UTF8String.trim()}} allocates a new object even if the string has no 
> whitespace, when it can just return itself. A simple check for this case 
> makes the method about 3x faster in the common case.
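A minimal sketch of the fast-path idea on a plain String; Spark's actual change lives in UTF8String and works on bytes, this only illustrates the "return the same object when there is nothing to trim" shortcut.

{code:scala}
// If no leading or trailing spaces are found, return the original object
// instead of allocating a new one.
def trimNoAlloc(s: String): String = {
  var start = 0
  var end = s.length
  while (start < end && s.charAt(start) == ' ') start += 1
  while (end > start && s.charAt(end - 1) == ' ') end -= 1
  if (start == 0 && end == s.length) s              // common case: nothing to trim
  else s.substring(start, end)
}
{code}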



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865714#comment-16865714
 ] 

Liang-Chi Hsieh commented on SPARK-28058:
-

[~hyukjin.kwon] Do you mean this is suspected to be a bug:

{code}
scala> spark.read.option("header", "true").option("mode", 
"DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
+--+--+
|fruit |color |
+--+--+
|apple |red   |
|banana|yellow|
|orange|orange|
|xxx   |null  |
+--+--+
{code}

In this case, the reader should read two columns. But the corrupted record has 
only one column. Reasonably, it should be dropped as a malformed one. But we 
see the missing column is filled with null.

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +--+--+-++
>   
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +--+
> |fruit |
> +--+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +--+--+
> |fruit |color |
> +--+--+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +--+--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +--+--+-+
> |fruit |color |price|
> +--+--+-+
> |apple |red   |1|
> |banana|yellow|2|
> |orange|orange|3|
> |xxx   |null  |null |
> +--+--+-+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +--+--+-++
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28083) ANSI SQL: LIKE predicate: ESCAPE clause

2019-06-17 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28083:
---

 Summary: ANSI SQL: LIKE predicate: ESCAPE clause
 Key: SPARK-28083
 URL: https://issues.apache.org/jira/browse/SPARK-28083
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


Format:
{noformat}
<like predicate> ::=
  <character like predicate>
  | <octet like predicate>
<character like predicate> ::=
  <row value predicand> <character like predicate part 2>
<character like predicate part 2> ::=
  [ NOT ] LIKE <character pattern> [ ESCAPE <escape character> ]
<character pattern> ::=
  <character value expression>
<escape character> ::=
  <character value expression>
<octet like predicate> ::=
  <row value predicand> <octet like predicate part 2>
<octet like predicate part 2> ::=
  [ NOT ] LIKE <octet pattern> [ ESCAPE <escape octet> ]
<octet pattern> ::=
  <octet value expression>
<escape octet> ::=
  <octet value expression>
{noformat}
 

[https://www.postgresql.org/docs/11/functions-matching.html]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28082) Add a note to DROPMALFORMED mode of CSV for column pruning

2019-06-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28082:


Assignee: Apache Spark

> Add a note to DROPMALFORMED mode of CSV for column pruning
> --
>
> Key: SPARK-28082
> URL: https://issues.apache.org/jira/browse/SPARK-28082
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Trivial
>
> This is inspired by SPARK-28058.
> When using {{DROPMALFORMED}} mode, corrupted records aren't dropped if 
> malformed columns aren't read. This behavior is due to CSV parser column 
> pruning. Current doc of {{DROPMALFORMED}} doesn't mention the effect of 
> column pruning. Users will be confused by the fact that {{DROPMALFORMED}} 
> mode doesn't work as expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28082) Add a note to DROPMALFORMED mode of CSV for column pruning

2019-06-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28082:


Assignee: (was: Apache Spark)

> Add a note to DROPMALFORMED mode of CSV for column pruning
> --
>
> Key: SPARK-28082
> URL: https://issues.apache.org/jira/browse/SPARK-28082
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Trivial
>
> This is inspired by SPARK-28058.
> When using {{DROPMALFORMED}} mode, corrupted records aren't dropped if 
> malformed columns aren't read. This behavior is due to CSV parser column 
> pruning. Current doc of {{DROPMALFORMED}} doesn't mention the effect of 
> column pruning. Users will be confused by the fact that {{DROPMALFORMED}} 
> mode doesn't work as expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28082) Add a note to DROPMALFORMED mode of CSV for column pruning

2019-06-17 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865694#comment-16865694
 ] 

Apache Spark commented on SPARK-28082:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/24894

> Add a note to DROPMALFORMED mode of CSV for column pruning
> --
>
> Key: SPARK-28082
> URL: https://issues.apache.org/jira/browse/SPARK-28082
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Trivial
>
> This is inspired by SPARK-28058.
> When using {{DROPMALFORMED}} mode, corrupted records aren't dropped if 
> malformed columns aren't read. This behavior is due to CSV parser column 
> pruning. Current doc of {{DROPMALFORMED}} doesn't mention the effect of 
> column pruning. Users will be confused by the fact that {{DROPMALFORMED}} 
> mode doesn't work as expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865695#comment-16865695
 ] 

Liang-Chi Hsieh commented on SPARK-28058:
-

[~stwhit] Thanks for letting us know that!

Although it is in the migration guide, for new users it would be good to have 
the note in the {{DROPMALFORMED}} doc as well. I filed SPARK-28082 to track the doc 
improvement.



> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +--+--+-++
>   
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +--+
> |fruit |
> +--+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +--+--+
> |fruit |color |
> +--+--+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +--+--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +--+--+-+
> |fruit |color |price|
> +--+--+-+
> |apple |red   |1|
> |banana|yellow|2|
> |orange|orange|3|
> |xxx   |null  |null |
> +--+--+-+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +--+--+-++
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28082) Add a note to DROPMALFORMED mode of CSV for column pruning

2019-06-17 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-28082:
---

 Summary: Add a note to DROPMALFORMED mode of CSV for column pruning
 Key: SPARK-28082
 URL: https://issues.apache.org/jira/browse/SPARK-28082
 Project: Spark
  Issue Type: Documentation
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


This is inspired by SPARK-28058.

When using {{DROPMALFORMED}} mode, corrupted records aren't dropped if 
malformed columns aren't read. This behavior is due to CSV parser column 
pruning. Current doc of {{DROPMALFORMED}} doesn't mention the effect of column 
pruning. Users will be confused by the fact that {{DROPMALFORMED}} mode doesn't 
work as expected.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27989) Add retries on the connection to the driver

2019-06-17 Thread Jose Luis Pedrosa (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Luis Pedrosa updated SPARK-27989:
--
Component/s: (was: Kubernetes)

> Add retries on the connection to the driver
> ---
>
> Key: SPARK-27989
> URL: https://issues.apache.org/jira/browse/SPARK-27989
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Jose Luis Pedrosa
>Priority: Minor
>
> Due to Java caching negative DNS resolutions, failed lookups are never retried.
> Any DNS failure when trying to connect to the driver makes a connection from 
> that process impossible.
> This happens especially in Kubernetes, where network setup of pods can take 
> some time.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Stuart White (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865679#comment-16865679
 ] 

Stuart White edited comment on SPARK-28058 at 6/17/19 3:08 PM:
---

Thank you both for your responses.
 
I now see that at the [Spark SQL Upgrading 
Guide|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html], 
under the [Upgrading From Spark SQL 2.3 to 
2.4|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-23-to-24]
 section, it states:

{noformat}
In version 2.3 and earlier, CSV rows are considered as malformed if at least 
one column 
value in the row is malformed. CSV parser dropped such rows in the 
DROPMALFORMED mode or
outputs an error in the FAILFAST mode. Since Spark 2.4, CSV row is considered 
as malformed
only when it contains malformed column values requested from CSV datasource, 
other values
can be ignored. As an example, CSV file contains the “id,name” header and one 
row “1234”.
In Spark 2.4, selection of the id column consists of a row with one column 
value 1234 but
in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the 
previous
behavior, set spark.sql.csv.parser.columnPruning.enabled to false.
{noformat}

I had not noticed that until you called the 
{{spark.sql.csv.parser.columnPruning.enabled}} option to my attention.

Thanks again for the help!


was (Author: stwhit):
Thank you both for your responses.
 
I now see that at the [Spark SQL Upgrading 
Guide|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html], 
under the [Upgrading From Spark SQL 2.3 to 
2.4|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-23-to-24]
 section, it states:

{noformat}
In version 2.3 and earlier, CSV rows are considered as malformed if at least 
one column value in the row is malformed. CSV parser dropped such rows in the 
DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, 
CSV row is considered as malformed only when it contains malformed column 
values requested from CSV datasource, other values can be ignored. As an 
example, CSV file contains the “id,name” header and one row “1234”. In Spark 
2.4, selection of the id column consists of a row with one column value 1234 
but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore 
the previous behavior, set spark.sql.csv.parser.columnPruning.enabled to false.
{noformat}

I had not noticed that until you called the 
{{spark.sql.csv.parser.columnPruning.enabled}} option to my attention.

Thanks again for the help!

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +--+--+-++
>   
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +--+
> |fruit |
> +--+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +--+--+
> |fruit |color |
> +--+--+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +--+--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +--+--+-+
> |fruit |color |price|
> +--+--+-+

[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Stuart White (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865679#comment-16865679
 ] 

Stuart White commented on SPARK-28058:
--

Thank you both for your responses.
 
I now see that at the [Spark SQL Upgrading 
Guide|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html], 
under the [Upgrading From Spark SQL 2.3 to 
2.4|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-23-to-24]
 section, it states:

{noformat}
In version 2.3 and earlier, CSV rows are considered as malformed if at least 
one column value in the row is malformed. CSV parser dropped such rows in the 
DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, 
CSV row is considered as malformed only when it contains malformed column 
values requested from CSV datasource, other values can be ignored. As an 
example, CSV file contains the “id,name” header and one row “1234”. In Spark 
2.4, selection of the id column consists of a row with one column value 1234 
but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore 
the previous behavior, set spark.sql.csv.parser.columnPruning.enabled to false.
{noformat}

I had not noticed that until you called the 
{{spark.sql.csv.parser.columnPruning.enabled}} option to my attention.

Thanks again for the help!

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +--+--+-++
>   
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +--+
> |fruit |
> +--+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +--+--+
> |fruit |color |
> +--+--+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +--+--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +--+--+-+
> |fruit |color |price|
> +--+--+-+
> |apple |red   |1|
> |banana|yellow|2|
> |orange|orange|3|
> |xxx   |null  |null |
> +--+--+-+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +--+--+-++
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28058:


Assignee: Apache Spark

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Assignee: Apache Spark
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +--+--+-++
>   
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +--+
> |fruit |
> +--+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +--+--+
> |fruit |color |
> +--+--+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +--+--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +--+--+-+
> |fruit |color |price|
> +--+--+-+
> |apple |red   |1|
> |banana|yellow|2|
> |orange|orange|3|
> |xxx   |null  |null |
> +--+--+-+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +--+--+-++
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28058:


Assignee: (was: Apache Spark)

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +--+--+-++
>   
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +--+
> |fruit |
> +--+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +--+--+
> |fruit |color |
> +--+--+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +--+--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +--+--+-+
> |fruit |color |price|
> +--+--+-+
> |apple |red   |1|
> |banana|yellow|2|
> |orange|orange|3|
> |xxx   |null  |null |
> +--+--+-+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +--+--+-++
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28081) word2vec 'large' count value too low for very large corpora

2019-06-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28081:


Assignee: Sean Owen  (was: Apache Spark)

> word2vec 'large' count value too low for very large corpora
> ---
>
> Key: SPARK-28081
> URL: https://issues.apache.org/jira/browse/SPARK-28081
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> The word2vec implementation operates on word counts, and uses a hard-coded 
> value of 1e9 to mean "a very large count, larger than any actual count". 
> However, this causes the logic to fail if, in fact, a large corpus has some 
> words that really do occur more than this many times. We can probably improve 
> the implementation to better handle very large counts in general.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28081) word2vec 'large' count value too low for very large corpora

2019-06-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28081:


Assignee: Apache Spark  (was: Sean Owen)

> word2vec 'large' count value too low for very large corpora
> ---
>
> Key: SPARK-28081
> URL: https://issues.apache.org/jira/browse/SPARK-28081
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> The word2vec implementation operates on word counts, and uses a hard-coded 
> value of 1e9 to mean "a very large count, larger than any actual count". 
> However, this causes the logic to fail if, in fact, a large corpus has some 
> words that really do occur more than this many times. We can probably improve 
> the implementation to better handle very large counts in general.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865664#comment-16865664
 ] 

Liang-Chi Hsieh commented on SPARK-28058:
-

Although this isn't a bug, I think it might be worth adding a note to the current 
doc to explain/clarify it.

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +--+--+-++
>   
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +--+
> |fruit |
> +--+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +--+--+
> |fruit |color |
> +--+--+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +--+--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +--+--+-+
> |fruit |color |price|
> +--+--+-+
> |apple |red   |1|
> |banana|yellow|2|
> |orange|orange|3|
> |xxx   |null  |null |
> +--+--+-+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +--+--+-++
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28058) Reading csv with DROPMALFORMED sometimes doesn't drop malformed records

2019-06-17 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865650#comment-16865650
 ] 

Liang-Chi Hsieh commented on SPARK-28058:
-

This is due to CSV parser column pruning. You can disable it, and the behavior will 
be what you expect:
{code:java}
scala> spark.read.option("header", "true").option("mode", 
"DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
+--+
|fruit |
+--+
|apple |
|banana|
|orange|
|xxx   |
+--+

scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)

scala> spark.read.option("header", "true").option("mode", 
"DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
+--+
|fruit |
+--+
|apple |
|banana|
|orange|
+--+
{code}

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> ---
>
> Key: SPARK-28058
> URL: https://issues.apache.org/jira/browse/SPARK-28058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Stuart White
>Priority: Minor
>  Labels: CSV, csv, csvparser
>
> The spark sql csv reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +--+--+-++
>   
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +--+
> |fruit |
> +--+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +--+--+
> |fruit |color |
> +--+--+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +--+--+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +--+--+-+
> |fruit |color |price|
> +--+--+-+
> |apple |red   |1|
> |banana|yellow|2|
> |orange|orange|3|
> |xxx   |null  |null |
> +--+--+-+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +--+--+-++
> |fruit |color |price|quantity|
> +--+--+-++
> |apple |red   |1|3   |
> |banana|yellow|2|4   |
> |orange|orange|3|5   |
> +--+--+-++
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28081) word2vec 'large' count value too low for very large corpora

2019-06-17 Thread Sean Owen (JIRA)
Sean Owen created SPARK-28081:
-

 Summary: word2vec 'large' count value too low for very large 
corpora
 Key: SPARK-28081
 URL: https://issues.apache.org/jira/browse/SPARK-28081
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.4.3
Reporter: Sean Owen
Assignee: Sean Owen


The word2vec implementation operates on word counts, and uses a hard-coded 
value of 1e9 to mean "a very large count, larger than any actual count". 
However, this causes the logic to fail if, in fact, a large corpus has some 
words that really do occur more than this many times. We can probably improve 
the implementation to better handle very large counts in general.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28076) String Functions: SUBSTRING support regular expression

2019-06-17 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865608#comment-16865608
 ] 

Yuming Wang commented on SPARK-28076:
-

cc [~lipzhu] Would you like to pick this up?

> String Functions: SUBSTRING support regular expression
> --
>
> Key: SPARK-28076
> URL: https://issues.apache.org/jira/browse/SPARK-28076
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function||Return Type||Description||Example||Result||
> |{{substring(_string_}} from _{{pattern}}_)|{{text}}|Extract substring 
> matching POSIX regular expression. See [Section 
> 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more 
> information on pattern matching.|{{substring('Thomas' from '...$')}}|{{mas}}|
> |{{substring(_string_}} from _{{pattern}}_ for _{{escape}}_)|{{text}}|Extract 
> substring matching SQL regular expression. See [Section 
> 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more 
> information on pattern matching.|{{substring('Thomas' from '%#"o_a#"_' for 
> '#')}}|{{oma}}|
> For example:
> {code:sql}
> -- T581 regular expression substring (with SQL's bizarre regexp syntax)
> SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
> {code}
> https://www.postgresql.org/docs/11/functions-string.html
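For comparison, a spark-shell sketch of what is expressible today: Spark's existing {{regexp_extract}} handles extraction with Java regular expressions, while the SQL-style {{SUBSTRING(string FROM pattern FOR escape)}} form requested here has no current equivalent; the example assumes the default shell {{spark}} session.

{code:scala}
// Group 0 is the whole match, so this returns "mas", similar to
// the Postgres example substring('Thomas' from '...$').
spark.sql("SELECT regexp_extract('Thomas', '...$', 0) AS tail").show()
{code}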



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28012) Hive UDF supports struct type foldable expression

2019-06-17 Thread dzcxzl (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-28012:
---
Summary: Hive UDF supports struct type foldable expression  (was: Hive UDF 
supports literal struct type)

> Hive UDF supports struct type foldable expression
> -
>
> Key: SPARK-28012
> URL: https://issues.apache.org/jira/browse/SPARK-28012
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: dzcxzl
>Priority: Trivial
>
> Currently, when a Hive UDF is called with a struct-type argument, an 
> exception is thrown.
> No handler for Hive UDF 'xxxUDF': java.lang.RuntimeException: Hive doesn't 
> support the constant type [StructType(StructField(name,StringType,true), 
> StructField(value,DecimalType(3,1),true))]
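
A hypothetical reproduction sketch ({{xxxUDF}} is the placeholder from the error 
message and stands in for a real Hive UDF registered in the session; the struct 
fields mirror the types in the exception):

{code:java}
import org.apache.spark.sql.SparkSession

object HiveUdfStructRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .enableHiveSupport()
      .appName("hive-udf-struct-repro")
      .getOrCreate()
    // Passing a foldable struct literal to a Hive UDF currently fails with
    // "Hive doesn't support the constant type [StructType(...)]".
    spark.sql(
      "SELECT xxxUDF(named_struct('name', 'a', 'value', CAST(1.5 AS DECIMAL(3,1))))"
    ).show()
    spark.stop()
  }
}
{code}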



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28012) Hive UDF supports literal struct type

2019-06-17 Thread dzcxzl (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-28012:
---
Description: 
Currently, when a Hive UDF is called with a struct-type argument, an exception 
is thrown.

No handler for Hive UDF 'xxxUDF': java.lang.RuntimeException: Hive doesn't 
support the constant type [StructType(StructField(name,StringType,true), 
StructField(value,DecimalType(3,1),true))]

  was:
Currently, when a Hive UDF is called with a literal struct-type argument, an 
error is reported.

No handler for Hive UDF 'xxxUDF': java.lang.RuntimeException: Hive doesn't 
support the constant type [StructType(StructField(name,StringType,true), 
StructField(value,DecimalType(3,1),true))]


> Hive UDF supports literal struct type
> -
>
> Key: SPARK-28012
> URL: https://issues.apache.org/jira/browse/SPARK-28012
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: dzcxzl
>Priority: Trivial
>
> Currently, when a Hive UDF is called with a struct-type argument, an 
> exception is thrown.
> No handler for Hive UDF 'xxxUDF': java.lang.RuntimeException: Hive doesn't 
> support the constant type [StructType(StructField(name,StringType,true), 
> StructField(value,DecimalType(3,1),true))]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2019-06-17 Thread Dominic Ricard (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865592#comment-16865592
 ] 

Dominic Ricard commented on SPARK-21067:


In 2.4, the problem also affects Parquet table creation, so we had to move all 
our CTAS statements to a separate process that does not use the Thrift server.
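
The description quoted below notes the same statements succeed when issued 
through spark.sql() in a plain Spark application rather than through the Thrift 
server; a minimal sketch of that workaround (database and table names taken from 
the repro below, Hive support assumed):

{code:java}
import org.apache.spark.sql.SparkSession

object CtasOutsideThriftServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .enableHiveSupport()
      .appName("ctas-workaround")
      .getOrCreate()
    // Same CTAS as in the repro, but run from a regular Spark job instead of
    // through the Thrift server.
    spark.sql("CREATE DATABASE IF NOT EXISTS dricard")
    spark.sql("CREATE TABLE dricard.test AS SELECT 1 AS col1")
    spark.sql("SELECT * FROM dricard.test").show()
    spark.stop()
  }
}
{code}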

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0, 2.4.0, 2.4.3
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>Priority: Major
> Attachments: SPARK-21067.patch
>
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which states that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> 

[jira] [Created] (SPARK-28080) There is a problem to download and watch offline the history of an application with multiple attempts due to UI inconsistency

2019-06-17 Thread Gal Weiss (JIRA)
Gal Weiss created SPARK-28080:
-

 Summary: There is a problem to download and watch offline the 
history of an application with multiple attempts due to UI inconsistency
 Key: SPARK-28080
 URL: https://issues.apache.org/jira/browse/SPARK-28080
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.1
 Environment: I used the spark-2.4.3-bin-hadoop2.7 and 
spark-2.3.1-bin-hadoop2.7 packages from 
[https://spark.apache.org/downloads.html]

Running the history server locally as-is (using default values) on Ubuntu 
16.04.4 under WSL (Windows Subsystem for Linux) on my Windows 10 machine.

Browser used is Firefox 67.0.2 (64-bit) for Windows
Reporter: Gal Weiss


Overview:

If you download a Spark application's attempt history and try to view any 
attempt other than the last one locally, it fails because of a UI inconsistency.

The inconsistency is that in the Spark history UI, the "App ID" column is 
clickable and always takes you to the application's *last attempt*, so if you 
downloaded only the first attempt, you get an "application not found" error.

 

How to reproduce:
 # Open any Spark history server (if using Azure HDInsight, the address 
would be https://.azurehdinsight.net/sparkhistory/)
 # Look for an application that has multiple attempts (i.e. attempt ID > 1)
 # Find the *first* attempt of this application and download it using the 
"download" button in the Event column. Save it in your local Spark history 
folder (default: /tmp/spark-events)
 # Start a local Spark history server (typically using the 
start-history-server.sh script)
 # Browse to the local history server and look for the application whose 
history you downloaded.
 # Click the application name in the "App ID" column; you will get the 
following error:
"Application  not found."

Why?

Because the remote history server assumes that the history files for all 
attempts are present, the "App ID" column points to the latest attempt of the 
application, while the "Attempt ID" column points to the specific attempt.

But if an application has two attempts and we only want to investigate the 
first one, we download that attempt locally, open it with the local history 
server, and intuitively click the link in the "App ID" column; that link 
actually points to the second attempt, which we haven't even downloaded. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2019-06-17 Thread Dominic Ricard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Ricard updated SPARK-21067:
---
Affects Version/s: 2.4.3

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0, 2.4.0, 2.4.3
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>Priority: Major
> Attachments: SPARK-21067.patch
>
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which states that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
> at org.apache.spark.sql.Dataset.(Dataset.scala:185)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
> at 

[jira] [Updated] (SPARK-28078) String Functions: Add support other 4 REGEXP functions

2019-06-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28078:

Summary: String Functions: Add support other 4 REGEXP functions  (was: 
String Functions: Add support other 4 REGEXP_ functions)

> String Functions: Add support other 4 REGEXP functions
> --
>
> Key: SPARK-28078
> URL: https://issues.apache.org/jira/browse/SPARK-28078
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {{regexp_match}}, {{regexp_matches}}, {{regexp_split_to_array}}, and 
> {{regexp_split_to_table}}
> [https://www.postgresql.org/docs/11/functions-string.html]
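
For context, Spark already covers part of this surface with {{split}} and 
{{regexp_extract}}; a minimal sketch of the closest existing building blocks 
(values are illustrative):

{code:java}
import org.apache.spark.sql.SparkSession

object RegexpBuildingBlocks {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("regexp-building-blocks")
      .getOrCreate()
    // Roughly what regexp_split_to_array would return: split on a pattern.
    spark.sql("SELECT split('a,b,,c', ',') AS parts").show(false)
    // Roughly what regexp_match would return for one capture group.
    spark.sql("SELECT regexp_extract('barbeque', '(bar)(beque)', 1) AS g1").show()
    spark.stop()
  }
}
{code}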



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28079) CSV fails to detect corrupt record unless "columnNameOfCorruptRecord" is manually added to the schema

2019-06-17 Thread F Jimenez (JIRA)
F Jimenez created SPARK-28079:
-

 Summary: CSV fails to detect corrupt record unless 
"columnNameOfCorruptRecord" is manually added to the schema
 Key: SPARK-28079
 URL: https://issues.apache.org/jira/browse/SPARK-28079
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.3, 2.3.2
Reporter: F Jimenez


When reading a CSV with mode = "PERMISSIVE", corrupt records are not flagged as 
such and are read in. The only way to get them flagged is to manually set 
"columnNameOfCorruptRecord" AND manually set a schema that includes this 
column. Example:
{code:java}
// Second row has a 4th column that is not declared in the header/schema
val csvText = s"""
 | FieldA, FieldB, FieldC
 | a1,b1,c1
 | a2,b2,c2,d*""".stripMargin

val csvFile = new File("/tmp/file.csv")
FileUtils.write(csvFile, csvText)

val reader = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corrupt")
  .schema("corrupt STRING, fieldA STRING, fieldB STRING, fieldC STRING")

reader.load(csvFile.getAbsolutePath).show(truncate = false)
{code}
This produces the correct result:
{code:java}
+------------+------+------+------+
|corrupt     |fieldA|fieldB|fieldC|
+------------+------+------+------+
|null        | a1   |b1    |c1    |
| a2,b2,c2,d*| a2   |b2    |c2    |
+------------+------+------+------+
{code}
However removing the "schema" option and going:
{code:java}
val reader = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "corrupt")

reader.load(csvFile.getAbsolutePath).show(truncate = false)
{code}
Yields:
{code:java}
+-------+-------+-------+
| FieldA| FieldB| FieldC|
+-------+-------+-------+
| a1    |b1     |c1     |
| a2    |b2     |c2     |
+-------+-------+-------+
{code}
The fourth value "d*" in the second row has been removed and the row has not 
been marked as corrupt.
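
A minimal workaround sketch (path and column name follow the example above; it 
assumes schema inference is acceptable): infer the schema once, prepend the 
corrupt-record column, and read again so malformed rows are surfaced without 
hand-writing every field:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object CsvCorruptWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csv-corrupt-workaround")
      .getOrCreate()
    val path = "/tmp/file.csv"
    // First pass: let Spark infer the data columns.
    val inferred = spark.read.option("header", "true").csv(path).schema
    // Second pass: add the corrupt-record column so malformed rows are flagged.
    val withCorrupt = StructType(StructField("corrupt", StringType, nullable = true) +: inferred.fields)
    spark.read
      .format("csv")
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "corrupt")
      .schema(withCorrupt)
      .load(path)
      .show(truncate = false)
    spark.stop()
  }
}
{code}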

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28078) String Functions: Add support other 4 REGEXP_ functions

2019-06-17 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28078:
---

 Summary: String Functions: Add support other 4 REGEXP_ functions
 Key: SPARK-28078
 URL: https://issues.apache.org/jira/browse/SPARK-28078
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


{{regexp_match}}, {{regexp_matches}}, {{regexp_split_to_array}} and 
{{regexp_split_to_table}}

[https://www.postgresql.org/docs/11/functions-string.html]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28075) String Functions: Enhance TRIM function

2019-06-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28075:

Summary: String Functions: Enhance TRIM function  (was: Enhance TRIM 
function)

> String Functions: Enhance TRIM function
> ---
>
> Key: SPARK-28075
> URL: https://issues.apache.org/jira/browse/SPARK-28075
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Add support for the {{TRIM(BOTH/LEADING/TRAILING FROM str)}} format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28077) String Functions: Add support OVERLAY

2019-06-17 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28077:
---

 Summary: String Functions: Add support OVERLAY
 Key: SPARK-28077
 URL: https://issues.apache.org/jira/browse/SPARK-28077
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


||Function||Return Type||Description||Example||Result||
|{{overlay(_string_ placing _string_ from }}{{int}}{{[for 
{{int}}])}}|{{text}}|Replace substring|{{overlay('Txxxxas' placing 'hom' from 2 
for 4)}}|{{Thomas}}|

For example:
{code:sql}
SELECT OVERLAY('abcdef' PLACING '45' FROM 4) AS "abc45f";

SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5) AS "yabadaba";

SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5 FOR 0) AS "yabadabadoo";

SELECT OVERLAY('babosa' PLACING 'ubb' FROM 2 FOR 4) AS "bubba";
{code}
https://www.postgresql.org/docs/11/functions-string.html
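
Until OVERLAY is available, the same result can be composed from functions Spark 
already has; a rough sketch of the first example (1-based positions as in SQL):

{code:java}
import org.apache.spark.sql.SparkSession

object OverlayEquivalentSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("overlay-sketch")
      .getOrCreate()
    // overlay('Txxxxas' placing 'hom' from 2 for 4) is equivalent to
    // concat(substring(s, 1, pos - 1), replacement, substring(s, pos + len)).
    spark.sql(
      "SELECT concat(substring('Txxxxas', 1, 2 - 1), 'hom', substring('Txxxxas', 2 + 4)) AS thomas"
    ).show() // Thomas
    spark.stop()
  }
}
{code}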




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28076) String Functions: Support regular expression substring

2019-06-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28076:

Summary: String Functions: Support regular expression substring  (was: 
Support regular expression substring)

> String Functions: Support regular expression substring
> --
>
> Key: SPARK-28076
> URL: https://issues.apache.org/jira/browse/SPARK-28076
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function||Return Type||Description||Example||Result||
> |{{substring(_string_}} from _{{pattern}}_)|{{text}}|Extract substring 
> matching POSIX regular expression. See [Section 
> 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more 
> information on pattern matching.|{{substring('Thomas' from '...$')}}|{{mas}}|
> |{{substring(_string_}} from _{{pattern}}_ for _{{escape}}_)|{{text}}|Extract 
> substring matching SQL regular expression. See [Section 
> 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more 
> information on pattern matching.|{{substring('Thomas' from '%#"o_a#"_' for 
> '#')}}|{{oma}}|
> For example:
> {code:sql}
> -- T581 regular expression substring (with SQL's bizarre regexp syntax)
> SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
> {code}
> https://www.postgresql.org/docs/11/functions-string.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28076) String Functions: SUBSTRING support regular expression

2019-06-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28076:

Summary: String Functions: SUBSTRING support regular expression  (was: 
String Functions: Support regular expression substring)

> String Functions: SUBSTRING support regular expression
> --
>
> Key: SPARK-28076
> URL: https://issues.apache.org/jira/browse/SPARK-28076
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function||Return Type||Description||Example||Result||
> |{{substring(_string_}} from _{{pattern}}_)|{{text}}|Extract substring 
> matching POSIX regular expression. See [Section 
> 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more 
> information on pattern matching.|{{substring('Thomas' from '...$')}}|{{mas}}|
> |{{substring(_string_}} from _{{pattern}}_ for _{{escape}}_)|{{text}}|Extract 
> substring matching SQL regular expression. See [Section 
> 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more 
> information on pattern matching.|{{substring('Thomas' from '%#"o_a#"_' for 
> '#')}}|{{oma}}|
> For example:
> {code:sql}
> -- T581 regular expression substring (with SQL's bizarre regexp syntax)
> SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
> {code}
> https://www.postgresql.org/docs/11/functions-string.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28076) Support regular expression substring

2019-06-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28076:

Description: 
||Function||Return Type||Description||Example||Result||
|{{substring(_string_}} from _{{pattern}}_)|{{text}}|Extract substring matching 
POSIX regular expression. See [Section 
9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more 
information on pattern matching.|{{substring('Thomas' from '...$')}}|{{mas}}|
|{{substring(_string_}} from _{{pattern}}_ for _{{escape}}_)|{{text}}|Extract 
substring matching SQL regular expression. See [Section 
9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more 
information on pattern matching.|{{substring('Thomas' from '%#"o_a#"_' for 
'#')}}|{{oma}}|

For example:
{code:sql}
-- T581 regular expression substring (with SQL's bizarre regexp syntax)
SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
{code}

https://www.postgresql.org/docs/11/functions-string.html

  was:
For example:

{code:sql}
-- T581 regular expression substring (with SQL's bizarre regexp syntax)
SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
{code}


> Support regular expression substring
> 
>
> Key: SPARK-28076
> URL: https://issues.apache.org/jira/browse/SPARK-28076
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function||Return Type||Description||Example||Result||
> |{{substring(_string_}} from _{{pattern}}_)|{{text}}|Extract substring 
> matching POSIX regular expression. See [Section 
> 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more 
> information on pattern matching.|{{substring('Thomas' from '...$')}}|{{mas}}|
> |{{substring(_string_}} from _{{pattern}}_ for _{{escape}}_)|{{text}}|Extract 
> substring matching SQL regular expression. See [Section 
> 9.7|https://www.postgresql.org/docs/11/functions-matching.html] for more 
> information on pattern matching.|{{substring('Thomas' from '%#"o_a#"_' for 
> '#')}}|{{oma}}|
> For example:
> {code:sql}
> -- T581 regular expression substring (with SQL's bizarre regexp syntax)
> SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
> {code}
> https://www.postgresql.org/docs/11/functions-string.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28076) Support regular expression substring

2019-06-17 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28076:
---

 Summary: Support regular expression substring
 Key: SPARK-28076
 URL: https://issues.apache.org/jira/browse/SPARK-28076
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


For example:

{code:sql}
-- T581 regular expression substring (with SQL's bizarre regexp syntax)
SELECT SUBSTRING('abcdefg' FROM 'a#"(b_d)#"%' FOR '#') AS "bcd";
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs

2019-06-17 Thread Chris Martin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865433#comment-16865433
 ] 

Chris Martin commented on SPARK-27463:
--

Sounds good to me too.

> Support Dataframe Cogroup via Pandas UDFs 
> --
>
> Key: SPARK-27463
> URL: https://issues.apache.org/jira/browse/SPARK-27463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Chris Martin
>Priority: Major
>
> Recent work on Pandas UDFs in Spark, has allowed for improved 
> interoperability between Pandas and Spark.  This proposal aims to extend this 
> by introducing a new Pandas UDF type which would allow for a cogroup 
> operation to be applied to two PySpark DataFrames.
> Full details are in the google document linked below.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28075) Enhance TRIM function

2019-06-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28075:


Assignee: Apache Spark

> Enhance TRIM function
> -
>
> Key: SPARK-28075
> URL: https://issues.apache.org/jira/browse/SPARK-28075
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Add support for the {{TRIM(BOTH/LEADING/TRAILING FROM str)}} format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28075) Enhance TRIM function

2019-06-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28075:


Assignee: (was: Apache Spark)

> Enhance TRIM function
> -
>
> Key: SPARK-28075
> URL: https://issues.apache.org/jira/browse/SPARK-28075
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Add support for the {{TRIM(BOTH/LEADING/TRAILING FROM str)}} format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28075) Enhance TRIM function

2019-06-17 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28075:
---

 Summary: Enhance TRIM function
 Key: SPARK-28075
 URL: https://issues.apache.org/jira/browse/SPARK-28075
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


Add support for the {{TRIM(BOTH/LEADING/TRAILING FROM str)}} format.
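
A minimal sketch contrasting the requested standard-SQL syntax with what can be 
written today (whitespace trimming only; {{trim}}/{{ltrim}}/{{rtrim}} already 
cover that case):

{code:java}
import org.apache.spark.sql.SparkSession

object TrimSyntaxSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("trim-syntax-sketch")
      .getOrCreate()
    // Requested form (standard SQL): SELECT TRIM(LEADING FROM '  hi  ');
    // Equivalents that already work:
    spark.sql(
      "SELECT ltrim('  hi  ') AS leading_trimmed, rtrim('  hi  ') AS trailing_trimmed, trim('  hi  ') AS both_trimmed"
    ).show()
    spark.stop()
  }
}
{code}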



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23128) A new approach to do adaptive execution in Spark SQL

2019-06-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23128:
---

Assignee: Carson Wang  (was: Maryann Xue)

> A new approach to do adaptive execution in Spark SQL
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Assignee: Carson Wang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> the performance. We can see this from EnsureRequirements rule when it adds 
> ExchangeCoordinator.  Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage, e.g. for 
> a join of 3 tables in a single stage the same ExchangeCoordinator should be used 
> in three Exchanges, but currently two separate ExchangeCoordinators will be 
> added. Thirdly, with the current framework it is not easy to implement other 
> features in adaptive execution flexibly like changing the execution plan and 
> handling skewed join at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]
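
For reference, a minimal sketch of enabling the existing adaptive path that this 
ticket proposes to rework (only {{spark.sql.adaptive.enabled}} is shown; the 
query itself is illustrative):

{code:java}
import org.apache.spark.sql.SparkSession

object AdaptiveExecutionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("adaptive-execution-sketch")
      .config("spark.sql.adaptive.enabled", "true") // adjust post-shuffle partitions at runtime
      .getOrCreate()
    import spark.implicits._

    val df = (1 to 100000).toDF("id")
    // The shuffle introduced by this aggregation is where the coordinator can
    // pick the reducer count based on actual map-output sizes.
    df.groupBy(($"id" % 10).as("bucket")).count().show()
    spark.stop()
  }
}
{code}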



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL

2019-06-17 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865396#comment-16865396
 ] 

Wenchen Fan commented on SPARK-23128:
-

To split credits, I'm re-assigning this ticket to [~carsonwang].

> A new approach to do adaptive execution in Spark SQL
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Assignee: Maryann Xue
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> the performance. We can see this from EnsureRequirements rule when it adds 
> ExchangeCoordinator.  Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage, e.g. for 
> a join of 3 tables in a single stage the same ExchangeCoordinator should be used 
> in three Exchanges, but currently two separate ExchangeCoordinators will be 
> added. Thirdly, with the current framework it is not easy to implement other 
> features in adaptive execution flexibly like changing the execution plan and 
> handling skewed join at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28074) [SS] Document caveats on using multiple stateful operations in single query

2019-06-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28074:


Assignee: Apache Spark

> [SS] Document caveats on using multiple stateful operations in single query
> ---
>
> Key: SPARK-28074
> URL: https://issues.apache.org/jira/browse/SPARK-28074
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> Please refer 
> [https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a@%3Cdev.spark.apache.org%3E]
>  to see rationalization of the issue.
> tl;dr. Using multiple stateful operators without fully understanding how the 
> global watermark affects the query can lead to correctness issues.
> While we may not want to fully block these operations, given that some users 
> could leverage them with an understanding of the global watermark, it is 
> necessary to warn end users about the possibility of broken correctness and 
> to provide alternative(s).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28074) [SS] Document caveats on using multiple stateful operations in single query

2019-06-17 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28074:


Assignee: (was: Apache Spark)

> [SS] Document caveats on using multiple stateful operations in single query
> ---
>
> Key: SPARK-28074
> URL: https://issues.apache.org/jira/browse/SPARK-28074
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Please refer 
> [https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a@%3Cdev.spark.apache.org%3E]
>  to see rationalization of the issue.
> tl;dr. Using multiple stateful operators without fully understanding how the 
> global watermark affects the query can lead to correctness issues.
> While we may not want to fully block these operations, given that some users 
> could leverage them with an understanding of the global watermark, it is 
> necessary to warn end users about the possibility of broken correctness and 
> to provide alternative(s).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28074) [SS] Document caveats on using multiple stateful operations in single query

2019-06-17 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-28074:


 Summary: [SS] Document caveats on using multiple stateful 
operations in single query
 Key: SPARK-28074
 URL: https://issues.apache.org/jira/browse/SPARK-28074
 Project: Spark
  Issue Type: Documentation
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


Please refer 
[https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a@%3Cdev.spark.apache.org%3E]
 to see rationalization of the issue.

tl;dr. Using multiple stateful operators without fully understanding how the 
global watermark affects the query can lead to correctness issues.

While we may not want to fully block these operations, given that some users 
could leverage them with an understanding of the global watermark, it is 
necessary to warn end users about the possibility of broken correctness and to 
provide alternative(s).
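
For illustration, a minimal sketch (source, columns, and thresholds are made up) 
of the kind of query the caveat targets: two stateful operators, streaming 
deduplication followed by a windowed aggregation, sharing one global watermark:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object MultiStatefulSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("multi-stateful-sketch")
      .getOrCreate()
    import spark.implicits._

    val events = spark.readStream
      .format("rate") // built-in test source producing (timestamp, value)
      .option("rowsPerSecond", "10")
      .load()

    val deduped = events
      .withWatermark("timestamp", "10 minutes") // stateful operator #1: dedup state
      .dropDuplicates("value", "timestamp")

    val counts = deduped
      .groupBy(window($"timestamp", "5 minutes")) // stateful operator #2: aggregation state
      .count()

    counts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
{code}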



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org