[jira] [Commented] (SPARK-32619) converting dataframe to dataset for the json schema

2020-08-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178731#comment-17178731
 ] 

Hyukjin Kwon commented on SPARK-32619:
--

I don't think this is a bug in Spark. We can confirm once we see the code 
and reproducible steps, but it doesn't look like a bug given the current information. 
I agree with [~viirya].

> converting dataframe to dataset for the json schema
> ---
>
> Key: SPARK-32619
> URL: https://issues.apache.org/jira/browse/SPARK-32619
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Manjay Kumar
>Priority: Minor
>
> I have a schema like this:
>  
> {
>   Details: [{
>     phone: "98977999",
>     contacts: [{
>       name: "manjay"
>       -- the street field is missing in the JSON
>     }]
>   }]
> }
>  
> Case classes, based on the schema:
> case class Details(
>   phone: String,
>   contacts: Array[Adress]
> )
>  
> case class Adress(
>   name: String,
>   street: String
> )
>  
>  
> The following throws an AnalysisException: "No such struct field street".
>  
> dataframe.as[Details]
>  
> Is this a bug, or is there a resolution for this?
>  
>  
>  
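A minimal sketch of a common workaround for the error described above (a hypothetical standalone example, not the reporter's code): the missing field is declared as an Option and the case-class schema is supplied up front, so the absent column resolves to null/None instead of failing to bind.

{code:scala}
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical case classes mirroring the report; "street" is optional because
// some JSON records do not contain it.
case class Address(name: String, street: Option[String])
case class Details(phone: String, contacts: Array[Address])

object MissingFieldExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("missing-field").getOrCreate()
    import spark.implicits._

    // Supplying the full case-class schema makes the absent "street" field a
    // null column instead of an unresolvable one, so .as[Details] can bind it.
    val schema = Encoders.product[Details].schema
    val df = spark.read.schema(schema).json("details.json") // hypothetical path
    val ds = df.as[Details]
    ds.show(false)
    spark.stop()
  }
}
{code}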



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32619) converting dataframe to dataset for the json schema

2020-08-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32619.
--
Resolution: Invalid

> converting dataframe to dataset for the json schema
> ---
>
> Key: SPARK-32619
> URL: https://issues.apache.org/jira/browse/SPARK-32619
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Manjay Kumar
>Priority: Minor
>
> I have a schema like this:
>  
> {
>   Details: [{
>     phone: "98977999",
>     contacts: [{
>       name: "manjay"
>       -- the street field is missing in the JSON
>     }]
>   }]
> }
>  
> Case classes, based on the schema:
> case class Details(
>   phone: String,
>   contacts: Array[Adress]
> )
>  
> case class Adress(
>   name: String,
>   street: String
> )
>  
>  
> The following throws an AnalysisException: "No such struct field street".
>  
> dataframe.as[Details]
>  
> Is this a bug, or is there a resolution for this?
>  
>  
>  






[jira] [Resolved] (SPARK-32633) GenericRowWithSchema cannot be cast to GenTraversableOnce

2020-08-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32633.
--
Resolution: Not A Problem

> GenericRowWithSchema cannot be cast to GenTraversableOnce
> -
>
> Key: SPARK-32633
> URL: https://issues.apache.org/jira/browse/SPARK-32633
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.6, 3.0.0
> Environment: my computer: MacOs 10.15.5
> server: CentOS Linux release 7.7.1908 (Core)
> spark: 2.4.6/3.0.0 pre-build for hadoop 2.7
>Reporter: ImportMengjie
>Priority: Major
>
> When I run this on the server using `spark-submit`, or on my computer from the 
> IDE, with spark-2.4.6, it works correctly.
> But when I use `spark-submit` on my computer with spark-2.4.6, or use 
> spark-3.0.0 on the server, the cast exception is thrown in `rdd.isEmpty()`.
> Here is my code:
> {code:java}
> val rdd = session.sparkContext
>   .parallelize(offsets, offsets.size)
>   .flatMap(offset => {
> val query  = s"${config.exec} SKIP ${offset.start} LIMIT ${offset.size}"
> val result = new Neo4jSessionAwareIterator(neo4jConfig, query, 
> Maps.newHashMap(), false)
> val fields = if (result.hasNext) result.peek().keys().asScala else List()
> val schema =
>   if (result.hasNext)
> StructType(
>   fields
> .map(k => (k, result.peek().get(k).`type`()))
> .map(keyType => CypherTypes.field(keyType)))
>   else new StructType()
> result.map(record => {
>   val row = new Array[Any](record.keys().size())
>   for (i <- row.indices)
> row.update(i, Executor.convert(record.get(i).asObject()))
>   new GenericRowWithSchema(values = row, schema).asInstanceOf[Row]
> })
>   })
> if (rdd.isEmpty())
>   throw new RuntimeException(
> "Please check your cypher sentence. because use it search nothing!")
> val schema = rdd.repartition(1).first().schema
> session.createDataFrame(rdd, schema){code}
> here is my exception msg (section):
> {code:java}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to scala.collection.GenTraversableOnce
> at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
> at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266)
> at scala.collection.Iterator.foreach(Iterator.scala:941)
> at scala.collection.Iterator.foreach$(Iterator.scala:941)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
> at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
> at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
> at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
> at scala.collection.AbstractIterator.to(Iterator.scala:1429)
> at 
> scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
> at 
> scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
> at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
> at 
> scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
> at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1409)
> Driver stacktrace:
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
> at org.apache.spark.rdd.RDD.take(RDD.scala:1382)
> at org.apache.spark.rdd.RDD.$anonfun$isEmpty$1(RDD.scala:1517)
> at 
> scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
> at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1517)
> at 
> com.xx.xx.tools.importer.reader.Neo4JReader.read(ServerBaseReader.scala:146){code}
>  
> Thank you for looking!!!




[jira] [Commented] (SPARK-32633) GenericRowWithSchema cannot be cast to GenTraversableOnce

2020-08-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178730#comment-17178730
 ] 

Hyukjin Kwon commented on SPARK-32633:
--

{{GenericRowWithSchema}} isn't supposed to be used as an API.
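A minimal sketch of the suggested direction (a standalone example, not the reporter's Neo4j pipeline): build rows through the public {{Row}} factory and pass the schema to {{createDataFrame}}, instead of constructing the internal {{GenericRowWithSchema}} class.

{code:scala}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object PublicRowExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("public-row").getOrCreate()

    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true)))

    // Rows built via the public factory carry no schema of their own;
    // the schema is supplied once when the DataFrame is created.
    val rowRdd = spark.sparkContext
      .parallelize(Seq(Seq(1, "a"), Seq(2, "b")))
      .map(values => Row.fromSeq(values))

    val df = spark.createDataFrame(rowRdd, schema)
    df.show()
    spark.stop()
  }
}
{code}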

> GenericRowWithSchema cannot be cast to GenTraversableOnce
> -
>
> Key: SPARK-32633
> URL: https://issues.apache.org/jira/browse/SPARK-32633
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.6, 3.0.0
> Environment: my computer: MacOs 10.15.5
> server: CentOS Linux release 7.7.1908 (Core)
> spark: 2.4.6/3.0.0 pre-build for hadoop 2.7
>Reporter: ImportMengjie
>Priority: Major
>
> When I run this on the server using `spark-submit`, or on my computer from the 
> IDE, with spark-2.4.6, it works correctly.
> But when I use `spark-submit` on my computer with spark-2.4.6, or use 
> spark-3.0.0 on the server, the cast exception is thrown in `rdd.isEmpty()`.
> Here is my code:
> {code:java}
> val rdd = session.sparkContext
>   .parallelize(offsets, offsets.size)
>   .flatMap(offset => {
> val query  = s"${config.exec} SKIP ${offset.start} LIMIT ${offset.size}"
> val result = new Neo4jSessionAwareIterator(neo4jConfig, query, 
> Maps.newHashMap(), false)
> val fields = if (result.hasNext) result.peek().keys().asScala else List()
> val schema =
>   if (result.hasNext)
> StructType(
>   fields
> .map(k => (k, result.peek().get(k).`type`()))
> .map(keyType => CypherTypes.field(keyType)))
>   else new StructType()
> result.map(record => {
>   val row = new Array[Any](record.keys().size())
>   for (i <- row.indices)
> row.update(i, Executor.convert(record.get(i).asObject()))
>   new GenericRowWithSchema(values = row, schema).asInstanceOf[Row]
> })
>   })
> if (rdd.isEmpty())
>   throw new RuntimeException(
> "Please check your cypher sentence. because use it search nothing!")
> val schema = rdd.repartition(1).first().schema
> session.createDataFrame(rdd, schema){code}
> here is my exception msg (section):
> {code:java}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to scala.collection.GenTraversableOnce
> at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
> at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266)
> at scala.collection.Iterator.foreach(Iterator.scala:941)
> at scala.collection.Iterator.foreach$(Iterator.scala:941)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
> at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
> at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
> at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
> at scala.collection.AbstractIterator.to(Iterator.scala:1429)
> at 
> scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
> at 
> scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
> at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
> at 
> scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
> at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1409)
> Driver stacktrace:
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
> at org.apache.spark.rdd.RDD.take(RDD.scala:1382)
> at org.apache.spark.rdd.RDD.$anonfun$isEmpty$1(RDD.scala:1517)
> at 
> scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
> at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1517)
> at 
> 

[jira] [Comment Edited] (SPARK-25390) Data source V2 API refactoring

2020-08-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178729#comment-17178729
 ] 

Hyukjin Kwon edited comment on SPARK-25390 at 8/17/20, 5:44 AM:


[~Kyrdan], let's interact on the mailing list or Stack Overflow; see 
https://spark.apache.org/community.html. That's the appropriate channel for 
asking questions.


was (Author: hyukjin.kwon):
[~Kyrdan], let's interact via https://spark.apache.org/community.html to ask 
questions. That's the appropriate channel for asking questions.

> Data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently it's not very clear how we should abstract the data source v2 API. The 
> abstraction should either be unified between batch and streaming, or be similar 
> with a well-defined difference between batch and streaming. The abstraction 
> should also include catalog/table.
> An example of the abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
> We should refactor the data source v2 API according to this abstraction.
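To make the layering above concrete, a purely illustrative Scala sketch (the trait and method names are hypothetical, not the actual DSv2 interfaces):

{code:scala}
// Hypothetical traits that only illustrate the catalog -> table -> (stream ->) scan
// layering shared by batch and streaming; they are not Spark's real DSv2 API.
case class InputPartition(id: Int)

trait Scan {
  def planPartitions(): Seq[InputPartition]              // physical read tasks
}

trait StreamHandle {
  def latestOffset(): Long
  def scanFor(startOffset: Long, endOffset: Long): Scan  // each micro-batch becomes a scan
}

trait Table {
  def newBatchScan(): Scan                               // batch: catalog -> table -> scan
  def newStream(): StreamHandle                          // streaming: catalog -> table -> stream -> scan
}

trait Catalog {
  def loadTable(name: String): Table
}
{code}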






[jira] [Commented] (SPARK-25390) Data source V2 API refactoring

2020-08-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178729#comment-17178729
 ] 

Hyukjin Kwon commented on SPARK-25390:
--

[~Kyrdan], let's interact via https://spark.apache.org/community.html to ask 
questions. That's the appropriate channel for asking questions.

> Data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently it's not very clear how we should abstract the data source v2 API. The 
> abstraction should either be unified between batch and streaming, or be similar 
> with a well-defined difference between batch and streaming. The abstraction 
> should also include catalog/table.
> An example of the abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
> We should refactor the data source v2 API according to this abstraction.






[jira] [Resolved] (SPARK-32611) Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect results when timestamp is present in predicate

2020-08-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32611.
--
Resolution: Cannot Reproduce

> Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect 
> results when timestamp is present in predicate
> 
>
> Key: SPARK-32611
> URL: https://issues.apache.org/jira/browse/SPARK-32611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Sumeet
>Priority: Major
>
> *How to reproduce this behavior?*
>  * TZ="America/Los_Angeles" ./bin/spark-shell
>  * sql("set spark.sql.hive.convertMetastoreOrc=true")
>  * sql("set spark.sql.orc.impl=hive")
>  * sql("create table t_spark(col timestamp) stored as orc;")
>  * sql("insert into t_spark values (cast('2100-01-01 
> 01:33:33.123America/Los_Angeles' as timestamp));")
>  * sql("select col, date_format(col, 'DD') from t_spark where col = 
> cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
>  *This will return empty results, which is incorrect.*
>  * sql("set spark.sql.orc.impl=native")
>  * sql("select col, date_format(col, 'DD') from t_spark where col = 
> cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
>  *This will return 1 row, which is the expected output.*
>  
> The above query (with convertMetastoreOrc=true and orc.impl=hive) returns 
> *correct results if pushdown filters are turned off*. 
>  * sql("set spark.sql.orc.filterPushdown=false")
>  * sql("select col, date_format(col, 'DD') from t_spark where col = 
> cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
>  *This will return 1 row, which is the expected output.*
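For convenience, the quoted steps consolidated into one copy-and-pastable spark-shell snippet (same statements as in the report; the shell must be started with TZ="America/Los_Angeles"):

{code:scala}
// Run inside: TZ="America/Los_Angeles" ./bin/spark-shell
sql("set spark.sql.hive.convertMetastoreOrc=true")
sql("set spark.sql.orc.impl=hive")
sql("create table t_spark(col timestamp) stored as orc")
sql("insert into t_spark values (cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp))")

// Reported bug: empty result with the hive reader and filter pushdown enabled.
sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp)").show(false)

// Expected single row with the native reader.
sql("set spark.sql.orc.impl=native")
sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp)").show(false)

// Also correct with the hive reader once pushdown is disabled.
sql("set spark.sql.orc.impl=hive")
sql("set spark.sql.orc.filterPushdown=false")
sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp)").show(false)
{code}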






[jira] [Comment Edited] (SPARK-32187) User Guide - Shipping Python Package

2020-08-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178728#comment-17178728
 ] 

Hyukjin Kwon edited comment on SPARK-32187 at 8/17/20, 5:39 AM:


The draft looks good as a start. A couple of comments from my cursory look:

- Let's make sure we have copy-and-pastable examples, and let's try to write the de 
facto standard guide, given that there are multiple other sites such as 
[http://alkaline-ml.com/2018-07-02-conda-spark/] and 
[https://jcristharif.com/venv-pack/spark.html].
- Let's place the section about shipping zip, egg and .py files at the top, 
and place pex and virtual environments at the bottom. Arguably it is more common 
to simply use {{--py-files}} or the {{spark.submit.pyFiles}} configuration to ship 
Python packages.

Let's open a PR and loop in other committers for more reviews. Shipping 
packages is a bit of a hairy area, and there are many other committers who have 
better insight than me, in particular about other cluster managers such as Mesos, 
Kubernetes, etc.

As for referencing your own stuff, it looks fine. It's okay to mention things 
as an FYI reference.

{quote}
there is no way to set the archives as a config param when not running on YARN. 
I checked the doc and the spark code. So it seems inconsistent. Can you check 
or confirm ?
{quote}

Yes, I think that's correct to my knowledge. We can just say it's supported 
on YARN only for now.

SPARK-13587 was not merged, so PySpark does not support it yet. Yes, it would not 
be in the doc, at least for now.



was (Author: hyukjin.kwon):
The draft looks good as a start. A couple of comments from my cursory look:

- Let's make sure we have copy-and-pastable examples, and let's try to write the de 
facto standard guide, given that there are multiple other sites such as 
[http://alkaline-ml.com/2018-07-02-conda-spark/] and 
[https://jcristharif.com/venv-pack/spark.html].
- Let's place the section about shipping zip, egg and .py files at the top, 
and place pex and virtual environments at the bottom. Arguably it is more common 
to simply use {{--py-files}} or the {{spark.submit.pyFiles}} configuration to ship 
Python packages.

Let's open a PR and loop in other committers for more reviews. Shipping 
packages is a bit of a hairy area, and there are many other committers who have 
better insight than me, in particular about other cluster managers such as Mesos, 
Kubernetes, etc.

As for referencing your own stuff, it looks fine. It's okay to mention things 
as an FYI reference.

{quote}
there is no way to set the archives as a config param when not running on YARN. 
I checked the doc and the spark code. So it seems inconsistent. Can you check 
or confirm ?
{quote}

Yes, I think that's correct to my knowledge.

SPARK-13587 was not merged, so PySpark does not support it yet. Yes, it would not 
be in the doc, at least for now.


> User Guide - Shipping Python Package
> 
>
> Key: SPARK-32187
> URL: https://issues.apache.org/jira/browse/SPARK-32187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> - Zipped file
> - Python files
> - PEX \(?\) (see also SPARK-25433)






[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-08-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178728#comment-17178728
 ] 

Hyukjin Kwon commented on SPARK-32187:
--

The draft looks good as a start. A couple of comments from my cursory look:

- Let's make sure we have copy-and-pastable examples, and let's try to write the de 
facto standard guide, given that there are multiple other sites such as 
[http://alkaline-ml.com/2018-07-02-conda-spark/] and 
[https://jcristharif.com/venv-pack/spark.html].
- Let's place the section about shipping zip, egg and .py files at the top, 
and place pex and virtual environments at the bottom. Arguably it is more common 
to simply use {{--py-files}} or the {{spark.submit.pyFiles}} configuration to ship 
Python packages.

Let's open a PR and loop in other committers for more reviews. Shipping 
packages is a bit of a hairy area, and there are many other committers who have 
better insight than me, in particular about other cluster managers such as Mesos, 
Kubernetes, etc.

As for referencing your own stuff, it looks fine. It's okay to mention things 
as an FYI reference.

{quote}
there is no way to set the archives as a config param when not running on YARN. 
I checked the doc and the spark code. So it seems inconsistent. Can you check 
or confirm ?
{quote}

Yes, I think that's correct to my knowledge.

SPARK-13587 was not merged, so PySpark does not support it yet. Yes, it would not 
be in the doc, at least for now.


> User Guide - Shipping Python Package
> 
>
> Key: SPARK-32187
> URL: https://issues.apache.org/jira/browse/SPARK-32187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> - Zipped file
> - Python files
> - PEX \(?\) (see also SPARK-25433)






[jira] [Resolved] (SPARK-32626) Do not increase the input metrics when read rdd from cache

2020-08-16 Thread Udbhav Agrawal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udbhav Agrawal resolved SPARK-32626.

Resolution: Not A Problem

> Do not increase the input metrics when read rdd from cache
> --
>
> Key: SPARK-32626
> URL: https://issues.apache.org/jira/browse/SPARK-32626
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Udbhav Agrawal
>Priority: Minor
>
> Input metrics are already increased when the RDD is first computed, so it is not 
> correct to increment the input metrics again when the RDD is read from the cache.






[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal

2020-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178714#comment-17178714
 ] 

Apache Spark commented on SPARK-32018:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29448

> Fix UnsafeRow set overflowed decimal
> 
>
> Key: SPARK-32018
> URL: https://issues.apache.org/jira/browse/SPARK-32018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Allison Wang
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> There is a bug where writing an overflowed decimal into UnsafeRow succeeds, but 
> reading it back throws an ArithmeticException. The exception is thrown when 
> calling {{getDecimal}} on UnsafeRow with an input decimal whose precision is 
> greater than the requested precision. Setting the value of the overflowed decimal 
> to null when writing into UnsafeRow should fix this issue.






[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal

2020-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178713#comment-17178713
 ] 

Apache Spark commented on SPARK-32018:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29448

> Fix UnsafeRow set overflowed decimal
> 
>
> Key: SPARK-32018
> URL: https://issues.apache.org/jira/browse/SPARK-32018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Allison Wang
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> There is a bug where writing an overflowed decimal into UnsafeRow succeeds, but 
> reading it back throws an ArithmeticException. The exception is thrown when 
> calling {{getDecimal}} on UnsafeRow with an input decimal whose precision is 
> greater than the requested precision. Setting the value of the overflowed decimal 
> to null when writing into UnsafeRow should fix this issue.






[jira] [Resolved] (SPARK-32601) Issue in converting an RDD of Arrow RecordBatches in v3.0.0

2020-08-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32601.
--
Resolution: Not A Problem

> Issue in converting an RDD of Arrow RecordBatches in v3.0.0
> ---
>
> Key: SPARK-32601
> URL: https://issues.apache.org/jira/browse/SPARK-32601
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Tanveer
>Priority: Major
>
> The following simple code snippet for converting an RDD of Arrow 
> RecordBatches works perfectly in Spark v2.3.4.
>  
> {code:java}
> // code placeholder
> from pyspark.sql import SparkSession
> import pyspark
> import pyarrow as pa
> from pyspark.serializers import ArrowSerializer
>
> def _arrow_record_batch_dumps(rb):
>     # Fix for interoperability between pyarrow version >=0.15 and Spark's arrow version
>     # Streaming message protocol has changed, remove setting when upgrading spark.
>     import os
>     os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'
>
>     return bytearray(ArrowSerializer().dumps(rb))
>
> def rb_return(ardd):
>     data = [
>         pa.array(range(5), type='int16'),
>         pa.array([-10, -5, 0, None, 10], type='int32')
>     ]
>     schema = pa.schema([pa.field('c0', pa.int16()),
>                         pa.field('c1', pa.int32())],
>                        metadata={b'foo': b'bar'})
>     return pa.RecordBatch.from_arrays(data, schema=schema)
>
> if __name__ == '__main__':
>     spark = SparkSession \
>         .builder \
>         .appName("Python Arrow-in-Spark example") \
>         .getOrCreate()
>
>     # Enable Arrow-based columnar data transfers
>     spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>     sc = spark.sparkContext
>
>     ardd = spark.sparkContext.parallelize([0,1,2], 3)
>     ardd = ardd.map(rb_return)
>
>     from pyspark.sql.types import from_arrow_schema
>     from pyspark.sql.dataframe import DataFrame
>     from pyspark.serializers import ArrowSerializer, PickleSerializer, AutoBatchedSerializer
>
>     # Filter out and cache arrow record batches
>     ardd = ardd.filter(lambda x: isinstance(x, pa.RecordBatch)).cache()
>     ardd = ardd.map(_arrow_record_batch_dumps)
>
>     schema = pa.schema([pa.field('c0', pa.int16()),
>                         pa.field('c1', pa.int32())],
>                        metadata={b'foo': b'bar'})
>     schema = from_arrow_schema(schema)
>
>     jrdd = ardd._to_java_object_rdd()
>     jdf = spark._jvm.PythonSQLUtils.arrowPayloadToDataFrame(jrdd, schema.json(), spark._wrapped._jsqlContext)
>     df = DataFrame(jdf, spark._wrapped)
>     df._schema = schema
>     df.show()
> {code}
>  
> But after updating Spark to v3.0.0, the same functionality, with just 
> arrowPayloadToDataFrame() changed to toDataFrame(), doesn't work.
>  
> {code:java}
> // code placeholder
> from pyspark.sql import SparkSession
> import pyspark
> import pyarrow as pa
> #from pyspark.serializers import ArrowSerializer
>
> def dumps(batch):
>     import pyarrow as pa
>     import io
>     sink = io.BytesIO()
>     writer = pa.RecordBatchFileWriter(sink, batch.schema)
>     writer.write_batch(batch)
>     writer.close()
>     return sink.getvalue()
>
> def _arrow_record_batch_dumps(rb):
>     # Fix for interoperability between pyarrow version >=0.15 and Spark's arrow version
>     # Streaming message protocol has changed, remove setting when upgrading spark.
>     #import os
>     #os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'
>     #return bytearray(ArrowSerializer().dumps(rb))
>     return bytearray(dumps(rb))
>
> def rb_return(ardd):
>     data = [
>         pa.array(range(5), type='int16'),
>         pa.array([-10, -5, 0, None, 10], type='int32')
>     ]
>     schema = pa.schema([pa.field('c0', pa.int16()),
>                         pa.field('c1', pa.int32())],
>                        metadata={b'foo': b'bar'})
>     return pa.RecordBatch.from_arrays(data, schema=schema)
>
> if __name__ == '__main__':
>     spark = SparkSession \
>         .builder \
>         .appName("Python Arrow-in-Spark example") \
>         .getOrCreate()
>
>     # Enable Arrow-based columnar data transfers
>     spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>     sc = spark.sparkContext
>
>     ardd = spark.sparkContext.parallelize([0,1,2], 3)
>     ardd = ardd.map(rb_return)
>
>     from pyspark.sql.pandas.types import from_arrow_schema
>     from pyspark.sql.dataframe import DataFrame
>
>     # Filter out and cache arrow record batches
>     ardd = ardd.filter(lambda x: isinstance(x, pa.RecordBatch)).cache()
>     ardd = ardd.map(_arrow_record_batch_dumps)
>
>     schema = pa.schema([pa.field('c0', pa.int16()),
>                         pa.field('c1', pa.int32())],
>                        metadata={b'foo': b'bar'})
> 

[jira] [Created] (SPARK-32633) GenericRowWithSchema cannot be cast to GenTraversableOnce

2020-08-16 Thread ImportMengjie (Jira)
ImportMengjie created SPARK-32633:
-

 Summary: GenericRowWithSchema cannot be cast to GenTraversableOnce
 Key: SPARK-32633
 URL: https://issues.apache.org/jira/browse/SPARK-32633
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0, 2.4.6
 Environment: my computer: MacOs 10.15.5

server: CentOS Linux release 7.7.1908 (Core)

spark: 2.4.6/3.0.0 pre-build for hadoop 2.7
Reporter: ImportMengjie


When I run this on the server using `spark-submit`, or on my computer from the 
IDE, with spark-2.4.6, it works correctly.

But when I use `spark-submit` on my computer with spark-2.4.6, or use 
spark-3.0.0 on the server, the cast exception is thrown in `rdd.isEmpty()`.

Here is my code:
{code:java}
val rdd = session.sparkContext
  .parallelize(offsets, offsets.size)
  .flatMap(offset => {
val query  = s"${config.exec} SKIP ${offset.start} LIMIT ${offset.size}"
val result = new Neo4jSessionAwareIterator(neo4jConfig, query, 
Maps.newHashMap(), false)
val fields = if (result.hasNext) result.peek().keys().asScala else List()
val schema =
  if (result.hasNext)
StructType(
  fields
.map(k => (k, result.peek().get(k).`type`()))
.map(keyType => CypherTypes.field(keyType)))
  else new StructType()
result.map(record => {
  val row = new Array[Any](record.keys().size())
  for (i <- row.indices)
row.update(i, Executor.convert(record.get(i).asObject()))
  new GenericRowWithSchema(values = row, schema).asInstanceOf[Row]
})
  })

if (rdd.isEmpty())
  throw new RuntimeException(
"Please check your cypher sentence. because use it search nothing!")
val schema = rdd.repartition(1).first().schema
session.createDataFrame(rdd, schema){code}
here is my exception msg (section):
{code:java}
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
to scala.collection.GenTraversableOnce
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1409)

Driver stacktrace:
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
at org.apache.spark.rdd.RDD.take(RDD.scala:1382)
at org.apache.spark.rdd.RDD.$anonfun$isEmpty$1(RDD.scala:1517)
at 
scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1517)
at 
com.xx.xx.tools.importer.reader.Neo4JReader.read(ServerBaseReader.scala:146){code}
 

Thank you for looking!!!






[jira] [Comment Edited] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment

2020-08-16 Thread chanduhawk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178560#comment-17178560
 ] 

chanduhawk edited comment on SPARK-32614 at 8/17/20, 3:49 AM:
--

[~srowen]

*Currently Spark cannot process a row that starts with the null character.*
If one of the rows in the data file (CSV) starts with the null (\u0000) character, 
like below (PFA screenshot):

*null*,abc,test

then Spark will throw the error mentioned in the description, i.e. Spark cannot 
process any row that starts with the null character. It can only process the 
row if we set an option like below:

option("comment","a character")

*comment* - it takes a character that is to be treated as the comment 
character, so Spark won't process rows that start with this character.

The above is a workaround to process rows that start with null. But this 
approach is also flawed and is liable to skip a valid row of data that happens 
to start with the comment character.
In data warehouses, most of the time there is no concept of comment characters, 
and all rows need to be processed.

So there should be an option in Spark which disables the processing of 
comment characters, like below:

option("enableProcessingComments", false)

This option would disable any comment-character processing, and would also allow 
processing rows that start with the null (\u0000) character.
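A small sketch of the existing workaround described above (hypothetical local path): any line starting with the configured comment character is silently dropped, which is exactly the behaviour this request asks to be able to disable.

{code:scala}
import org.apache.spark.sql.SparkSession

object CommentOptionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("csv-comment").getOrCreate()

    // Current behaviour: rows starting with the chosen comment character are skipped.
    val df = spark.read
      .option("delimiter", ",")
      .option("comment", "#")              // workaround: pick a character unlikely to appear in data
      .csv("file:///tmp/Testdata.dat")     // hypothetical path
    df.show(false)
    spark.stop()
  }
}
{code}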






was (Author: chanduhawk):
*Currently Spark cannot process a row that starts with the null character.*
If one of the rows in the data file (CSV) starts with the null (\u0000) character, 
like below (PFA screenshot):

*null*,abc,test

then Spark will throw the error mentioned in the description, i.e. Spark cannot 
process any row that starts with the null character. It can only process the 
row if we set an option like below:

option("comment","a character")

*comment* - it takes a character that is to be treated as the comment 
character, so Spark won't process rows that start with this character.

The above is a workaround to process rows that start with null. But this 
approach is also flawed and is liable to skip a valid row of data that happens 
to start with the comment character.
In data warehouses, most of the time there is no concept of comment characters, 
and all rows need to be processed.

So there should be an option in Spark which disables the processing of 
comment characters, like below:

option("enableProcessingComments", false)

This option would disable any comment-character processing, and would also allow 
processing rows that start with the null (\u0000) character.





> Support for treating the line as valid record if it starts with \u0000 or 
> null character, or starts with any character mentioned as comment
> ---
>
> Key: SPARK-32614
> URL: https://issues.apache.org/jira/browse/SPARK-32614
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: chanduhawk
>Assignee: Jeff Evans
>Priority: Major
> Attachments: screenshot-1.png
>
>
> In most data warehousing scenarios, files do not have comment records, and every 
> line needs to be treated as a valid record even if it starts with the default 
> comment character, i.e. the null (\u0000) character. Though the user can set any 
> comment character other than \u0000, there is a chance that an actual record can 
> start with that character.
> Currently, for the below piece of code and the given test data where the first 
> row starts with the null (\u0000) character, it will throw the below error.
> *eg: *val df = 
> spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat");
>   df.show(false);
> *+TestData+*
>  
>  !screenshot-1.png! 
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7
>   at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
>   at 
> com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
> *Note:*
> Though it's a limitation of the univocity parser and the workaround is to 
> provide another comment character via .option("comment","#"), if my actual 
> data starts with this character then the 

[jira] [Commented] (SPARK-32624) Replace getClass.getName with getClass.getCanonicalName in CodegenContext.addReferenceObj

2020-08-16 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178699#comment-17178699
 ] 

Yuming Wang commented on SPARK-32624:
-

Thank you [~srowen], I have updated the description.

> Replace getClass.getName with getClass.getCanonicalName in 
> CodegenContext.addReferenceObj
> -
>
> Key: SPARK-32624
> URL: https://issues.apache.org/jira/browse/SPARK-32624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:java}
> scala> Array[Byte](1, 2).getClass.getName
> res13: String = [B
> scala> Array[Byte](1, 2).getClass.getCanonicalName
> res14: String = byte[]
> {code}
> {{[B}} is not a correct Java type. We should use {{byte[]}}. Otherwise we will 
> hit a compile issue:
> {noformat}
> 20:49:54.885 ERROR 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: 
> ...
> /* 029 */ if (!isNull_2) {
> /* 030 */   value_1 = 
> org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
> references[0] /* min */)) >= 0 && 
> org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
> references[1] /* max */)) <= 0 ).mightContainBinary(value_2);
> ...
> 20:49:54.886 WARN org.apache.spark.sql.catalyst.expressions.Predicate: Expr 
> codegen error and falling back to interpreter mode
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 30, Column 81: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 30, Column 81: Unexpected token "[" in primary
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
> {noformat}
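As a small aside (a sketch independent of the codegen internals), the reason the canonical name matters is that the JVM binary name of a primitive array class is not valid Java source, while the canonical name is:

{code:scala}
// getName returns the JVM binary name "[B", which cannot appear in a Java cast
// such as "(([B) references[0])"; getCanonicalName returns "byte[]", which can.
val cls = Array[Byte](1, 2).getClass
println(cls.getName)           // [B
println(cls.getCanonicalName)  // byte[]
{code}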






[jira] [Updated] (SPARK-32624) Replace getClass.getName with getClass.getCanonicalName in CodegenContext.addReferenceObj

2020-08-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32624:

Description: 
{code:java}
scala> Array[Byte](1, 2).getClass.getName
res13: String = [B

scala> Array[Byte](1, 2).getClass.getCanonicalName
res14: String = byte[]
{code}

{{[B}} is not a correct Java type. We should use {{byte[]}}. Otherwise we will hit 
a compile issue:
{noformat}
20:49:54.885 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: 
...
/* 029 */ if (!isNull_2) {
/* 030 */   value_1 = 
org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
references[0] /* min */)) >= 0 && 
org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
references[1] /* max */)) <= 0 ).mightContainBinary(value_2);
...

20:49:54.886 WARN org.apache.spark.sql.catalyst.expressions.Predicate: Expr 
codegen error and falling back to interpreter mode
java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 30, 
Column 81: failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 30, Column 81: Unexpected token "[" in primary
at 
com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
{noformat}

  was:

{code:java}
scala> Array[Byte](1, 2).getClass.getName
res13: String = [B

scala> Array[Byte](1, 2).getClass.getCanonicalName
res14: String = byte[]
{code}


{noformat}
20:49:54.885 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: 
/* 001 */ public SpecificPredicate generate(Object[] references) {
/* 002 */   return new SpecificPredicate(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificPredicate extends 
org.apache.spark.sql.catalyst.expressions.BasePredicate {
/* 006 */   private final Object[] references;
/* 007 */
/* 008 */
/* 009 */   public SpecificPredicate(Object[] references) {
/* 010 */ this.references = references;
/* 011 */
/* 012 */   }
/* 013 */
/* 014 */   public void initialize(int partitionIndex) {
/* 015 */
/* 016 */   }
/* 017 */
/* 018 */   public boolean eval(InternalRow i) {
/* 019 */ boolean isNull_3 = i.isNullAt(0);
/* 020 */ UTF8String value_3 = isNull_3 ?
/* 021 */ null : (i.getUTF8String(0));
/* 022 */ boolean isNull_2 = isNull_3;
/* 023 */ byte[] value_2 = null;
/* 024 */ if (!isNull_3) {
/* 025 */   value_2 = value_3.getBytes();
/* 026 */ }
/* 027 */ boolean value_1 = true;
/* 028 */ boolean isNull_1 = isNull_2;
/* 029 */ if (!isNull_2) {
/* 030 */   value_1 = 
org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
references[0] /* min */)) >= 0 && 
org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
references[1] /* max */)) <= 0 && 
((org.apache.spark.util.sketch.BloomFilterImpl) references[2] /* bloomFilter 
*/).mightContainBinary(value_2);
/* 031 */ }
/* 032 */ return !isNull_1 && value_1;
/* 033 */   }
/* 034 */
/* 035 */
/* 036 */ }

20:49:54.886 WARN org.apache.spark.sql.catalyst.expressions.Predicate: Expr 
codegen error and falling back to interpreter mode
java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 30, 
Column 81: failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 30, Column 81: Unexpected token "[" in primary
at 
com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at 
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at 
com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1337)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:67)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:26)
{noformat}


> Replace getClass.getName with getClass.getCanonicalName in 
> CodegenContext.addReferenceObj
> 

[jira] [Updated] (SPARK-32624) Replace getClass.getName with getClass.getCanonicalName in CodegenContext.addReferenceObj

2020-08-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32624:

Description: 

{code:java}
scala> Array[Byte](1, 2).getClass.getName
res13: String = [B

scala> Array[Byte](1, 2).getClass.getCanonicalName
res14: String = byte[]
{code}


{noformat}
20:49:54.885 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: 
/* 001 */ public SpecificPredicate generate(Object[] references) {
/* 002 */   return new SpecificPredicate(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificPredicate extends 
org.apache.spark.sql.catalyst.expressions.BasePredicate {
/* 006 */   private final Object[] references;
/* 007 */
/* 008 */
/* 009 */   public SpecificPredicate(Object[] references) {
/* 010 */ this.references = references;
/* 011 */
/* 012 */   }
/* 013 */
/* 014 */   public void initialize(int partitionIndex) {
/* 015 */
/* 016 */   }
/* 017 */
/* 018 */   public boolean eval(InternalRow i) {
/* 019 */ boolean isNull_3 = i.isNullAt(0);
/* 020 */ UTF8String value_3 = isNull_3 ?
/* 021 */ null : (i.getUTF8String(0));
/* 022 */ boolean isNull_2 = isNull_3;
/* 023 */ byte[] value_2 = null;
/* 024 */ if (!isNull_3) {
/* 025 */   value_2 = value_3.getBytes();
/* 026 */ }
/* 027 */ boolean value_1 = true;
/* 028 */ boolean isNull_1 = isNull_2;
/* 029 */ if (!isNull_2) {
/* 030 */   value_1 = 
org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
references[0] /* min */)) >= 0 && 
org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
references[1] /* max */)) <= 0 && 
((org.apache.spark.util.sketch.BloomFilterImpl) references[2] /* bloomFilter 
*/).mightContainBinary(value_2);
/* 031 */ }
/* 032 */ return !isNull_1 && value_1;
/* 033 */   }
/* 034 */
/* 035 */
/* 036 */ }

20:49:54.886 WARN org.apache.spark.sql.catalyst.expressions.Predicate: Expr 
codegen error and falling back to interpreter mode
java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 30, 
Column 81: failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 30, Column 81: Unexpected token "[" in primary
at 
com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at 
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at 
com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1337)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:67)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:26)
{noformat}

  was:
Not all error messages are in {{CodeGenerator}}, such as:

{noformat}
20:49:54.885 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: 
/* 001 */ public SpecificPredicate generate(Object[] references) {
/* 002 */   return new SpecificPredicate(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificPredicate extends 
org.apache.spark.sql.catalyst.expressions.BasePredicate {
/* 006 */   private final Object[] references;
/* 007 */
/* 008 */
/* 009 */   public SpecificPredicate(Object[] references) {
/* 010 */ this.references = references;
/* 011 */
/* 012 */   }
/* 013 */
/* 014 */   public void initialize(int partitionIndex) {
/* 015 */
/* 016 */   }
/* 017 */
/* 018 */   public boolean eval(InternalRow i) {
/* 019 */ boolean isNull_3 = i.isNullAt(0);
/* 020 */ UTF8String value_3 = isNull_3 ?
/* 021 */ null : (i.getUTF8String(0));
/* 022 */ boolean isNull_2 = isNull_3;
/* 023 */ byte[] value_2 = null;
/* 024 */ if (!isNull_3) {
/* 025 */   value_2 = value_3.getBytes();
/* 026 */ }
/* 027 */ boolean value_1 = true;
/* 028 */ boolean isNull_1 = isNull_2;
/* 029 */ if (!isNull_2) {
/* 030 */   value_1 = 
org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
references[0] /* min */)) >= 

[jira] [Updated] (SPARK-32624) Replace getClass.getName with getClass.getCanonicalName in CodegenContext.addReferenceObj

2020-08-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32624:

Description: 
Not all error messages are in {{CodeGenerator}}, such as:

{noformat}
20:49:54.885 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: 
/* 001 */ public SpecificPredicate generate(Object[] references) {
/* 002 */   return new SpecificPredicate(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificPredicate extends 
org.apache.spark.sql.catalyst.expressions.BasePredicate {
/* 006 */   private final Object[] references;
/* 007 */
/* 008 */
/* 009 */   public SpecificPredicate(Object[] references) {
/* 010 */ this.references = references;
/* 011 */
/* 012 */   }
/* 013 */
/* 014 */   public void initialize(int partitionIndex) {
/* 015 */
/* 016 */   }
/* 017 */
/* 018 */   public boolean eval(InternalRow i) {
/* 019 */ boolean isNull_3 = i.isNullAt(0);
/* 020 */ UTF8String value_3 = isNull_3 ?
/* 021 */ null : (i.getUTF8String(0));
/* 022 */ boolean isNull_2 = isNull_3;
/* 023 */ byte[] value_2 = null;
/* 024 */ if (!isNull_3) {
/* 025 */   value_2 = value_3.getBytes();
/* 026 */ }
/* 027 */ boolean value_1 = true;
/* 028 */ boolean isNull_1 = isNull_2;
/* 029 */ if (!isNull_2) {
/* 030 */   value_1 = 
org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
references[0] /* min */)) >= 0 && 
org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
references[1] /* max */)) <= 0 && 
((org.apache.spark.util.sketch.BloomFilterImpl) references[2] /* bloomFilter 
*/).mightContainBinary(value_2);
/* 031 */ }
/* 032 */ return !isNull_1 && value_1;
/* 033 */   }
/* 034 */
/* 035 */
/* 036 */ }

20:49:54.886 WARN org.apache.spark.sql.catalyst.expressions.Predicate: Expr 
codegen error and falling back to interpreter mode
java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 30, 
Column 81: failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 30, Column 81: Unexpected token "[" in primary
at 
com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at 
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at 
com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1337)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:67)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:26)
{noformat}

  was:
{noformat}
20:49:54.885 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: 
/* 001 */ public SpecificPredicate generate(Object[] references) {
/* 002 */   return new SpecificPredicate(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificPredicate extends 
org.apache.spark.sql.catalyst.expressions.BasePredicate {
/* 006 */   private final Object[] references;
/* 007 */
/* 008 */
/* 009 */   public SpecificPredicate(Object[] references) {
/* 010 */ this.references = references;
/* 011 */
/* 012 */   }
/* 013 */
/* 014 */   public void initialize(int partitionIndex) {
/* 015 */
/* 016 */   }
/* 017 */
/* 018 */   public boolean eval(InternalRow i) {
/* 019 */ boolean isNull_3 = i.isNullAt(0);
/* 020 */ UTF8String value_3 = isNull_3 ?
/* 021 */ null : (i.getUTF8String(0));
/* 022 */ boolean isNull_2 = isNull_3;
/* 023 */ byte[] value_2 = null;
/* 024 */ if (!isNull_3) {
/* 025 */   value_2 = value_3.getBytes();
/* 026 */ }
/* 027 */ boolean value_1 = true;
/* 028 */ boolean isNull_1 = isNull_2;
/* 029 */ if (!isNull_2) {
/* 030 */   value_1 = 
org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
references[0] /* min */)) >= 0 && 
org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
references[1] /* max */)) <= 0 && 

[jira] [Updated] (SPARK-32632) Bad partitioning in spark jdbc method with parameter lowerBound and upperBound

2020-08-16 Thread Liu Dinghua (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu Dinghua updated SPARK-32632:

Description: 
When I use the jdbc method
{code:java}
def jdbc( url: String, table: String, columnName: String, lowerBound: Long, 
upperBound: Long, numPartitions: Int, connectionProperties: Properties)
{code}
 
  I am confused by the partitions generated by this method, because the rows of the 
first partition are not limited by the lowerBound and the rows of the last 
partition are not limited by the upperBound. 
  
 For example, I use the method as follows:
  
{code:java}
val data = spark.read.jdbc(url, table, "id", 2, 5, 3,buildProperties()) 
.selectExpr("id","appkey","funnel_name")
data.show(100, false)  
{code}
 

The resulting partition info is:

 20/08/05 16:58:59 INFO JDBCRelation: Number of partitions: 3, WHERE clauses of 
these partitions: `id` < 3 or `id` is null, `id` >= 3 AND `id` < 4, `id` >= 4

The returned data is:
||id|| appkey||funnel_name||
|0|yanshi|test001|
|1|yanshi|test002|
|2|yanshi|test003|
|3|xingkong|test_funnel|
|4|xingkong|test_funnel2|
|5|xingkong|test_funnel3|
|6|donews|test_funnel4|
|7|donews|test_funnel|
|8|donews|test_funnel2|
|9|dami|test_funnel3|
|13|dami|test_funnel4|
|15|xiaoai|test_funnel6|

 

Normally, the clause of the first partition should be " `id` >= 2 and `id` < 3 " 
because the lowerBound is 2, and the clause of the last partition should be " `id` >= 
4 and `id` < 5 ", but that is not what happens.

 

 
  

  was:
When i use the jdbc methed
{code:java}
def jdbc( url: String, table: String, columnName: String, lowerBound: Long, 
upperBound: Long, numPartitions: Int, connectionProperties: Properties)
{code}
 
  I am confused by the partitions generated by this method,  for  rows of the 
first partition aren't limited by the lowerBound and the ones of the last 
partition are not limited by the upperBound. 
  
 For example, I use the method  as follow:
  
{code:java}
val data = spark.read.jdbc(url, table, "id", 2, 5, 3,buildProperties()) 
.selectExpr("id","appkey","funnel_name")
data.show(100, false)  
{code}
 

The result partitions info is :

 20/08/05 16:58:59 INFO JDBCRelation: Number of partitions: 3, WHERE clauses of 
these partitions: `id` < 3 or `id` is null, `id` >= 3 AND `id` < 4, `id` >= 4

The returned data is:
||id|| appkey||funnel_name||
|0|yanshi|test001|
|1|yanshi|test002|
|2|yanshi|test003|
|3|xingkong|test_funnel|
|4|xingkong|test_funnel2|
|5|xingkong|test_funnel3|
|6|donews|test_funnel4|
|7|donews|test_funnel|
|8|donews|test_funnel2|
|9|dami|test_funnel3|
|13|dami|test_funnel4|
|15|xiaoai|test_funnel6|

 

Normally, the clause of the first partition is " `id` >=2 and `id` < 3 "  
because the lowerBound is 2, and the clause of the last partition is " `id` >= 
4 and `id` < 5 ",  but the facts are not.

 

 
  


> Bad partitioning in spark jdbc method with parameter lowerBound and upperBound
> --
>
> Key: SPARK-32632
> URL: https://issues.apache.org/jira/browse/SPARK-32632
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liu Dinghua
>Priority: Major
>
> When I use the jdbc method
> {code:java}
> def jdbc( url: String, table: String, columnName: String, lowerBound: Long, 
> upperBound: Long, numPartitions: Int, connectionProperties: Properties)
> {code}
>  
>   I am confused by the partitions generated by this method, because the rows of the 
> first partition are not limited by the lowerBound and the rows of the last 
> partition are not limited by the upperBound. 
>   
>  For example, I use the method as follows:
>   
> {code:java}
> val data = spark.read.jdbc(url, table, "id", 2, 5, 3,buildProperties()) 
> .selectExpr("id","appkey","funnel_name")
> data.show(100, false)  
> {code}
>  
> The resulting partition info is:
>  20/08/05 16:58:59 INFO JDBCRelation: Number of partitions: 3, WHERE clauses 
> of these partitions: `id` < 3 or `id` is null, `id` >= 3 AND `id` < 4, `id` 
> >= 4
> The returned data is:
> ||id|| appkey||funnel_name||
> |0|yanshi|test001|
> |1|yanshi|test002|
> |2|yanshi|test003|
> |3|xingkong|test_funnel|
> |4|xingkong|test_funnel2|
> |5|xingkong|test_funnel3|
> |6|donews|test_funnel4|
> |7|donews|test_funnel|
> |8|donews|test_funnel2|
> |9|dami|test_funnel3|
> |13|dami|test_funnel4|
> |15|xiaoai|test_funnel6|
>  
> Normally, the clause of the first partition should be " `id` >= 2 and `id` < 3 " 
> because the lowerBound is 2, and the clause of the last partition should be 
> " `id` >= 4 and `id` < 5 ", but that is not what happens.
>  
>  
>   
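For context, this behaviour follows from how JDBCRelation turns (lowerBound, upperBound, numPartitions) into WHERE clauses: the bounds only set the stride between partitions and are never applied as filters, so the first and last partitions stay open-ended by design. A minimal sketch of that logic, a simplification rather than the actual Spark source, is:

{code:scala}
// Simplified sketch of JDBCRelation-style partitioning: the bounds only decide the
// stride, they do not filter rows, hence the open-ended first and last partitions.
def partitionClauses(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
  val stride = (upper - lower) / numPartitions             // (5 - 2) / 3 = 1 for the example above
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    val hi = lo + stride
    if (i == 0) s"`$column` < $hi or `$column` is null"     // first partition: no lower bound
    else if (i == numPartitions - 1) s"`$column` >= $lo"    // last partition: no upper bound
    else s"`$column` >= $lo AND `$column` < $hi"
  }
}

// partitionClauses("id", 2, 5, 3) yields:
//   `id` < 3 or `id` is null, `id` >= 3 AND `id` < 4, `id` >= 4
// which matches the JDBCRelation log line quoted above, so rows with id outside [2, 5)
// still show up in the result, exactly as in the table above.
{code}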



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-32289) Chinese characters are garbled when opening csv files with Excel

2020-08-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-32289.
-
Resolution: Workaround

 !Workaround.png! 

> Chinese characters are garbled when opening csv files with Excel
> 
>
> Key: SPARK-32289
> URL: https://issues.apache.org/jira/browse/SPARK-32289
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: Workaround.png, garbled.png
>
>
> How to reproduce this issue:
> {code:scala}
> spark.sql("SELECT '我爱中文' AS chinese").write.option("header", 
> "true").csv("/tmp/spark/csv")
> {code}
>  !garbled.png! 
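The attached screenshot is not reproduced here, so as a rough outline of the usual workaround: Excel guesses the encoding of a plain .csv file and misreads UTF-8 without a BOM, so either write the file in the encoding the local Excel expects or import it explicitly as UTF-8. A hedged sketch, assuming a Chinese-locale Excel that defaults to GBK (the exact steps in Workaround.png may differ):

{code:scala}
// Sketch of one common workaround: write the CSV in an encoding the local Excel
// opens natively instead of UTF-8 without BOM.
spark.sql("SELECT '我爱中文' AS chinese")
  .write
  .option("header", "true")
  .option("encoding", "GBK")   // CSV writer charset; a Chinese-locale Excel reads GBK directly
  .csv("/tmp/spark/csv-gbk")

// Alternatively keep UTF-8 and load the file through Excel's
// "Data -> From Text/CSV" import dialog, selecting UTF-8 as the file origin.
{code}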



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32601) Issue in converting an RDD of Arrow RecordBatches in v3.0.0

2020-08-16 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178691#comment-17178691
 ] 

L. C. Hsieh commented on SPARK-32601:
-

I think that we changed to use Arrow stream format in SPARK-23030. So directly 
changing arrowPayloadToDataFrame() -> toDataFrame() doesn't work.

> Issue in converting an RDD of Arrow RecordBatches in v3.0.0
> ---
>
> Key: SPARK-32601
> URL: https://issues.apache.org/jira/browse/SPARK-32601
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Tanveer
>Priority: Major
>
> The following simple code snippet for converting an RDD of Arrow 
> RecordBatches works perfectly in Spark v2.3.4.
>  
> {code:java}
> // code placeholder
> from pyspark.sql import SparkSession
> import pyspark
> import pyarrow as pa
> from pyspark.serializers import ArrowSerializer
> def _arrow_record_batch_dumps(rb):
> # Fix for interoperability between pyarrow version >=0.15 and Spark's 
> arrow version
> # Streaming message protocol has changed, remove setting when upgrading 
> spark.
> import os
> os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'
> 
> return bytearray(ArrowSerializer().dumps(rb))
> def rb_return(ardd):
> data = [
> pa.array(range(5), type='int16'),
> pa.array([-10, -5, 0, None, 10], type='int32')
> ]
> schema = pa.schema([pa.field('c0', pa.int16()),
> pa.field('c1', pa.int32())],
>metadata={b'foo': b'bar'})
> return pa.RecordBatch.from_arrays(data, schema=schema)
> if __name__ == '__main__':
> spark = SparkSession \
> .builder \
> .appName("Python Arrow-in-Spark example") \
> .getOrCreate()
> # Enable Arrow-based columnar data transfers
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> sc = spark.sparkContext
> ardd = spark.sparkContext.parallelize([0,1,2], 3)
> ardd = ardd.map(rb_return)
> from pyspark.sql.types import from_arrow_schema
> from pyspark.sql.dataframe import DataFrame
> from pyspark.serializers import ArrowSerializer, PickleSerializer, 
> AutoBatchedSerializer
> # Filter out and cache arrow record batches 
> ardd = ardd.filter(lambda x: isinstance(x, pa.RecordBatch)).cache()
> ardd = ardd.map(_arrow_record_batch_dumps)
> schema = pa.schema([pa.field('c0', pa.int16()),
> pa.field('c1', pa.int32())],
>metadata={b'foo': b'bar'})
> schema = from_arrow_schema(schema)
> jrdd = ardd._to_java_object_rdd()
> jdf = spark._jvm.PythonSQLUtils.arrowPayloadToDataFrame(jrdd, 
> schema.json(), spark._wrapped._jsqlContext)
> df = DataFrame(jdf, spark._wrapped)
> df._schema = schema
> df.show()
> {code}
>  
> But after updating Spark to v3.0.0, the same functionality with just 
> changing arrowPayloadToDataFrame() -> toDataFrame() doesn't work.
>  
> {code:java}
> // code placeholder
> from pyspark.sql import SparkSession
> import pyspark
> import pyarrow as pa
> #from pyspark.serializers import ArrowSerializer
>
> def dumps(batch):
>     import pyarrow as pa
>     import io
>     sink = io.BytesIO()
>     writer = pa.RecordBatchFileWriter(sink, batch.schema)
>     writer.write_batch(batch)
>     writer.close()
>     return sink.getvalue()
>
> def _arrow_record_batch_dumps(rb):
>     # Fix for interoperability between pyarrow version >=0.15 and Spark's arrow version
>     # Streaming message protocol has changed, remove setting when upgrading spark.
>     #import os
>     #os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'
>     #return bytearray(ArrowSerializer().dumps(rb))
>     return bytearray(dumps(rb))
>
> def rb_return(ardd):
>     data = [
>         pa.array(range(5), type='int16'),
>         pa.array([-10, -5, 0, None, 10], type='int32')
>     ]
>     schema = pa.schema([pa.field('c0', pa.int16()),
>                         pa.field('c1', pa.int32())],
>                        metadata={b'foo': b'bar'})
>     return pa.RecordBatch.from_arrays(data, schema=schema)
>
> if __name__ == '__main__':
>     spark = SparkSession \
>         .builder \
>         .appName("Python Arrow-in-Spark example") \
>         .getOrCreate()
>
>     # Enable Arrow-based columnar data transfers
>     spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>     sc = spark.sparkContext
>
>     ardd = spark.sparkContext.parallelize([0,1,2], 3)
>     ardd = ardd.map(rb_return)
>
>     from pyspark.sql.pandas.types import from_arrow_schema
>     from pyspark.sql.dataframe import DataFrame
>
>     # Filter out and cache arrow record batches
>     ardd = ardd.filter(lambda x: isinstance(x, pa.RecordBatch)).cache()
>     ardd = ardd.map(_arrow_record_batch_dumps)
>     schema = 
> 

[jira] [Created] (SPARK-32632) Bad partitioning in spark jdbc method with parameter lowerBound and upperBound

2020-08-16 Thread Liu Dinghua (Jira)
Liu Dinghua created SPARK-32632:
---

 Summary: Bad partitioning in spark jdbc method with parameter 
lowerBound and upperBound
 Key: SPARK-32632
 URL: https://issues.apache.org/jira/browse/SPARK-32632
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Liu Dinghua


When I use the jdbc method
{code:java}
def jdbc( url: String, table: String, columnName: String, lowerBound: Long, 
upperBound: Long, numPartitions: Int, connectionProperties: Properties)
{code}
 
 I am confused by the partitions generated by this method, because the rows of the 
first partition are not limited by the lowerBound and the rows of the last 
partition are not limited by the upperBound. 
 
For example, I use the method as follows:
 
{code:java}
val data = spark.read.jdbc(url, table, "id", 2, 5, 3,buildProperties()) 
.selectExpr("id","appkey","funnel_name")
data.show(100, false)  
{code}
 

The resulting partition info is:

 20/08/05 16:58:59 INFO JDBCRelation: Number of partitions: 3, WHERE clauses of 
these partitions: `id` < 3 or `id` is null, `id` >= 3 AND `id` < 4, `id` >= 4

The returned data is:
||id|| appkey||funnel_name||
|0|yanshi|test001|
|1|yanshi|test002|
|2|yanshi|test003|
|3|xingkong|test_funnel|
|4|xingkong|test_funnel2|
|5|xingkong|test_funnel3|
|6|donews|test_funnel4|
|7|donews|test_funnel|
|8|donews|test_funnel2|
|9|dami|test_funnel3|
|13|dami|test_funnel4|
|15|xiaoai|test_funnel6|

 

Normally, the clause of the first partition should be " `id` >= 2 and `id` < 3 " because 
the lowerBound is 2, and the clause of the last partition should be " `id` >= 4 ", but 
that is not what happens.

 

 
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32631) Handle Null error message in hive ThriftServer UI

2020-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32631:


Assignee: Apache Spark

> Handle Null error message in hive ThriftServer UI
> -
>
> Key: SPARK-32631
> URL: https://issues.apache.org/jira/browse/SPARK-32631
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Tianhan Hu
>Assignee: Apache Spark
>Priority: Major
>
> This fix is trying to prevent NullPointerException caused by Null error 
> message by wrapping the message with Option and using getOrElse to handle 
> Null value.
> The possible reason why the error message would be null is that java 
> Throwable allows the detailMessage to be null. However, in the render code we 
> assume that the error message would not be null and try to do indexOf() on the 
> string. Therefore, if some exception doesn't set the error message, it would 
> trigger NullPointerException in the hive ThriftServer UI.
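A minimal sketch of the guarding pattern described above, assuming a small helper that summarizes a Throwable for the page (an illustration, not the code in the linked pull request):

{code:scala}
// Wrap the possibly-null message in Option so the page never calls indexOf() on null.
def errorSummary(e: Throwable): String = {
  val detail = Option(e.getMessage).getOrElse("")        // Throwable.getMessage may legally be null
  val cut = detail.indexOf(':')
  if (cut >= 0) detail.substring(0, cut) else detail     // keep only the leading summary part
}
{code}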



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32631) Handle Null error message in hive ThriftServer UI

2020-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178680#comment-17178680
 ] 

Apache Spark commented on SPARK-32631:
--

User 'tianhanhu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29447

> Handle Null error message in hive ThriftServer UI
> -
>
> Key: SPARK-32631
> URL: https://issues.apache.org/jira/browse/SPARK-32631
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Tianhan Hu
>Priority: Major
>
> This fix is trying to prevent NullPointerException caused by Null error 
> message by wrapping the message with Option and using getOrElse to handle 
> Null value.
> The possible reason why the error message would be null is that java 
> Throwable allows the detailMessage to be null. However, in the render code we 
> assume that the error message would not be null and try to do indexOf() on the 
> string. Therefore, if some exception doesn't set the error message, it would 
> trigger NullPointerException in the hive ThriftServer UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32631) Handle Null error message in hive ThriftServer UI

2020-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32631:


Assignee: (was: Apache Spark)

> Handle Null error message in hive ThriftServer UI
> -
>
> Key: SPARK-32631
> URL: https://issues.apache.org/jira/browse/SPARK-32631
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Tianhan Hu
>Priority: Major
>
> This fix is trying to prevent NullPointerException caused by Null error 
> message by wrapping the message with Option and using getOrElse to handle 
> Null value.
> The possible reason why the error message would be null is that java 
> Throwable allows the detailMessage to be null. However, in the render code we 
> assume that the error message would not be null and try to do indexOf() on the 
> string. Therefore, if some exception doesn't set the error message, it would 
> trigger NullPointerException in the hive ThriftServer UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32627) Add showSessionLink parameter to SqlStatsPagedTable class in ThriftServerPage

2020-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32627:


Assignee: Apache Spark

> Add showSessionLink parameter to SqlStatsPagedTable class in ThriftServerPage
> -
>
> Key: SPARK-32627
> URL: https://issues.apache.org/jira/browse/SPARK-32627
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Tianhan Hu
>Assignee: Apache Spark
>Priority: Major
>
> Introduced showSessionLink argument to SqlStatsPagedTable class in 
> ThriftServerPage. When this argument is set to true, "Session ID" tooltip 
> will be shown to the user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32627) Add showSessionLink parameter to SqlStatsPagedTable class in ThriftServerPage

2020-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178679#comment-17178679
 ] 

Apache Spark commented on SPARK-32627:
--

User 'tianhanhu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29442

> Add showSessionLink parameter to SqlStatsPagedTable class in ThriftServerPage
> -
>
> Key: SPARK-32627
> URL: https://issues.apache.org/jira/browse/SPARK-32627
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Tianhan Hu
>Priority: Major
>
> Introduced showSessionLink argument to SqlStatsPagedTable class in 
> ThriftServerPage. When this argument is set to true, "Session ID" tooltip 
> will be shown to the user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32627) Add showSessionLink parameter to SqlStatsPagedTable class in ThriftServerPage

2020-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32627:


Assignee: (was: Apache Spark)

> Add showSessionLink parameter to SqlStatsPagedTable class in ThriftServerPage
> -
>
> Key: SPARK-32627
> URL: https://issues.apache.org/jira/browse/SPARK-32627
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Tianhan Hu
>Priority: Major
>
> Introduced showSessionLink argument to SqlStatsPagedTable class in 
> ThriftServerPage. When this argument is set to true, "Session ID" tooltip 
> will be shown to the user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32631) Handle Null error message in hive ThriftServer UI

2020-08-16 Thread Tianhan Hu (Jira)
Tianhan Hu created SPARK-32631:
--

 Summary: Handle Null error message in hive ThriftServer UI
 Key: SPARK-32631
 URL: https://issues.apache.org/jira/browse/SPARK-32631
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Tianhan Hu


This fix is trying to prevent NullPointerException caused by Null error message 
by wrapping the message with Option and using getOrElse to handle Null value.

The possible reason why the error message would be null is that java Throwable 
allows the detailMessage to be null. However, in the render code we assume that 
the error message would not be null and try to do indexOf() on the string. 
Therefore, if some exception doesn't set the error message, it would trigger 
NullPointerException in the hive ThriftServer UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32342) Kafka events are missing magic byte

2020-08-16 Thread Sridhar Baddela (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178668#comment-17178668
 ] 

Sridhar Baddela commented on SPARK-32342:
-

This magic byte is specific to the Confluent schema registry; it is followed by the id 
of the Avro schema with which the payload is encoded. 
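Spark's built-in to_avro/from_avro work on plain Avro bytes and know nothing about that wire format (one magic byte plus a 4-byte schema id in front of the Avro payload), which explains the failures described in this issue. A commonly used workaround is to strip those five bytes before from_avro; a hedged sketch, where kafkaDf and the schema string are placeholders and spark-avro must be on the classpath:

{code:scala}
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.{col, expr}

// Placeholder Avro schema; replace with the schema registered in the schema registry.
val avroSchemaJson =
  """{"type":"record","name":"Event","fields":[{"name":"id","type":"long"}]}"""

val parsed = kafkaDf                                                        // kafkaDf: spark.readStream.format("kafka")...load()
  .select(expr("substring(value, 6, length(value) - 5)").as("payload"))     // drop magic byte + 4-byte schema id
  .select(from_avro(col("payload"), avroSchemaJson).as("event"))

// The reverse applies to to_avro: Spark emits plain Avro, so a Confluent consumer
// will not find the expected 5-byte header unless it is prepended separately.
{code}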

> Kafka events are missing magic byte
> ---
>
> Key: SPARK-32342
> URL: https://issues.apache.org/jira/browse/SPARK-32342
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: Pyspark 3.0.0, Python 3.7 Confluent cloud Kafka with 
> Schema registry 5.5 
>Reporter: Sridhar Baddela
>Priority: Major
>
> Please refer to the documentation link for to_avro and from_avro.[ 
> http://spark.apache.org/docs/latest/sql-data-sources-avro.html|http://spark.apache.org/docs/latest/sql-data-sources-avro.html]
> Tested the to_avro function by making sure that data is sent to Kafka topic. 
> But when a Confluent Avro consumer is used to read data from the same topic, 
> the consumer fails with an error message that event is missing the magic 
> byte. 
> Used another topic to simulate reads from Kafka and further deserialization 
> using from_avro. Use case is, use a Confluent Avro producer to produce a few 
> events. And when I attempt to read this topic using structured streaming and 
> applying the function from_avro, it fails with a message indicating that 
> malformed records are present. 
> Using from_avro (deserialization) and to_avro (serialization) from Spark 
> only works with Spark; other consumers outside of Spark which do not use 
> this approach fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32399) Support full outer join in shuffled hash join

2020-08-16 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32399.
--
Fix Version/s: 3.1.0
 Assignee: Cheng Su
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/29342

> Support full outer join in shuffled hash join
> -
>
> Key: SPARK-32399
> URL: https://issues.apache.org/jira/browse/SPARK-32399
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently for SQL full outer join, Spark always does a sort merge join no 
> matter how large the join children are. Inspired by the recent discussion 
> in [https://github.com/apache/spark/pull/29130#discussion_r456502678] and 
> [https://github.com/apache/spark/pull/29181], I think we can support full 
> outer join in shuffled hash join in the following way: when looking up stream side 
> keys in the build side {{HashedRelation}}, mark the matched rows inside the build side 
> {{HashedRelation}}, and after reading all rows from the stream side, output all 
> non-matching rows from the build side based on the modified {{HashedRelation}}.
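As a usage note: with this in place (3.1.0+), a full outer join can be steered into a shuffled hash join through the join strategy hints added in 3.0, whereas on earlier versions the hint is ignored for FULL OUTER and the plan falls back to sort merge join. A hedged sketch with placeholder DataFrames:

{code:scala}
// ordersDf / customersDf are placeholders; the hinted side becomes the build side.
val joined = ordersDf
  .join(customersDf.hint("SHUFFLE_HASH"), Seq("customer_id"), "full_outer")

joined.explain()
// Expected on Spark 3.1.0+: a ShuffledHashJoin ... FullOuter node in the physical plan.
// On 3.0.x the same query still plans a SortMergeJoin for FULL OUTER.
{code}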



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32630) Reduce user confusion and subtle bugs by optionally preventing date & timestamp comparison

2020-08-16 Thread Simeon Simeonov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178624#comment-17178624
 ] 

Simeon Simeonov commented on SPARK-32630:
-

[~rxin] fyi, one of the subtle issues that add friction for new users trying to be 
safely productive on the platform.

> Reduce user confusion and subtle bugs by optionally preventing date & 
> timestamp comparison
> --
>
> Key: SPARK-32630
> URL: https://issues.apache.org/jira/browse/SPARK-32630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: comparison, sql, timestamps
>
> https://issues.apache.org/jira/browse/SPARK-23549 made Spark's handling of 
> date vs. timestamp comparison consistent with SQL, which, unfortunately, 
> isn't consistent with common sense.
> When dates are compared with timestamps, they are promoted to timestamps at 
> midnight of the date, in the server timezone, which is almost always UTC. 
> This only works well if all timestamps in the data are logically time 
> instants as opposed to dates + times, which only become instants with a known 
> timezone.
> The fundamental issue is that dates are a human time concept and instants are 
> a machine time concept. While we can technically promote one to the other, 
> logically, it only works 100% if midnight for all dates in the system is in 
> the server timezone. 
> Every major modern platform offers a clear distinction between machine time 
> (instants) and human time (an instant with a timezone, UTC offset, etc.), 
> because we have learned the hard way that date & time handling is a 
> never-ending source of confusion and bugs. SQL, being an ancient language 
> (40+ years old), is well behind software engineering best practices; using it 
> as a guiding light is necessary for Spark to win market share, but 
> unfortunate in every other way.
> For example, Java has:
>  * java.time.LocalDate
>  * java.time.Instant
>  * java.time.ZonedDateTime
>  * java.time.OffsetDateTime
> I am not suggesting we add new data types to Spark. I am suggesting we go to 
> the heart of the matter, which is that most date vs. time handling issues are 
> the result of confusion or carelessness.
> What about introducing a new setting that makes comparisons between dates and 
> timestamps illegal, preferably with a helpful exception message?
> If it existed, I would certainly make it the default for all our clusters. 
> The minor coding convenience that comes from being able to compare dates & 
> timestamps with an automatic type promotion pales in comparison with the risk 
> of subtle bugs that remain undetected for a long time.
>  
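To make the promotion concrete: casting a date to a timestamp pins it to midnight in the session time zone, so the resulting instant, and with it any date-vs-timestamp comparison, silently depends on spark.sql.session.timeZone. A small sketch:

{code:scala}
// The same DATE literal becomes a different instant depending on the session
// time zone used for the implicit promotion.
Seq("UTC", "America/Los_Angeles").foreach { tz =>
  spark.conf.set("spark.sql.session.timeZone", tz)
  spark.sql(
    "SELECT unix_timestamp(CAST(DATE'2020-08-16' AS TIMESTAMP)) AS midnight_epoch"
  ).show()
}
// UTC                 -> 1597536000
// America/Los_Angeles -> 1597561200 (seven hours later), so `someDate < someTimestamp`
// can change its truth value purely because of the session time zone.
{code}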



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32630) Reduce user confusion and subtle bugs by optionally preventing date & timestamp comparison

2020-08-16 Thread Simeon Simeonov (Jira)
Simeon Simeonov created SPARK-32630:
---

 Summary: Reduce user confusion and subtle bugs by optionally 
preventing date & timestamp comparison
 Key: SPARK-32630
 URL: https://issues.apache.org/jira/browse/SPARK-32630
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Simeon Simeonov


https://issues.apache.org/jira/browse/SPARK-23549 made Spark's handling of date 
vs. timestamp comparison consistent with SQL, which, unfortunately, isn't 
consistent with common sense.

When dates are compared with timestamps, they are promoted to timestamps at 
midnight of the date, in the server timezone, which is almost always UTC. This 
only works well if all timestamps in the data are logically time instants as 
opposed to dates + times, which only become instants with a known timezone.

The fundamental issue is that dates are a human time concept and instants are a 
machine time concept. While we can technically promote one to the other, 
logically, it only works 100% if midnight for all dates in the system is in the 
server timezone. 

Every major modern platform offers a clear distinction between machine time 
(instants) and human time (an instant with a timezone, UTC offset, etc.), 
because we have learned the hard way that date & time handling is a 
never-ending source of confusion and bugs. SQL, being an ancient language (40+ 
years old), is well behind software engineering best practices; using it as a 
guiding light is necessary for Spark to win market share, but unfortunate in 
every other way.

For example, Java has:
 * java.time.LocalDate
 * java.time.Instant
 * java.time.ZonedDateTime
 * java.time.OffsetDateTime

I am not suggesting we add new data types to Spark. I am suggesting we go to 
the heart of the matter, which is that most date vs. time handling issues are 
the result of confusion or carelessness.

What about introducing a new setting that makes comparisons between dates and 
timestamps illegal, preferably with a helpful exception message?

If it existed, I would certainly make it the default for all our clusters. The 
minor coding convenience that comes from being able to compare dates & 
timestamps with an automatic type promotion pales in comparison with the risk 
of subtle bugs that remain undetected for a long time.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32385) Publish a "bill of materials" (BOM) descriptor for Spark with correct versions of various dependencies

2020-08-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178588#comment-17178588
 ] 

Sean R. Owen commented on SPARK-32385:
--

This requires us to fix every version of every transitive dependency. How does 
that get updated as the transitive dependency graph changes? This exchanges one 
problem for another, I think. That is, we are definitely not trying to pin 
dependency versions except where necessary.

Gradle isn't something that this project supports, but wouldn't this be a much 
bigger general problem if its resolution rules are different from Maven's? That 
is, surely Gradle can emulate Maven if necessary. (We have the same issue with 
SBT, which is why it is not used for builds or publishing artifacts.)

> Publish a "bill of materials" (BOM) descriptor for Spark with correct 
> versions of various dependencies
> --
>
> Key: SPARK-32385
> URL: https://issues.apache.org/jira/browse/SPARK-32385
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Vladimir Matveev
>Priority: Major
>
> Spark has a lot of dependencies, many of them very common (e.g. Guava, 
> Jackson). Also, versions of these dependencies are not updated as frequently 
> as they are released upstream, which is totally understandable and natural, 
> but which also means that often Spark has a dependency on a lower version of 
> a library, which is incompatible with a higher, more recent version of the 
> same library. This incompatibility can manifest in different ways, e.g as 
> classpath errors or runtime check errors (like with Jackson), in certain 
> cases.
>  
> Spark does attempt to "fix" versions of its dependencies by declaring them 
> explicitly in its {{pom.xml}} file. However, this approach, being somewhat 
> workable if the Spark-using project itself uses Maven, breaks down if another 
> build system is used, like Gradle. The reason is that Maven uses an 
> unconventional "nearest first" version conflict resolution strategy, while 
> many other tools like Gradle use the "highest first" strategy which resolves 
> the highest possible version number inside the entire graph of dependencies. 
> This means that other dependencies of the project can pull a higher version 
> of some dependency, which is incompatible with Spark.
>  
> One example would be an explicit or a transitive dependency on a higher 
> version of Jackson in the project. Spark itself depends on several modules of 
> Jackson; if only one of them gets a higher version, and others remain on the 
> lower version, this will result in runtime exceptions due to an internal 
> version check in Jackson.
>  
> A widely used solution for this kind of version issues is publishing of a 
> "bill of materials" descriptor (see here: 
> [https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html]
>  and here: 
> [https://docs.gradle.org/current/userguide/platforms.html#sub:bom_import]). 
> This descriptor would contain all versions of all dependencies of Spark; then 
> downstream projects will be able to use their build system's support for BOMs 
> to enforce version constraints required for Spark to function correctly.
>  
> One example of successful implementation of the BOM-based approach is Spring: 
> [https://www.baeldung.com/spring-maven-bom#spring-bom]. For different Spring 
> projects, e.g. Spring Boot, there are BOM descriptors published which can be 
> used in downstream projects to fix the versions of Spring components and 
> their dependencies, significantly reducing confusion around proper version 
> numbers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27708) Add documentation for v2 data sources

2020-08-16 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178583#comment-17178583
 ] 

Rafael commented on SPARK-27708:


[~rdblue] [~jlaskowski]

Hey guys, I'm trying to migrate my package, which uses the V2 DataSources, to 
Spark 3, and any docs/guides would be very useful to me and to everybody else 
who is using the V2 DataSources.

I added my plan here; could you share your knowledge?

https://issues.apache.org/jira/browse/SPARK-25390?focusedCommentId=17178052=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17178052

> Add documentation for v2 data sources
> -
>
> Key: SPARK-27708
> URL: https://issues.apache.org/jira/browse/SPARK-27708
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: documentation
>
> Before the 3.0 release, the new v2 data sources should be documented. This 
> includes:
>  * How to plug in catalog implementations
>  * Catalog plugin configuration
>  * Multi-part identifier behavior
>  * Partition transforms
>  * Table properties that are used to pass table info (e.g. "provider")



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32385) Publish a "bill of materials" (BOM) descriptor for Spark with correct versions of various dependencies

2020-08-16 Thread Vladimir Matveev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178581#comment-17178581
 ] 

Vladimir Matveev commented on SPARK-32385:
--

[~srowen] a BOM descriptor can be used as a "platform" in Gradle and most 
likely in Maven (I don't know for sure, but the concept of a BOM originates from 
Maven, so presumably the tool itself supports it) to enforce compatible version 
numbers in the dependency graph. Regular POMs cannot do this, because a 
regular POM forms just a single node in the dependency graph, and most 
dependency resolution tools take the entire graph of dependencies into account, 
which may result in accidentally bumped versions somewhere and requires manual, 
ad-hoc resolution in most cases. With a BOM, it is sufficient to tell the 
dependency engine to use the BOM to enforce versions, and that's 
it - the versions are then fixed to the versions declared by the framework 
(Spark in this case). As I said, the Spring framework uses this concept to great 
success to ensure that applications using Spring always have compatible and 
tested versions.

 

Naturally, the `deps/` files are just lists of jar files and cannot be used for 
dependency resolution.

 

Also note that such a BOM descriptor would make it possible to centralize the 
version declarations within the Spark project itself, so it would not be 
something "on top" to support, at least as far as I understand it.
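
By way of illustration (this is not from the ticket): without a BOM, a downstream 
project that happens to build with sbt has to do this alignment by hand. A rough 
build.sbt sketch, where the Jackson version is an assumed placeholder rather than 
a value taken from Spark's POM:

{code:scala}
// build.sbt fragment (sketch): without a BOM, the downstream project pins the
// Jackson modules itself so that none of them drifts ahead of the others.
// The version below is illustrative; the real value has to be read out of the
// pom.xml of the Spark release actually in use.
val jacksonVersion = "2.10.0"

dependencyOverrides ++= Seq(
  "com.fasterxml.jackson.core"   % "jackson-databind"      % jacksonVersion,
  "com.fasterxml.jackson.core"   % "jackson-core"          % jacksonVersion,
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % jacksonVersion
)
{code}

A published BOM (or Gradle platform) would make such hand-maintained lists 
unnecessary, because the build tool could import all of Spark's version 
constraints in one declaration.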

 

> Publish a "bill of materials" (BOM) descriptor for Spark with correct 
> versions of various dependencies
> --
>
> Key: SPARK-32385
> URL: https://issues.apache.org/jira/browse/SPARK-32385
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Vladimir Matveev
>Priority: Major
>
> Spark has a lot of dependencies, many of them very common (e.g. Guava, 
> Jackson). Also, versions of these dependencies are not updated as frequently 
> as they are released upstream, which is totally understandable and natural, 
> but which also means that often Spark has a dependency on a lower version of 
> a library, which is incompatible with a higher, more recent version of the 
> same library. This incompatibility can manifest in different ways, e.g as 
> classpath errors or runtime check errors (like with Jackson), in certain 
> cases.
>  
> Spark does attempt to "fix" versions of its dependencies by declaring them 
> explicitly in its {{pom.xml}} file. However, this approach, being somewhat 
> workable if the Spark-using project itself uses Maven, breaks down if another 
> build system is used, like Gradle. The reason is that Maven uses an 
> unconventional "nearest first" version conflict resolution strategy, while 
> many other tools like Gradle use the "highest first" strategy which resolves 
> the highest possible version number inside the entire graph of dependencies. 
> This means that other dependencies of the project can pull a higher version 
> of some dependency, which is incompatible with Spark.
>  
> One example would be an explicit or a transitive dependency on a higher 
> version of Jackson in the project. Spark itself depends on several modules of 
> Jackson; if only one of them gets a higher version, and others remain on the 
> lower version, this will result in runtime exceptions due to an internal 
> version check in Jackson.
>  
> A widely used solution for this kind of version issues is publishing of a 
> "bill of materials" descriptor (see here: 
> [https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html]
>  and here: 
> [https://docs.gradle.org/current/userguide/platforms.html#sub:bom_import]). 
> This descriptor would contain all versions of all dependencies of Spark; then 
> downstream projects will be able to use their build system's support for BOMs 
> to enforce version constraints required for Spark to function correctly.
>  
> One example of successful implementation of the BOM-based approach is Spring: 
> [https://www.baeldung.com/spring-maven-bom#spring-bom]. For different Spring 
> projects, e.g. Spring Boot, there are BOM descriptors published which can be 
> used in downstream projects to fix the versions of Spring components and 
> their dependencies, significantly reducing confusion around proper version 
> numbers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32611) Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect when timestamp is present in predicate

2020-08-16 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178576#comment-17178576
 ] 

L. C. Hsieh commented on SPARK-32611:
-

I also tested on branch-3.0, but still cannot reproduce it.

> Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect 
> when timestamp is present in predicate
> 
>
> Key: SPARK-32611
> URL: https://issues.apache.org/jira/browse/SPARK-32611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Sumeet
>Priority: Major
>
> *How to reproduce this behavior?*
>  * TZ="America/Los_Angeles" ./bin/spark-shell
>  * sql("set spark.sql.hive.convertMetastoreOrc=true")
>  * sql("set spark.sql.orc.impl=hive")
>  * sql("create table t_spark(col timestamp) stored as orc;")
>  * sql("insert into t_spark values (cast('2100-01-01 
> 01:33:33.123America/Los_Angeles' as timestamp));")
>  * sql("select col, date_format(col, 'DD') from t_spark where col = 
> cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
>  *This will return empty results, which is incorrect.*
>  * sql("set spark.sql.orc.impl=native")
>  * sql("select col, date_format(col, 'DD') from t_spark where col = 
> cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
>  *This will return 1 row, which is the expected output.*
>  
> The above query using (True, hive) returns *correct results if pushdown 
> filters are turned off*. 
>  * sql("set spark.sql.orc.filterPushdown=false")
>  * sql("select col, date_format(col, 'DD') from t_spark where col = 
> cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
>  *This will return 1 row, which is the expected output.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32611) Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect when timestamp is present in predicate

2020-08-16 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178571#comment-17178571
 ] 

L. C. Hsieh commented on SPARK-32611:
-

Hm, I built from the current master branch, but cannot reproduce it. Let me try 
branch-3.0 too.

> Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect 
> when timestamp is present in predicate
> 
>
> Key: SPARK-32611
> URL: https://issues.apache.org/jira/browse/SPARK-32611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Sumeet
>Priority: Major
>
> *How to reproduce this behavior?*
>  * TZ="America/Los_Angeles" ./bin/spark-shell
>  * sql("set spark.sql.hive.convertMetastoreOrc=true")
>  * sql("set spark.sql.orc.impl=hive")
>  * sql("create table t_spark(col timestamp) stored as orc;")
>  * sql("insert into t_spark values (cast('2100-01-01 
> 01:33:33.123America/Los_Angeles' as timestamp));")
>  * sql("select col, date_format(col, 'DD') from t_spark where col = 
> cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
>  *This will return empty results, which is incorrect.*
>  * sql("set spark.sql.orc.impl=native")
>  * sql("select col, date_format(col, 'DD') from t_spark where col = 
> cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
>  *This will return 1 row, which is the expected output.*
>  
> The above query using (True, hive) returns *correct results if pushdown 
> filters are turned off*. 
>  * sql("set spark.sql.orc.filterPushdown=false")
>  * sql("select col, date_format(col, 'DD') from t_spark where col = 
> cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false)
>  *This will return 1 row, which is the expected output.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32619) converting dataframe to dataset for the json schema

2020-08-16 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178563#comment-17178563
 ] 

L. C. Hsieh commented on SPARK-32619:
-

Could you show the schema of the dataframe, e.g. via {{dataframe.printSchema()}}? 
If the schema matches the case class {{Details}}, it should be possible to 
convert it to {{Dataset[Details]}}.
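
If {{printSchema()}} shows that {{street}} is simply absent (schema inference 
drops fields that never occur in the JSON), one workaround is to read with an 
explicit schema derived from the case classes so the column exists as null. A 
rough spark-shell sketch of that idea, assuming each input record maps directly 
onto {{Details}} (the path is a placeholder; the {{Adress}} spelling follows the 
report):

{code:scala}
import org.apache.spark.sql.Encoders
import spark.implicits._

case class Adress(name: String, street: String)
case class Details(phone: String, contacts: Array[Adress])

// Supplying the schema up front keeps the never-seen column (filled with null)
// instead of dropping it, which is what makes dataframe.as[Details] fail with
// "No such struct field street" on an inferred schema.
val schema = Encoders.product[Details].schema
val dataframe = spark.read.schema(schema).json("/path/to/input.json")

dataframe.printSchema()
val ds = dataframe.as[Details]
{code}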

> converting dataframe to dataset for the json schema
> ---
>
> Key: SPARK-32619
> URL: https://issues.apache.org/jira/browse/SPARK-32619
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Manjay Kumar
>Priority: Minor
>
> have a schema 
>  
> {
> Details :[{
> phone : "98977999"
> contacts: [{
> name:"manjay"
> -- has missing street block in json
> ]}
> ]}
>  
> }
>  
> Case class , based on schema
> case class Details (
>  phone : String,
> contacts : Array[Adress]
> )
>  
> case class Adress(
> name : String
> street : String
>  
> )
>  
>  
> this throws : No such struct field street - Analysis exception.
>  
> dataframe.as[Details]
>  
> Is this a bug ?? or there is a resolution for this.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer

2020-08-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-27249.
--
Resolution: Won't Fix

> Developers API for Transformers beyond UnaryTransformer
> ---
>
> Key: SPARK-27249
> URL: https://issues.apache.org/jira/browse/SPARK-27249
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Everett Rush
>Priority: Minor
>  Labels: starter
> Attachments: Screen Shot 2020-01-17 at 4.20.57 PM.png
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> It would be nice to have a developers' API for dataset transformations that 
> need more than one column from a row (ie UnaryTransformer inputs one column 
> and outputs one column) or that contain objects too expensive to initialize 
> repeatedly in a UDF such as a database connection. 
>  
> Design:
> Abstract class PartitionTransformer extends Transformer and defines the 
> partition transformation function as Iterator[Row] => Iterator[Row]
> NB: This parallels the UnaryTransformer createTransformFunc method
>  
> When developers subclass this transformer, they can provide their own schema 
> for the output Row in which case the PartitionTransformer creates a row 
> encoder and executes the transformation. Alternatively the developer can set 
> output Datatype and output col name. Then the PartitionTransformer class will 
> create a new schema, a row encoder, and execute the transformation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32629) Record metrics of extra BitSet/HashSet in full outer shuffled hash join

2020-08-16 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-32629:
-
Parent: SPARK-32461
Issue Type: Sub-task  (was: Improvement)

> Record metrics of extra BitSet/HashSet in full outer shuffled hash join
> ---
>
> Key: SPARK-32629
> URL: https://issues.apache.org/jira/browse/SPARK-32629
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> This is a followup coming from full outer shuffled hash join PR, where we 
> want to record metrics for size of extra BitSet/HashSet used in full outer 
> shuffled hash join, for matched build side rows. See detailed discussion in 
> [https://github.com/apache/spark/pull/29342/files#r470948744] .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32629) Record metrics of extra BitSet/HashSet in full outer shuffled hash join

2020-08-16 Thread Cheng Su (Jira)
Cheng Su created SPARK-32629:


 Summary: Record metrics of extra BitSet/HashSet in full outer 
shuffled hash join
 Key: SPARK-32629
 URL: https://issues.apache.org/jira/browse/SPARK-32629
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Cheng Su


This is a followup coming from full outer shuffled hash join PR, where we want 
to record metrics for size of extra BitSet/HashSet used in full outer shuffled 
hash join, for matched build side rows. See detailed discussion in 
[https://github.com/apache/spark/pull/29342/files#r470948744] .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32205) Writing timestamp in mysql gets fails

2020-08-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-32205.
--
Resolution: Not A Problem

> Writing timestamp in mysql gets fails 
> --
>
> Key: SPARK-32205
> URL: https://issues.apache.org/jira/browse/SPARK-32205
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.4
>Reporter: Nilesh Patil
>Priority: Major
>
> When we are writing to mysql with TIMESTAMP column it supports only range 
> '1970-01-01 00:00:01' UTC to '2038-01-19 03:14:07'. Mysql has DATETIME 
> datatype which has 1000-01-01 00:00:00' to '-12-31 23:59:59' range.
> How to map spark timestamp datatype to mysql datetime datatype in order to 
> use higher supporting range ?
> [https://dev.mysql.com/doc/refman/5.7/en/datetime.html]
>  
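
By way of illustration of the question in the description (this is not part of 
the ticket): one commonly used approach is to create the MySQL table yourself 
with a DATETIME column and let Spark only append to it, so Spark never generates 
the DDL that maps TimestampType to MySQL TIMESTAMP. A rough sketch, where the 
connection details, table and column names are placeholders:

{code:scala}
// Run once in MySQL:  CREATE TABLE events (id BIGINT, event_time DATETIME);
val props = new java.util.Properties()
props.setProperty("user", "user")          // placeholder credentials
props.setProperty("password", "password")

// df is assumed to have an id column and a TimestampType column event_time.
df.write
  .mode("append")                          // table already exists, so Spark emits no DDL
  .jdbc("jdbc:mysql://host:3306/db", "events", props)
{code}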



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment

2020-08-16 Thread chanduhawk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178560#comment-17178560
 ] 

chanduhawk edited comment on SPARK-32614 at 8/16/20, 5:39 PM:
--

*Currently Spark cannot process a row that starts with a null character.*
If one of the rows in the data file (CSV) starts with a null or \u0000 character, 
like below (see the attached screenshot):
*null*,abc,test

then Spark will throw the error mentioned in the description, i.e. Spark 
cannot process any row that starts with a null character. It can only process 
the row if we set an option like below:

option("comment", "a character")

*comment* - it takes a character that is to be treated as the comment 
character, so Spark won't process rows that start with this character.

The above is a workaround to process rows that start with null. But this 
approach is also flawed and can skip a valid row of data that happens to start 
with the comment character.
In data warehousing, most of the time there is no notion of comment characters 
and all rows need to be processed.

So there should be an option in Spark which disables the processing of 
comment characters, like below:

option("enableProcessingComments", false)

this option would disable any comment character processing and would also allow 
processing rows that start with the null or \u0000 character






was (Author: chanduhawk):
*currently  spark cannt process the row that starts with null character.*
If one of the rows the data file(CSV) starts with null or \u character like 
below(PFA screenshot)
like below
*null*,abc,test

then spark will throw the error as mentioned in the description. i.e spark 
cannt process any row that starts with null character. It can only process the 
row if we set the options like below

option("comment","a character")

comment - it will take a character that needs to be treated as comment 
character so spark wont process the row that starts with this character

The above is a work around to process the row that starts with null. But this 
process is also flawed and subject to skip a valid row of data that may start 
with the comment character.
In data ware house most of the time we dont have comment charatcers concept and 
all the rows needs to be processed.

So there should be an option in spark which will disable the processing of 
comment characters like below

option("enableProcessingComments", false)

this option will disable checking for any comment character processing.





> Support for treating the line as valid record if it starts with \u or 
> null character, or starts with any character mentioned as comment
> ---
>
> Key: SPARK-32614
> URL: https://issues.apache.org/jira/browse/SPARK-32614
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: chanduhawk
>Assignee: Jeff Evans
>Priority: Major
> Attachments: screenshot-1.png
>
>
> In most of the data ware housing scenarios files does not have comment 
> records and every line needs to be treated as a valid record even though it 
> starts with default comment character as \u or null character.Though user 
> can set any comment character other than \u, but there is a chance the 
> actual record can start with those characters.
> Currently for the below piece of code and the given testdata where first row 
> starts with null \u
> character it will throw the below error.
> *eg: *val df = 
> spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat");
>   df.show(false);
> *+TestData+*
>  
>  !screenshot-1.png! 
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7
>   at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
>   at 
> com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
> *Note:*
> Though its the limitation of the univocity parser and the workaround is to 
> provide any other comment character by mentioning .option("comment","#"), but 
> if my actual data starts with this character then the particular row will be 
> discarded.
> Currently I pushed the code in univocity parser to 

[jira] [Comment Edited] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment

2020-08-16 Thread chanduhawk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178560#comment-17178560
 ] 

chanduhawk edited comment on SPARK-32614 at 8/16/20, 5:37 PM:
--

*currently  spark cannt process the row that starts with null character.*
If one of the rows the data file(CSV) starts with null or \u character like 
below(PFA screenshot)
like below
*null*,abc,test

then spark will throw the error as mentioned in the description. i.e spark 
cannt process any row that starts with null character. It can only process the 
row if we set the options like below

option("comment","a character")

comment - it will take a character that needs to be treated as comment 
character so spark wont process the row that starts with this character

The above is a work around to process the row that starts with null. But this 
process is also flawed and subject to skip a valid row of data that may start 
with the comment character.
In data ware house most of the time we dont have comment charatcers concept and 
all the rows needs to be processed.

So there should be an option in spark which will disable the processing of 
comment characters like below

option("enableProcessingComments", false)

this option will disable checking for any comment character processing.






was (Author: chanduhawk):
*currently  spark cannt process the row that starts with null character.*
If one of the rows the data file(CSV) starts with null or \u character like 
below(PFA screenshot)
like below
**null*,abc,test*

then spark will throw the error as mentioned in the description. i.e spark 
cannt process any row that starts with null character. It can only process the 
row if we set the options like below

option("comment","a character")

comment - it will take a character that needs to be treated as comment 
character so spark wont process the row that starts with this character

The above is a work around to process the row that starts with null. But this 
process is also flawed and subject to skip a valid row of data that may start 
with the comment character.
In data ware house most of the time we dont have comment charatcers concept and 
all the rows needs to be processed.

So there should be an option in spark which will disable the processing of 
comment characters like below

option("enableProcessingComments", false)

this option will disable checking for any comment character processing.





> Support for treating the line as valid record if it starts with \u or 
> null character, or starts with any character mentioned as comment
> ---
>
> Key: SPARK-32614
> URL: https://issues.apache.org/jira/browse/SPARK-32614
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: chanduhawk
>Assignee: Jeff Evans
>Priority: Major
> Attachments: screenshot-1.png
>
>
> In most of the data ware housing scenarios files does not have comment 
> records and every line needs to be treated as a valid record even though it 
> starts with default comment character as \u or null character.Though user 
> can set any comment character other than \u, but there is a chance the 
> actual record can start with those characters.
> Currently for the below piece of code and the given testdata where first row 
> starts with null \u
> character it will throw the below error.
> *eg: *val df = 
> spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat");
>   df.show(false);
> *+TestData+*
>  
>  !screenshot-1.png! 
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7
>   at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
>   at 
> com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
> *Note:*
> Though its the limitation of the univocity parser and the workaround is to 
> provide any other comment character by mentioning .option("comment","#"), but 
> if my actual data starts with this character then the particular row will be 
> discarded.
> Currently I pushed the code in univocity parser to handle this scenario as 
> part of the below PR
> 

[jira] [Comment Edited] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment

2020-08-16 Thread chanduhawk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178560#comment-17178560
 ] 

chanduhawk edited comment on SPARK-32614 at 8/16/20, 5:37 PM:
--

*currently  spark cannt process the row that starts with null character.*
If one of the rows the data file(CSV) starts with null or \u character like 
below(PFA screenshot)
like below
**null*,abc,test*

then spark will throw the error as mentioned in the description. i.e spark 
cannt process any row that starts with null character. It can only process the 
row if we set the options like below

option("comment","a character")

comment - it will take a character that needs to be treated as comment 
character so spark wont process the row that starts with this character

The above is a work around to process the row that starts with null. But this 
process is also flawed and subject to skip a valid row of data that may start 
with the comment character.
In data ware house most of the time we dont have comment charatcers concept and 
all the rows needs to be processed.

So there should be an option in spark which will disable the processing of 
comment characters like below

option("enableProcessingComments", false)

this option will disable checking for any comment character processing.






was (Author: chanduhawk):
If one of the rows the data file(CSV) starts with null or \u character like 
below(PFA screenshot)
like below
**null*,abc,test*

then spark will throw the error as mentioned in the description. i.e spark 
cannt process any row that starts with null character. It can only process the 
row if we set the options like below

option("comment","a character")

comment - it will take a character that needs to be treated as comment 
character so spark wont process the row that starts with this character

The above is a work around to process the row that starts with null. But this 
process is also flawed and subject to skip a valid row of data that may start 
with the comment character.
In data ware house most of the time we dont have comment charatcers concept and 
all the rows needs to be processed.

So there should be an option in spark which will disable the processing of 
comment characters like below

option("enableProcessingComments", false)

this option will disable checking for any comment character processing.





> Support for treating the line as valid record if it starts with \u or 
> null character, or starts with any character mentioned as comment
> ---
>
> Key: SPARK-32614
> URL: https://issues.apache.org/jira/browse/SPARK-32614
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: chanduhawk
>Assignee: Jeff Evans
>Priority: Major
> Attachments: screenshot-1.png
>
>
> In most of the data ware housing scenarios files does not have comment 
> records and every line needs to be treated as a valid record even though it 
> starts with default comment character as \u or null character.Though user 
> can set any comment character other than \u, but there is a chance the 
> actual record can start with those characters.
> Currently for the below piece of code and the given testdata where first row 
> starts with null \u
> character it will throw the below error.
> *eg: *val df = 
> spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat");
>   df.show(false);
> *+TestData+*
>  
>  !screenshot-1.png! 
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7
>   at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
>   at 
> com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
> *Note:*
> Though its the limitation of the univocity parser and the workaround is to 
> provide any other comment character by mentioning .option("comment","#"), but 
> if my actual data starts with this character then the particular row will be 
> discarded.
> Currently I pushed the code in univocity parser to handle this scenario as 
> part of the below PR
> https://github.com/uniVocity/univocity-parsers/pull/412
> please accept the jira so that we can enable 

[jira] [Commented] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment

2020-08-16 Thread chanduhawk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178560#comment-17178560
 ] 

chanduhawk commented on SPARK-32614:


If one of the rows in the data file (CSV) starts with a null or \u0000 character, 
like below (see the attached screenshot):
*null*,abc,test

then Spark will throw the error mentioned in the description, i.e. Spark 
cannot process any row that starts with a null character. It can only process 
the row if we set an option like below:

option("comment", "a character")

comment - it takes a character that is to be treated as the comment 
character, so Spark won't process rows that start with this character.

The above is a workaround to process rows that start with null. But this 
approach is also flawed and can skip a valid row of data that happens to start 
with the comment character.
In data warehousing, most of the time there is no notion of comment characters 
and all rows need to be processed.

So there should be an option in Spark which disables the processing of 
comment characters, like below:

option("enableProcessingComments", false)

this option would disable any comment character processing.
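
A short sketch of both the documented workaround and the proposed switch; the 
second option is only the proposal in this ticket, not something Spark currently 
recognizes:

{code:scala}
// Current workaround: choose a comment character that hopefully never starts a
// real record, so rows beginning with \u0000 are no longer rejected.
val df = spark.read
  .option("delimiter", ",")
  .option("comment", "#")       // any row starting with '#' is silently skipped
  .csv("file:/E:/Data/Testdata.dat")

// Proposed here (hypothetical, not an existing Spark option): disable comment
// handling entirely so every line, including ones starting with \u0000, is parsed.
// val df2 = spark.read
//   .option("delimiter", ",")
//   .option("enableProcessingComments", "false")
//   .csv("file:/E:/Data/Testdata.dat")
{code}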





> Support for treating the line as valid record if it starts with \u or 
> null character, or starts with any character mentioned as comment
> ---
>
> Key: SPARK-32614
> URL: https://issues.apache.org/jira/browse/SPARK-32614
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: chanduhawk
>Assignee: Jeff Evans
>Priority: Major
> Attachments: screenshot-1.png
>
>
> In most of the data ware housing scenarios files does not have comment 
> records and every line needs to be treated as a valid record even though it 
> starts with default comment character as \u or null character.Though user 
> can set any comment character other than \u, but there is a chance the 
> actual record can start with those characters.
> Currently for the below piece of code and the given testdata where first row 
> starts with null \u
> character it will throw the below error.
> *eg: *val df = 
> spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat");
>   df.show(false);
> *+TestData+*
>  
>  !screenshot-1.png! 
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7
>   at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
>   at 
> com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
> *Note:*
> Though its the limitation of the univocity parser and the workaround is to 
> provide any other comment character by mentioning .option("comment","#"), but 
> if my actual data starts with this character then the particular row will be 
> discarded.
> Currently I pushed the code in univocity parser to handle this scenario as 
> part of the below PR
> https://github.com/uniVocity/univocity-parsers/pull/412
> please accept the jira so that we can enable this feature in spark-csv by 
> adding a parameter in spark csvoptions.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32502) Please fix CVE related to Guava 14.0.1

2020-08-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178559#comment-17178559
 ] 

Sean R. Owen commented on SPARK-32502:
--

Yes, it's shaded. The problem is that Hadoop < 3.2.1 and current Hive versions 
can't use the latest Guava, and that's all packaged together. Even if we wanted 
to update it - and we have wanted to forever - it won't quite work.

Generally, the answer is: is this CVE actually a problem? Scanners have no 
idea. I can't say for sure, but it doesn't look like it.

If the fix is in LimitedInputStream, maybe we can just apply the patch, as 
indeed we had to copy it to keep it working across Guava 11- and Guava 
14-dependent libraries (which may no longer be needed).

BTW, this has been duplicated a few times already.

> Please fix CVE related to Guava 14.0.1
> --
>
> Key: SPARK-32502
> URL: https://issues.apache.org/jira/browse/SPARK-32502
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Rodney Aaron Stainback
>Priority: Major
>
> Please fix the following CVE related to Guava 14.0.1
> |cve|severity|cvss|
> |CVE-2018-10237|medium|5.9|
>  
> Our security team is trying to block us from using spark because of this issue
>  
> One thing that's very weird is I see from this [pom 
> file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/pom.xml]]
>  you reference guava but it's not clear what version.
>  
> But if I look on 
> [maven|[https://mvnrepository.com/artifact/org.apache.spark/spark-network-common_2.12/3.0.0]]
>  the guava reference is not showing up
>  
> Is this reference somehow being shaded into the network common jar?  It's not 
> clear to me.
>  
> Also, I've noticed code like [this 
> file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/src/main/java/org/apache/spark/network/util/LimitedInputStream.java]]
>  which is a copy-paste of some guava source code.
>  
> The CVE scanner we use Twistlock/Palo Alto Networks - Prisma Cloud Compute 
> Edition is very thorough and will find CVEs in copy-pasted code and shaded 
> jars.
>  
> Please fix this CVE so we can use spark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32502) Please fix CVE related to Guava 14.0.1

2020-08-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-32502.
--
Resolution: Duplicate

> Please fix CVE related to Guava 14.0.1
> --
>
> Key: SPARK-32502
> URL: https://issues.apache.org/jira/browse/SPARK-32502
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Rodney Aaron Stainback
>Priority: Major
>
> Please fix the following CVE related to Guava 14.0.1
> |cve|severity|cvss|
> |CVE-2018-10237|medium|5.9|
>  
> Our security team is trying to block us from using spark because of this issue
>  
> One thing that's very weird is I see from this [pom 
> file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/pom.xml]]
>  you reference guava but it's not clear what version.
>  
> But if I look on 
> [maven|[https://mvnrepository.com/artifact/org.apache.spark/spark-network-common_2.12/3.0.0]]
>  the guava reference is not showing up
>  
> Is this reference somehow being shaded into the network common jar?  It's not 
> clear to me.
>  
> Also, I've noticed code like [this 
> file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/src/main/java/org/apache/spark/network/util/LimitedInputStream.java]]
>  which is a copy-paste of some guava source code.
>  
> The CVE scanner we use Twistlock/Palo Alto Networks - Prisma Cloud Compute 
> Edition is very thorough and will find CVEs in copy-pasted code and shaded 
> jars.
>  
> Please fix this CVE so we can use spark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32534) Cannot load a Pipeline Model on a stopped Spark Context

2020-08-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178558#comment-17178558
 ] 

Sean R. Owen commented on SPARK-32534:
--

Generally speaking, it's not going to work to stop and restart a SparkContext. 
If there's some easy way to fix it, sure, but lots of things can go wrong if you 
are doing that. A SparkContext lives exactly as long as the app.
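
A rough sketch of the pattern this implies for a long-running service (my own 
illustration, shown in Scala for brevity even though the report uses PySpark; 
object and path names are placeholders): create the session and load the model 
once, and stop them only when the whole service shuts down.

{code:scala}
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.{DataFrame, SparkSession}

object NlpService {
  private val nlpModelPath = "/models/nlp"   // placeholder for NLP_MODEL

  // One session and one loaded model for the lifetime of the service.
  lazy val spark: SparkSession =
    SparkSession.builder().appName("nlp-service").getOrCreate()

  lazy val pipeline: PipelineModel = PipelineModel.load(nlpModelPath)

  def annotate(df: DataFrame): DataFrame = pipeline.transform(df)

  // Call only when the service itself is shutting down, not per request.
  def shutdown(): Unit = spark.stop()
}
{code}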

> Cannot load a Pipeline Model on a stopped Spark Context
> ---
>
> Key: SPARK-32534
> URL: https://issues.apache.org/jira/browse/SPARK-32534
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Kubernetes
>Affects Versions: 2.4.6
>Reporter: Kevin Van Lieshout
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am running Spark in a Kubernetes cluster than is running Spark NLP using 
> the Pyspark ML Pipeline Model class to load the model and then transform on 
> the spark dataframe. We run this within a docker container that starts up a 
> spark context, mounts volumes, spins up executors, etc and then does it 
> transformations, udfs, etc and then closes down the spark context. The first 
> time I load the model when my service has just been started, everything is 
> fine. If I run my application for a second time without resetting my service, 
> even though the context is entirely stopped from the previous run and a new 
> one is started up, the Pipeline Model has some attribute in one of its base 
> classes that thinks the context its running on is closed, so then I get a : 
> cannot call a function on a stopped spark context when I try and load the 
> model in my service again. I have to shut down my service each time if I want 
> consecutive runs through my spark pipeline, which is not ideal, so I was 
> wondering if this was a common issue amongst fellow pyspark users that use 
> Pipeline Model, or is there a common work around to resetting all spark 
> contexts or whether the pipeline model caches a spark context of some sort. 
> Any help is very useful. 
>  
>  
> cls.pipeline = PipelineModel.read().load(NLP_MODEL)
>  
> is how I load the model. And our spark context is very similar to a typical 
> kubernetes/spark setup. Nothing special there



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32569) Gaussian can not handle data close to MaxDouble

2020-08-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178557#comment-17178557
 ] 

Sean R. Owen commented on SPARK-32569:
--

Please, if you can, narrow down the actual error in the description. Most of 
that is noise.
You can be clearer too: values of what? And is this GaussianMixtureModels?
They are not the same as k-means at all, of course. Yes, you are going to run 
into overflow / convergence problems with incredibly large values; when would 
input be that large?
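
For what it's worth, a common mitigation when feature values are that extreme is 
to rescale them before fitting. A rough sketch (my own illustration, not from the 
ticket; {{data}} is assumed to be a DataFrame with a vector column named 
{{features}}):

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.feature.MaxAbsScaler

// Rescale each feature into [-1, 1] so the Gaussian computations stay well away
// from the edge of the Double range before fitting the mixture model.
val scaler = new MaxAbsScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

val gmm = new GaussianMixture()
  .setFeaturesCol("scaledFeatures")
  .setK(2)

val model = new Pipeline().setStages(Array(scaler, gmm)).fit(data)
{code}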

> Gaussian can not handle data close to MaxDouble
> ---
>
> Key: SPARK-32569
> URL: https://issues.apache.org/jira/browse/SPARK-32569
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.0.0
> Environment: Running Spark in local mode within java application on 
> Windows 10
>Reporter: Tobias Haar
>Priority: Major
>
> Running Gaussian from Apache Spark MLlib with [this 
> dataset|[https://user.informatik.uni-goettingen.de/~sherbol/MaxDouble.arff]] 
> containing values close to MaxDouble (values >10^306) results in the error 
> below. KMeans and Bisecting KMeans can both handle the same dataset which for 
> me raises the question, if this would be a bug or to be expected behavior.
> Stacktrace:
> "org.apache.spark.SparkException: Failed to execute user defined 
> function(GaussianMixtureModel$$Lambda$2841/0x0001003ab040: 
> (struct,values:array>) => 
> struct,values:array>)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1070)
>  at 
> org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
>  at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:83)
>  at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$17.$anonfun$applyOrElse$71(Optimizer.scala:1502)
>  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>  at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$17.applyOrElse(Optimizer.scala:1502)
>  at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$17.applyOrElse(Optimizer.scala:1497)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291)
>  at 
> 

[jira] [Commented] (SPARK-32578) PageRank not sending the correct values in Pergel sendMessage

2020-08-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178554#comment-17178554
 ] 

Sean R. Owen commented on SPARK-32578:
--

I feel like this has come up a couple of times over the years and it was 
'correct', just an alternative formulation. Not 100% sure. Can you show that it 
gives incorrect results? The tests seem to show it's doing the right thing.

> PageRank not sending the correct values in Pergel sendMessage
> -
>
> Key: SPARK-32578
> URL: https://issues.apache.org/jira/browse/SPARK-32578
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: Shay Elbaz
>Priority: Major
>
> The core sendMessage method is incorrect:
> {code:java}
> def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
>  if (edge.srcAttr._2 > tol) {
>Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
>   // *** THIS ^ ***
>  } else {
>Iterator.empty
>  }
> }{code}
>  
> Instead of using the source PR value, it's using the PR delta (2nd tuple 
> arg). This is not the documented behavior, nor a valid PR algorithm AFAIK.
> This is a 7 years old code, all versions affected.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32604) Bug in ALSModel Python Documentation

2020-08-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178552#comment-17178552
 ] 

Sean R. Owen commented on SPARK-32604:
--

Yeah, the docs are generated from the code. The change to support only Python 
3.5 is in 3.1, so I think the docs also need the change in 3.1 only, which 
isn't released. Therefore this is a duplicate of SPARK-32138.

> Bug in ALSModel Python Documentation
> 
>
> Key: SPARK-32604
> URL: https://issues.apache.org/jira/browse/SPARK-32604
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Zach Cahoone
>Priority: Minor
>
> In the ALSModel documentation 
> ([https://spark.apache.org/docs/latest/ml-collaborative-filtering.html]), 
> there is a bug which causes data frame creation to fail with the following 
> error:
> {code:java}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 
> (TID 15, 10.0.0.133, executor 10): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, 
> in main
> process()
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, 
> in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 390, in dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "/usr/lib/spark/python/pyspark/rdd.py", line 1354, in takeUpToNumLeft
> yield next(iterator)
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in 
> wrapper
> return f(*args, **kwargs)
>   File "", line 24, in 
> NameError: name 'long' is not defined
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at 
> org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at 
> org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at 
> org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
>   

[jira] [Resolved] (SPARK-32604) Bug in ALSModel Python Documentation

2020-08-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-32604.
--
Resolution: Duplicate

> Bug in ALSModel Python Documentation
> 
>
> Key: SPARK-32604
> URL: https://issues.apache.org/jira/browse/SPARK-32604
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Zach Cahoone
>Priority: Minor
>
> In the ALSModel documentation 
> ([https://spark.apache.org/docs/latest/ml-collaborative-filtering.html]), 
> there is a bug which causes data frame creation to fail with the following 
> error:
> {code:java}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 
> (TID 15, 10.0.0.133, executor 10): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, 
> in main
> process()
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, 
> in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 390, in dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "/usr/lib/spark/python/pyspark/rdd.py", line 1354, in takeUpToNumLeft
> yield next(iterator)
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in 
> wrapper
> return f(*args, **kwargs)
>   File "", line 24, in 
> NameError: name 'long' is not defined
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at 
> org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at 
> org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at 
> org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> 
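For context, the NameError above comes from Python 2's long() builtin, which no longer exists in Python 3. A minimal sketch of the likely fix, assuming the documentation example parses the MovieLens ratings file as in the current ALS docs and that a SparkSession named spark already exists:

{code:python}
from pyspark.sql import Row

lines = spark.read.text("data/mllib/als/sample_movielens_ratings.txt").rdd
parts = lines.map(lambda row: row.value.split("::"))
# Use int(...) instead of the Python 2-only long(...); Python 3 ints are arbitrary precision.
ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
                                     rating=float(p[2]), timestamp=int(p[3])))
ratings = spark.createDataFrame(ratingsRDD)
{code}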

[jira] [Resolved] (SPARK-32610) Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper version

2020-08-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-32610.
--
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29426
[https://github.com/apache/spark/pull/29426]

> Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper 
> version
> --
>
> Key: SPARK-32610
> URL: https://issues.apache.org/jira/browse/SPARK-32610
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> There are links to metrics.dropwizard.io in monitoring.md, but the link 
> targets refer to version 3.1.0, while we use 4.1.1.
> Now that users can create their own metrics using the dropwizard library, 
> it's better to fix the links to refer to the proper version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32612) int columns produce inconsistent results on pandas UDFs

2020-08-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178543#comment-17178543
 ] 

Sean R. Owen commented on SPARK-32612:
--

I don't think it's correct to upgrade it to float in all cases. int overflow is 
simply what you have to deal with if you assert that you are using a 32-bit 
data type. Use long instead if you need more bytes.

> int columns produce inconsistent results on pandas UDFs
> ---
>
> Key: SPARK-32612
> URL: https://issues.apache.org/jira/browse/SPARK-32612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> This is similar to SPARK-30187 but I personally consider this data corruption.
> If I have a simple pandas UDF
> {code}
>  >>> def add(a, b):
> return a + b
>  >>> my_udf = pandas_udf(add, returnType=LongType())
> {code}
> And I want to process some data with it, say 32 bit ints
> {code}
> >>> df = spark.createDataFrame([(1037694399, 1204615848),(3,4)], 
> >>> StructType([StructField("a", IntegerType()), StructField("b", 
> >>> IntegerType())]))
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|    -2052657052|
> |         3|         4|              4|
> +----------+----------+---------------+
> {code}
> I get an integer overflow for the data as I would expect.  But as soon as I 
> add a {{None}} to the data, even on a different row the result I get back is 
> totally different.
> {code}
> >>> df = spark.createDataFrame([(1037694399, 1204615848),(3,None)], 
> >>> StructType([StructField("a", IntegerType()), StructField("b", 
> >>> IntegerType())]))
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|     2242310244|
> |         3|      null|           null|
> +----------+----------+---------------+
> {code}
> The integer overflow disappears.  This is because Arrow and/or pandas changes 
> the data type to a float in order to be able to store the null value, so the 
> processing is then done in floating point and there is no overflow.  This in 
> and of itself is annoying but understandable, because it is dealing with a 
> limitation in pandas. 
> Where it becomes a bug is that this happens on a per-batch basis.  This means 
> that I can have the same two rows in different parts of my data set and get 
> different results depending on their proximity to a null value.
> {code}
> >>> df = spark.createDataFrame([(1037694399, 
> >>> 1204615848),(3,None),(1037694399, 1204615848),(3,4)], 
> >>> StructType([StructField("a", IntegerType()), StructField("b", 
> >>> IntegerType())]))
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|     2242310244|
> |         3|      null|           null|
> |1037694399|1204615848|     2242310244|
> |         3|         4|              4|
> +----------+----------+---------------+
> >>> spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', '2')
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|     2242310244|
> |         3|      null|           null|
> |1037694399|1204615848|    -2052657052|
> |         3|         4|              4|
> +----------+----------+---------------+
> {code}
> For me personally, I would prefer to have all nullable integer columns 
> upgraded to float all the time; that way it is at least consistent.
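A minimal sketch of the workaround suggested in the comment above (widen to 64-bit before the pandas UDF so results do not depend on Arrow's per-batch type promotion); df and my_udf are the objects defined in the description:

{code:python}
from pyspark.sql.functions import col
from pyspark.sql.types import LongType

# Cast the 32-bit int columns to 64-bit before applying the pandas UDF,
# so a null in one batch cannot silently switch the arithmetic to float.
df64 = df.withColumn("a", col("a").cast(LongType())) \
         .withColumn("b", col("b").cast(LongType()))
df64.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
{code}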



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment

2020-08-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178537#comment-17178537
 ] 

Sean R. Owen commented on SPARK-32614:
--

I don't really understand this. Are you just saying you need to treat the null 
byte as a comment character? So set that as the comment char? What is the Spark 
issue here?

> Support for treating the line as valid record if it starts with \u0000 or 
> null character, or starts with any character mentioned as comment
> ---
>
> Key: SPARK-32614
> URL: https://issues.apache.org/jira/browse/SPARK-32614
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: chanduhawk
>Assignee: Jeff Evans
>Priority: Major
> Attachments: screenshot-1.png
>
>
> In most data warehousing scenarios, files do not have comment records and 
> every line needs to be treated as a valid record, even if it starts with the 
> default comment character \u0000 (the null character). Though the user can 
> set any comment character other than \u0000, there is a chance the actual 
> record can start with those characters.
> Currently, for the below piece of code and the given test data, where the 
> first row starts with the null \u0000 character, it will throw the below 
> error.
> *eg: *val df = 
> spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat");
>   df.show(false);
> *+TestData+*
>  
>  !screenshot-1.png! 
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7
>   at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
>   at 
> com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
> *Note:*
> Though this is a limitation of the univocity parser and the workaround is to 
> provide another comment character by setting .option("comment","#"), if my 
> actual data starts with that character then the particular row will be 
> discarded.
> Currently I have pushed code to the univocity parser to handle this scenario 
> as part of the below PR:
> https://github.com/uniVocity/univocity-parsers/pull/412
> Please accept the JIRA so that we can enable this feature in spark-csv by 
> adding a parameter to the Spark CSV options.
>  
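For reference, a minimal sketch of the existing comment option discussed above, reusing the file path from the example; the reporter's point is that any data row that happens to start with the chosen character is then dropped as a comment:

{code:python}
# '#' is treated as a comment marker: lines starting with it are skipped,
# which is exactly the trade-off described in the Note above.
df = spark.read.option("delimiter", ",").option("comment", "#") \
    .csv("file:/E:/Data/Testdata.dat")
df.show(truncate=False)
{code}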



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32618) ORC writer doesn't support colon in column names

2020-08-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-32618.
--
Resolution: Invalid

Maybe; this is likely a duplicate. You'd have to test, at least on 2.4.6 if not 
3.0, not on 2.2.x.

> ORC writer doesn't support colon in column names
> 
>
> Key: SPARK-32618
> URL: https://issues.apache.org/jira/browse/SPARK-32618
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Pierre Gramme
>Priority: Major
>
> Hi,
> I'm getting an {{IllegalArgumentException: Can't parse category at 
> 'struct'}} when exporting to ORC a dataframe whose column names 
> contain a colon ({{:}}). Reproducible as below. The same problem also occurs 
> if the name with a colon appears nested as a member of a struct.
> Seems related to SPARK-21791 (which was solved in 2.3.0).
> In my real-life case, the column was actually {{xsi:type}}, coming from some 
> parsed XML. Thus other users may be affected too.
> Has it been fixed after Spark 2.3.0? (Sorry, I can't test easily.)
> Any workaround? It would be acceptable for me to find and replace all colons 
> with underscores in column names, but that is not easy to do in a big set of 
> nested struct columns...
> Thanks
>  
>  
> {code:java}
>  spark.conf.set("spark.sql.orc.impl", "native")
>  val dfColon = Seq(1).toDF("a:b")
>  dfColon.printSchema()
>  dfColon.show()
>  dfColon.write.orc("test_colon")
>  // Fails with IllegalArgumentException: Can't parse category at 
> 'struct'
>  
>  import org.apache.spark.sql.functions.struct
>  val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b")
>  dfColonStruct.printSchema()
>  dfColonStruct.show()
>  dfColon.write.orc("test_colon_struct")
>  // Fails with IllegalArgumentException: Can't parse category at 
> 'struct>'
> {code}
>  
>  
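A minimal sketch of the rename workaround mentioned above, for top-level columns only (nested struct fields would still need to be rebuilt by hand, as the reporter notes); the PySpark equivalent of the Scala example is hypothetical:

{code:python}
# Hypothetical PySpark equivalent of the Scala example above.
dfColon = spark.createDataFrame([(1,)], ["a:b"])

# Replace ':' with '_' in top-level column names before writing ORC.
renamed = dfColon.toDF(*[c.replace(":", "_") for c in dfColon.columns])
renamed.printSchema()
renamed.write.orc("test_colon_renamed")
{code}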



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32624) Replace getClass.getName with getClass.getCanonicalName in CodegenContext.addReferenceObj

2020-08-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178534#comment-17178534
 ] 

Sean R. Owen commented on SPARK-32624:
--

Lots of your JIRAs are lacking descriptions or have no detail beyond some code. 
Could you please explain what the issue is and your proposed solution, if you 
have one, [~yumwang]?

> Replace getClass.getName with getClass.getCanonicalName in 
> CodegenContext.addReferenceObj
> -
>
> Key: SPARK-32624
> URL: https://issues.apache.org/jira/browse/SPARK-32624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> 20:49:54.885 ERROR 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: 
> /* 001 */ public SpecificPredicate generate(Object[] references) {
> /* 002 */   return new SpecificPredicate(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificPredicate extends 
> org.apache.spark.sql.catalyst.expressions.BasePredicate {
> /* 006 */   private final Object[] references;
> /* 007 */
> /* 008 */
> /* 009 */   public SpecificPredicate(Object[] references) {
> /* 010 */ this.references = references;
> /* 011 */
> /* 012 */   }
> /* 013 */
> /* 014 */   public void initialize(int partitionIndex) {
> /* 015 */
> /* 016 */   }
> /* 017 */
> /* 018 */   public boolean eval(InternalRow i) {
> /* 019 */ boolean isNull_3 = i.isNullAt(0);
> /* 020 */ UTF8String value_3 = isNull_3 ?
> /* 021 */ null : (i.getUTF8String(0));
> /* 022 */ boolean isNull_2 = isNull_3;
> /* 023 */ byte[] value_2 = null;
> /* 024 */ if (!isNull_3) {
> /* 025 */   value_2 = value_3.getBytes();
> /* 026 */ }
> /* 027 */ boolean value_1 = true;
> /* 028 */ boolean isNull_1 = isNull_2;
> /* 029 */ if (!isNull_2) {
> /* 030 */   value_1 = 
> org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
> references[0] /* min */)) >= 0 && 
> org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
> references[1] /* max */)) <= 0 && 
> ((org.apache.spark.util.sketch.BloomFilterImpl) references[2] /* bloomFilter 
> */).mightContainBinary(value_2);
> /* 031 */ }
> /* 032 */ return !isNull_1 && value_1;
> /* 033 */   }
> /* 034 */
> /* 035 */
> /* 036 */ }
> 20:49:54.886 WARN org.apache.spark.sql.catalyst.expressions.Predicate: Expr 
> codegen error and falling back to interpreter mode
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 30, Column 81: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 30, Column 81: Unexpected token "[" in primary
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
>   at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>   at 
> com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1337)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:26)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32336) 11 Critical & 4 High severity issues in Apache Spark 3.0.0 - dependency libraries

2020-08-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-32336.
--
Resolution: Invalid

Some of these are _Spark_ CVEs that are already resolved.
Some do not seem to affect Spark.
It isn't useful to simply dump the output of a static checker; which, if any, do 
you think affect Spark, and what's the resolution?
There is no further description here.

> 11 Critical & 4 High severity issues in Apache Spark 3.0.0 - dependency 
> libraries
> -
>
> Key: SPARK-32336
> URL: https://issues.apache.org/jira/browse/SPARK-32336
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Security
>Affects Versions: 3.0.0
> Environment: Generic Linux  - but these dependencies are in the 
> libraries that Spark pulls in.
> Given that several of these are several years old, and highly severe (remote 
> code execution is possible), these libraries are ripe for exploitation and it 
> is highly likely that exploits currently exist for these issues.
>  
> Please upgrade the dependent libraries and run the OWASP dependency check 
> prior to all future releases.
>Reporter: Albert Baker
>Priority: Major
>  Labels: easyfix, security
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> ||*[CVE-2018-1337|https://nvd.nist.gov/vuln/detail/CVE-2018-1337]*|In Apache 
> Directory LDAP API before 1.0.2,   - upgrade dependency to 1.0.2|
> ||*[CVE-2018-17190|https://nvd.nist.gov/vuln/detail/CVE-2018-17190]*|In all 
> versions of Apache Spark,|
> ||*[CVE-2017-15718|https://nvd.nist.gov/vuln/detail/CVE-2017-15718]*|The YARN 
> NodeManager in Apache Hadoop 2.7.3 and 2.7.4 - upgrade lib|
> ||*[CVE-2018-21234|https://nvd.nist.gov/vuln/detail/CVE-2018-21234]*|Jodd 
> before 5.0.4 performs Deserialization of Untrusted JSON Data when 
> setClassMetadataName is set.|
> ||*[CVE-2019-17571|https://nvd.nist.gov/vuln/detail/CVE-2019-17571]*|Included 
> in Log4j 1.2 is a SocketServer class that is vulnerable to deserialization of 
> untrusted data which can be exploited to remotely execute arbitrary code when 
> combined with a deserialization gadget when listening to untrusted network 
> traffic for log data. This affects Log4j versions up to 1.2 up to 1.2.17.|
> ||*[CVE-2018-17190|https://nvd.nist.gov/vuln/detail/CVE-2018-17190]*|In all 
> versions of Apache Spark, its standalone resource manager accepts code to 
> execute on a 'master' host, that then runs that code on 'worker|
> ||*[CVE-2020-9480|https://nvd.nist.gov/vuln/detail/CVE-2020-9480]*|In Apache 
> Spark 2.4.5 and earlier, a standalone resource manager's master may be 
> configured to require authentication (spark.authenticate) via a shared 
> secret.|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32342) Kafka events are missing magic byte

2020-08-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178531#comment-17178531
 ] 

Sean R. Owen commented on SPARK-32342:
--

Is the magic byte supposed to be part of Avro's spec or specific to Confluent 
somehow?

> Kafka events are missing magic byte
> ---
>
> Key: SPARK-32342
> URL: https://issues.apache.org/jira/browse/SPARK-32342
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: Pyspark 3.0.0, Python 3.7 Confluent cloud Kafka with 
> Schema registry 5.5 
>Reporter: Sridhar Baddela
>Priority: Major
>
> Please refer to the documentation for to_avro and from_avro: 
> http://spark.apache.org/docs/latest/sql-data-sources-avro.html
> I tested the to_avro function by making sure that data is sent to a Kafka 
> topic. But when a Confluent Avro consumer is used to read data from the same 
> topic, the consumer fails with an error message that the event is missing 
> the magic byte. 
> I used another topic to simulate reads from Kafka and further deserialization 
> using from_avro. The use case is: use a Confluent Avro producer to produce a 
> few events. When I attempt to read this topic using structured streaming and 
> apply the function from_avro, it fails with a message indicating that 
> malformed records are present. 
> Using from_avro (deserialization) and to_avro (serialization) from Spark 
> only works with Spark; other consumers outside of Spark which do not use 
> this approach fail.
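To make the failure mode above concrete: the "magic byte" belongs to Confluent's wire format (a 0x00 marker plus a 4-byte schema-registry id prepended to the Avro body), not to the Avro spec itself, and Spark's to_avro emits only the bare Avro body. A rough sketch of how a producer-side job could add that framing manually; the schema id, topic, servers, dataframe and column names are hypothetical:

{code:python}
import struct
from pyspark.sql.functions import col, struct as sql_struct, udf
from pyspark.sql.avro.functions import to_avro
from pyspark.sql.types import BinaryType

SCHEMA_ID = 42  # hypothetical id of the value schema in Schema Registry

@udf(returnType=BinaryType())
def confluent_frame(avro_bytes):
    # Confluent wire format: magic byte 0x00 + 4-byte big-endian schema id + Avro body.
    return bytes([0]) + struct.pack(">i", SCHEMA_ID) + bytes(avro_bytes)

framed = df.select(
    confluent_frame(to_avro(sql_struct(col("some_field")))).alias("value"))
framed.write.format("kafka") \
    .option("kafka.bootstrap.servers", "host:9092") \
    .option("topic", "events").save()
{code}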



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32359) Implement max_error metric evaluator for spark regression mllib

2020-08-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-32359.
--
Resolution: Invalid

> Implement max_error metric evaluator for spark regression mllib
> ---
>
> Key: SPARK-32359
> URL: https://issues.apache.org/jira/browse/SPARK-32359
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 3.0.0
>Reporter: Sayed Mohammad Hossein Torabi
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32385) Publish a "bill of materials" (BOM) descriptor for Spark with correct versions of various dependencies

2020-08-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178528#comment-17178528
 ] 

Sean R. Owen commented on SPARK-32385:
--

What does this record that isn't available from a POM and/or the 'deps/' files?
I get the problem about dependencies - total nightmare.
But do we want yet another descriptor to manage? It doesn't solve the problem.

> Publish a "bill of materials" (BOM) descriptor for Spark with correct 
> versions of various dependencies
> --
>
> Key: SPARK-32385
> URL: https://issues.apache.org/jira/browse/SPARK-32385
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Vladimir Matveev
>Priority: Major
>
> Spark has a lot of dependencies, many of them very common (e.g. Guava, 
> Jackson). Also, versions of these dependencies are not updated as frequently 
> as they are released upstream, which is totally understandable and natural, 
> but which also means that often Spark has a dependency on a lower version of 
> a library, which is incompatible with a higher, more recent version of the 
> same library. This incompatibility can manifest in different ways, e.g. as 
> classpath errors or runtime check errors (like with Jackson), in certain 
> cases.
>  
> Spark does attempt to "fix" versions of its dependencies by declaring them 
> explicitly in its {{pom.xml}} file. However, this approach, being somewhat 
> workable if the Spark-using project itself uses Maven, breaks down if another 
> build system is used, like Gradle. The reason is that Maven uses an 
> unconventional "nearest first" version conflict resolution strategy, while 
> many other tools like Gradle use the "highest first" strategy which resolves 
> the highest possible version number inside the entire graph of dependencies. 
> This means that other dependencies of the project can pull a higher version 
> of some dependency, which is incompatible with Spark.
>  
> One example would be an explicit or a transitive dependency on a higher 
> version of Jackson in the project. Spark itself depends on several modules of 
> Jackson; if only one of them gets a higher version, and others remain on the 
> lower version, this will result in runtime exceptions due to an internal 
> version check in Jackson.
>  
> A widely used solution for this kind of version issues is publishing of a 
> "bill of materials" descriptor (see here: 
> [https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html]
>  and here: 
> [https://docs.gradle.org/current/userguide/platforms.html#sub:bom_import]). 
> This descriptor would contain all versions of all dependencies of Spark; then 
> downstream projects will be able to use their build system's support for BOMs 
> to enforce version constraints required for Spark to function correctly.
>  
> One example of successful implementation of the BOM-based approach is Spring: 
> [https://www.baeldung.com/spring-maven-bom#spring-bom]. For different Spring 
> projects, e.g. Spring Boot, there are BOM descriptors published which can be 
> used in downstream projects to fix the versions of Spring components and 
> their dependencies, significantly reducing confusion around proper version 
> numbers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32628) Use bloom filter to improve dynamicPartitionPruning

2020-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32628:


Assignee: (was: Apache Spark)

> Use bloom filter to improve dynamicPartitionPruning
> ---
>
> Key: SPARK-32628
> URL: https://issues.apache.org/jira/browse/SPARK-32628
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> It will throw an exception when 
> {{spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly}} is 
> disabled:
> {code:sql}
> select catalog_sales.* from  catalog_sales join catalog_returns  where 
> cr_order_number = cs_sold_date_sk and cr_returned_time_sk < 4;
> {code}
> {noformat}
> 20/08/16 06:44:42 ERROR TaskSetManager: Total size of serialized results of 
> 494 tasks (1225.3 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
> {noformat}
> We can improve it with minimum, maximum and Bloom filter to reduce serialized 
> results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32628) Use bloom filter to improve dynamicPartitionPruning

2020-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32628:


Assignee: Apache Spark

> Use bloom filter to improve dynamicPartitionPruning
> ---
>
> Key: SPARK-32628
> URL: https://issues.apache.org/jira/browse/SPARK-32628
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> It will throw an exception when 
> {{spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly}} is 
> disabled:
> {code:sql}
> select catalog_sales.* from  catalog_sales join catalog_returns  where 
> cr_order_number = cs_sold_date_sk and cr_returned_time_sk < 4;
> {code}
> {noformat}
> 20/08/16 06:44:42 ERROR TaskSetManager: Total size of serialized results of 
> 494 tasks (1225.3 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
> {noformat}
> We can improve it with minimum, maximum and Bloom filter to reduce serialized 
> results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32628) Use bloom filter to improve dynamicPartitionPruning

2020-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178515#comment-17178515
 ] 

Apache Spark commented on SPARK-32628:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/29446

> Use bloom filter to improve dynamicPartitionPruning
> ---
>
> Key: SPARK-32628
> URL: https://issues.apache.org/jira/browse/SPARK-32628
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> It will throw an exception when 
> {{spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly}} is 
> disabled:
> {code:sql}
> select catalog_sales.* from  catalog_sales join catalog_returns  where 
> cr_order_number = cs_sold_date_sk and cr_returned_time_sk < 4;
> {code}
> {noformat}
> 20/08/16 06:44:42 ERROR TaskSetManager: Total size of serialized results of 
> 494 tasks (1225.3 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
> {noformat}
> We can improve it with minimum, maximum and Bloom filter to reduce serialized 
> results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32628) Use bloom filter to improve dynamicPartitionPruning

2020-08-16 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178511#comment-17178511
 ] 

Yuming Wang commented on SPARK-32628:
-

Benchmark:

{code:scala}
spark.range(200L)
  .select(col("id"), col("id").%(2000).as("k"))
  .write
  .partitionBy("k")
  .mode("overwrite")
  .saveAsTable("df1")

spark.range(20L)
  .select(col("id"), col("id").as("k"))
  .write
  .mode("overwrite")
  .saveAsTable("df2")

spark.sql("CREATE TABLE t_result1 USING parquet as SELECT df1.id, df2.k FROM 
df1 JOIN df2 ON df1.k = df2.k AND df2.id > 1500 AND df2.id < 10L")

{code}


> Use bloom filter to improve dynamicPartitionPruning
> ---
>
> Key: SPARK-32628
> URL: https://issues.apache.org/jira/browse/SPARK-32628
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> It will throw an exception when 
> {{spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly}} is 
> disabled:
> {code:sql}
> select catalog_sales.* from  catalog_sales join catalog_returns  where 
> cr_order_number = cs_sold_date_sk and cr_returned_time_sk < 4;
> {code}
> {noformat}
> 20/08/16 06:44:42 ERROR TaskSetManager: Total size of serialized results of 
> 494 tasks (1225.3 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
> {noformat}
> We can improve it with minimum, maximum and Bloom filter to reduce serialized 
> results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32092) CrossvalidatorModel does not save all submodels (it saves only 3)

2020-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178484#comment-17178484
 ] 

Apache Spark commented on SPARK-32092:
--

User 'Louiszr' has created a pull request for this issue:
https://github.com/apache/spark/pull/29445

> CrossvalidatorModel does not save all submodels (it saves only 3)
> -
>
> Key: SPARK-32092
> URL: https://issues.apache.org/jira/browse/SPARK-32092
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0, 2.4.5
> Environment: Ran on two systems:
>  * Local pyspark installation (Windows): spark 2.4.5
>  * Spark 2.4.0 on a cluster
>Reporter: An De Rijdt
>Priority: Major
>
> When saving a CrossValidatorModel with more than 3 subModels and loading it 
> again, a different number of subModels is returned; it seems 3 subModels are 
> always returned.
> With fewer than three subModels (so 2 folds), writing fails outright.
> Issue seems to be (but I am not so familiar with the scala/java side)
>  * python object is converted to scala/java
>  * in scala we save subModels until numFolds:
>  
> {code:java}
> val subModelsPath = new Path(path, "subModels") 
>for (splitIndex <- 0 until instance.getNumFolds) {
>   val splitPath = new Path(subModelsPath, 
> s"fold${splitIndex.toString}")
>   for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) {
> val modelPath = new Path(splitPath, paramIndex.toString).toString
> 
> instance.subModels(splitIndex)(paramIndex).asInstanceOf[MLWritable].save(modelPath)
>   }
> {code}
>  * numFolds is not available on the CrossValidatorModel in pyspark
>  * default numFolds is 3 so somehow it tries to save 3 subModels.
> The first issue can be reproduced by following failing tests, where spark is 
> a SparkSession and tmp_path is a (temporary) directory.
> {code:java}
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, 
> CrossValidatorModel
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.linalg import Vectors
> def test_save_load_cross_validator(spark, tmp_path):
> temp_path = str(tmp_path)
> dataset = spark.createDataFrame(
> [
> (Vectors.dense([0.0]), 0.0),
> (Vectors.dense([0.4]), 1.0),
> (Vectors.dense([0.5]), 0.0),
> (Vectors.dense([0.6]), 1.0),
> (Vectors.dense([1.0]), 1.0),
> ]
> * 10,
> ["features", "label"],
> )
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(
> estimator=lr,
> estimatorParamMaps=grid,
> evaluator=evaluator,
> collectSubModels=True,
> numFolds=4,
> )
> cvModel = cv.fit(dataset)
> # test save/load of CrossValidatorModel
> cvModelPath = temp_path + "/cvModel"
> cvModel.write().overwrite().save(cvModelPath)
> loadedModel = CrossValidatorModel.load(cvModelPath)
> assert len(loadedModel.subModels) == len(cvModel.subModels)
> {code}
>  
> The second as follows (will fail writing):
> {code:java}
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, 
> CrossValidatorModel
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.linalg import Vectors
> def test_save_load_cross_validator(spark, tmp_path):
> temp_path = str(tmp_path)
> dataset = spark.createDataFrame(
> [
> (Vectors.dense([0.0]), 0.0),
> (Vectors.dense([0.4]), 1.0),
> (Vectors.dense([0.5]), 0.0),
> (Vectors.dense([0.6]), 1.0),
> (Vectors.dense([1.0]), 1.0),
> ]
> * 10,
> ["features", "label"],
> )
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(
> estimator=lr,
> estimatorParamMaps=grid,
> evaluator=evaluator,
> collectSubModels=True,
> numFolds=2,
> )
> cvModel = cv.fit(dataset)
> # test save/load of CrossValidatorModel
> cvModelPath = temp_path + "/cvModel"
> cvModel.write().overwrite().save(cvModelPath)
> loadedModel = CrossValidatorModel.load(cvModelPath)
> assert len(loadedModel.subModels) == len(cvModel.subModels)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Commented] (SPARK-32092) CrossvalidatorModel does not save all submodels (it saves only 3)

2020-08-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178485#comment-17178485
 ] 

Apache Spark commented on SPARK-32092:
--

User 'Louiszr' has created a pull request for this issue:
https://github.com/apache/spark/pull/29445

> CrossvalidatorModel does not save all submodels (it saves only 3)
> -
>
> Key: SPARK-32092
> URL: https://issues.apache.org/jira/browse/SPARK-32092
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0, 2.4.5
> Environment: Ran on two systems:
>  * Local pyspark installation (Windows): spark 2.4.5
>  * Spark 2.4.0 on a cluster
>Reporter: An De Rijdt
>Priority: Major
>
> When saving a CrossValidatorModel with more than 3 subModels and loading it 
> again, a different number of subModels is returned; it seems 3 subModels are 
> always returned.
> With fewer than three subModels (so 2 folds), writing fails outright.
> Issue seems to be (but I am not so familiar with the scala/java side)
>  * python object is converted to scala/java
>  * in scala we save subModels until numFolds:
>  
> {code:java}
> val subModelsPath = new Path(path, "subModels") 
>for (splitIndex <- 0 until instance.getNumFolds) {
>   val splitPath = new Path(subModelsPath, 
> s"fold${splitIndex.toString}")
>   for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) {
> val modelPath = new Path(splitPath, paramIndex.toString).toString
> 
> instance.subModels(splitIndex)(paramIndex).asInstanceOf[MLWritable].save(modelPath)
>   }
> {code}
>  * numFolds is not available on the CrossValidatorModel in pyspark
>  * default numFolds is 3 so somehow it tries to save 3 subModels.
> The first issue can be reproduced by following failing tests, where spark is 
> a SparkSession and tmp_path is a (temporary) directory.
> {code:java}
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, 
> CrossValidatorModel
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.linalg import Vectors
> def test_save_load_cross_validator(spark, tmp_path):
> temp_path = str(tmp_path)
> dataset = spark.createDataFrame(
> [
> (Vectors.dense([0.0]), 0.0),
> (Vectors.dense([0.4]), 1.0),
> (Vectors.dense([0.5]), 0.0),
> (Vectors.dense([0.6]), 1.0),
> (Vectors.dense([1.0]), 1.0),
> ]
> * 10,
> ["features", "label"],
> )
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(
> estimator=lr,
> estimatorParamMaps=grid,
> evaluator=evaluator,
> collectSubModels=True,
> numFolds=4,
> )
> cvModel = cv.fit(dataset)
> # test save/load of CrossValidatorModel
> cvModelPath = temp_path + "/cvModel"
> cvModel.write().overwrite().save(cvModelPath)
> loadedModel = CrossValidatorModel.load(cvModelPath)
> assert len(loadedModel.subModels) == len(cvModel.subModels)
> {code}
>  
> The second as follows (will fail writing):
> {code:java}
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, 
> CrossValidatorModel
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.linalg import Vectors
> def test_save_load_cross_validator(spark, tmp_path):
> temp_path = str(tmp_path)
> dataset = spark.createDataFrame(
> [
> (Vectors.dense([0.0]), 0.0),
> (Vectors.dense([0.4]), 1.0),
> (Vectors.dense([0.5]), 0.0),
> (Vectors.dense([0.6]), 1.0),
> (Vectors.dense([1.0]), 1.0),
> ]
> * 10,
> ["features", "label"],
> )
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(
> estimator=lr,
> estimatorParamMaps=grid,
> evaluator=evaluator,
> collectSubModels=True,
> numFolds=2,
> )
> cvModel = cv.fit(dataset)
> # test save/load of CrossValidatorModel
> cvModelPath = temp_path + "/cvModel"
> cvModel.write().overwrite().save(cvModelPath)
> loadedModel = CrossValidatorModel.load(cvModelPath)
> assert len(loadedModel.subModels) == len(cvModel.subModels)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Assigned] (SPARK-32092) CrossvalidatorModel does not save all submodels (it saves only 3)

2020-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32092:


Assignee: (was: Apache Spark)

> CrossvalidatorModel does not save all submodels (it saves only 3)
> -
>
> Key: SPARK-32092
> URL: https://issues.apache.org/jira/browse/SPARK-32092
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0, 2.4.5
> Environment: Ran on two systems:
>  * Local pyspark installation (Windows): spark 2.4.5
>  * Spark 2.4.0 on a cluster
>Reporter: An De Rijdt
>Priority: Major
>
> When saving a CrossValidatorModel with more than 3 subModels and loading it 
> again, a different number of subModels is returned; it seems 3 subModels are 
> always returned.
> With fewer than three subModels (so 2 folds), writing fails outright.
> Issue seems to be (but I am not so familiar with the scala/java side)
>  * python object is converted to scala/java
>  * in scala we save subModels until numFolds:
>  
> {code:java}
> val subModelsPath = new Path(path, "subModels") 
>for (splitIndex <- 0 until instance.getNumFolds) {
>   val splitPath = new Path(subModelsPath, 
> s"fold${splitIndex.toString}")
>   for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) {
> val modelPath = new Path(splitPath, paramIndex.toString).toString
> 
> instance.subModels(splitIndex)(paramIndex).asInstanceOf[MLWritable].save(modelPath)
>   }
> {code}
>  * numFolds is not available on the CrossValidatorModel in pyspark
>  * default numFolds is 3 so somehow it tries to save 3 subModels.
> The first issue can be reproduced by following failing tests, where spark is 
> a SparkSession and tmp_path is a (temporary) directory.
> {code:java}
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, 
> CrossValidatorModel
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.linalg import Vectors
> def test_save_load_cross_validator(spark, tmp_path):
> temp_path = str(tmp_path)
> dataset = spark.createDataFrame(
> [
> (Vectors.dense([0.0]), 0.0),
> (Vectors.dense([0.4]), 1.0),
> (Vectors.dense([0.5]), 0.0),
> (Vectors.dense([0.6]), 1.0),
> (Vectors.dense([1.0]), 1.0),
> ]
> * 10,
> ["features", "label"],
> )
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(
> estimator=lr,
> estimatorParamMaps=grid,
> evaluator=evaluator,
> collectSubModels=True,
> numFolds=4,
> )
> cvModel = cv.fit(dataset)
> # test save/load of CrossValidatorModel
> cvModelPath = temp_path + "/cvModel"
> cvModel.write().overwrite().save(cvModelPath)
> loadedModel = CrossValidatorModel.load(cvModelPath)
> assert len(loadedModel.subModels) == len(cvModel.subModels)
> {code}
>  
> The second as follows (will fail writing):
> {code:java}
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, 
> CrossValidatorModel
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.linalg import Vectors
> def test_save_load_cross_validator(spark, tmp_path):
> temp_path = str(tmp_path)
> dataset = spark.createDataFrame(
> [
> (Vectors.dense([0.0]), 0.0),
> (Vectors.dense([0.4]), 1.0),
> (Vectors.dense([0.5]), 0.0),
> (Vectors.dense([0.6]), 1.0),
> (Vectors.dense([1.0]), 1.0),
> ]
> * 10,
> ["features", "label"],
> )
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(
> estimator=lr,
> estimatorParamMaps=grid,
> evaluator=evaluator,
> collectSubModels=True,
> numFolds=2,
> )
> cvModel = cv.fit(dataset)
> # test save/load of CrossValidatorModel
> cvModelPath = temp_path + "/cvModel"
> cvModel.write().overwrite().save(cvModelPath)
> loadedModel = CrossValidatorModel.load(cvModelPath)
> assert len(loadedModel.subModels) == len(cvModel.subModels)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32092) CrossvalidatorModel does not save all submodels (it saves only 3)

2020-08-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32092:


Assignee: Apache Spark

> CrossvalidatorModel does not save all submodels (it saves only 3)
> -
>
> Key: SPARK-32092
> URL: https://issues.apache.org/jira/browse/SPARK-32092
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0, 2.4.5
> Environment: Ran on two systems:
>  * Local pyspark installation (Windows): spark 2.4.5
>  * Spark 2.4.0 on a cluster
>Reporter: An De Rijdt
>Assignee: Apache Spark
>Priority: Major
>
> When saving a CrossValidatorModel with more than 3 subModels and loading it 
> again, a different number of subModels is returned; it seems 3 subModels are 
> always returned.
> With fewer than three subModels (so 2 folds), writing fails outright.
> Issue seems to be (but I am not so familiar with the scala/java side)
>  * python object is converted to scala/java
>  * in scala we save subModels until numFolds:
>  
> {code:java}
> val subModelsPath = new Path(path, "subModels") 
>for (splitIndex <- 0 until instance.getNumFolds) {
>   val splitPath = new Path(subModelsPath, 
> s"fold${splitIndex.toString}")
>   for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) {
> val modelPath = new Path(splitPath, paramIndex.toString).toString
> 
> instance.subModels(splitIndex)(paramIndex).asInstanceOf[MLWritable].save(modelPath)
>   }
> {code}
>  * numFolds is not available on the CrossValidatorModel in pyspark
>  * default numFolds is 3 so somehow it tries to save 3 subModels.
> The first issue can be reproduced by following failing tests, where spark is 
> a SparkSession and tmp_path is a (temporary) directory.
> {code:java}
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, 
> CrossValidatorModel
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.linalg import Vectors
> def test_save_load_cross_validator(spark, tmp_path):
> temp_path = str(tmp_path)
> dataset = spark.createDataFrame(
> [
> (Vectors.dense([0.0]), 0.0),
> (Vectors.dense([0.4]), 1.0),
> (Vectors.dense([0.5]), 0.0),
> (Vectors.dense([0.6]), 1.0),
> (Vectors.dense([1.0]), 1.0),
> ]
> * 10,
> ["features", "label"],
> )
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(
> estimator=lr,
> estimatorParamMaps=grid,
> evaluator=evaluator,
> collectSubModels=True,
> numFolds=4,
> )
> cvModel = cv.fit(dataset)
> # test save/load of CrossValidatorModel
> cvModelPath = temp_path + "/cvModel"
> cvModel.write().overwrite().save(cvModelPath)
> loadedModel = CrossValidatorModel.load(cvModelPath)
> assert len(loadedModel.subModels) == len(cvModel.subModels)
> {code}
>  
> The second as follows (will fail writing):
> {code:java}
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, 
> CrossValidatorModel
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.linalg import Vectors
> def test_save_load_cross_validator(spark, tmp_path):
> temp_path = str(tmp_path)
> dataset = spark.createDataFrame(
> [
> (Vectors.dense([0.0]), 0.0),
> (Vectors.dense([0.4]), 1.0),
> (Vectors.dense([0.5]), 0.0),
> (Vectors.dense([0.6]), 1.0),
> (Vectors.dense([1.0]), 1.0),
> ]
> * 10,
> ["features", "label"],
> )
> lr = LogisticRegression()
> grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(
> estimator=lr,
> estimatorParamMaps=grid,
> evaluator=evaluator,
> collectSubModels=True,
> numFolds=2,
> )
> cvModel = cv.fit(dataset)
> # test save/load of CrossValidatorModel
> cvModelPath = temp_path + "/cvModel"
> cvModel.write().overwrite().save(cvModelPath)
> loadedModel = CrossValidatorModel.load(cvModelPath)
> assert len(loadedModel.subModels) == len(cvModel.subModels)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32628) Use bloom filter to improve dynamicPartitionPruning

2020-08-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32628:

Description: 
It will throw an exception when 
{{spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly}} is disabled:
{code:sql}
select catalog_sales.* from  catalog_sales join catalog_returns  where 
cr_order_number = cs_sold_date_sk and cr_returned_time_sk < 4;
{code}
{noformat}
20/08/16 06:44:42 ERROR TaskSetManager: Total size of serialized results of 494 
tasks (1225.3 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
{noformat}

We can improve it with minimum, maximum and Bloom filter to reduce serialized 
results.

  was:
It will throw an exception when 
{{spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly}} is disabled:
{code:sql}
select catalog_sales.* from  catalog_sales join catalog_returns  where 
cr_order_number = cs_sold_date_sk and cr_returned_time_sk < 4;
{code}
{noformat}
20/08/16 06:44:42 ERROR TaskSetManager: Total size of serialized results of 494 
tasks (1225.3 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
{noformat}

We can improve it with minimum, maximum and bloom filter to reduce serialized 
results.


> Use bloom filter to improve dynamicPartitionPruning
> ---
>
> Key: SPARK-32628
> URL: https://issues.apache.org/jira/browse/SPARK-32628
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> It will throw an exception when 
> {{spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly}} is 
> disabled:
> {code:sql}
> select catalog_sales.* from  catalog_sales join catalog_returns  where 
> cr_order_number = cs_sold_date_sk and cr_returned_time_sk < 4;
> {code}
> {noformat}
> 20/08/16 06:44:42 ERROR TaskSetManager: Total size of serialized results of 
> 494 tasks (1225.3 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
> {noformat}
> We can improve it with minimum, maximum and Bloom filter to reduce serialized 
> results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32628) Use bloom filter to improve dynamicPartitionPruning

2020-08-16 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-32628:
---

 Summary: Use bloom filter to improve dynamicPartitionPruning
 Key: SPARK-32628
 URL: https://issues.apache.org/jira/browse/SPARK-32628
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


It will throw an exception when 
{{spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly}} is disabled:
{code:sql}
select catalog_sales.* from  catalog_sales join catalog_returns  where 
cr_order_number = cs_sold_date_sk and cr_returned_time_sk < 4;
{code}
{noformat}
20/08/16 06:44:42 ERROR TaskSetManager: Total size of serialized results of 494 
tasks (1225.3 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
{noformat}

We can improve it with minimum, maximum and bloom filter to reduce serialized 
results.
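As a rough illustration of the idea only (not Spark's actual implementation), pruning with a min/max range plus a Bloom filter lets the driver ship a small fixed-size summary of the join keys instead of the full set of values; probe-side keys that fall outside the range or miss the filter can be skipped, at the cost of occasional false positives. The key values below are made up:

{code:python}
import hashlib

class BloomFilter(object):
    """Tiny illustrative Bloom filter (not Spark's implementation)."""
    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, value):
        for seed in range(self.num_hashes):
            digest = hashlib.md5(("%d:%s" % (seed, value)).encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all(self.bits & (1 << pos) for pos in self._positions(value))

# Build the summary from the filtered build side (e.g. the catalog_returns keys).
build_keys = [3, 7, 42]
lo, hi, bloom = min(build_keys), max(build_keys), BloomFilter()
for k in build_keys:
    bloom.add(k)

# Prune probe-side partition values: only keys passing the cheap checks are kept.
probe_partitions = [1, 3, 7, 99, 42]
kept = [k for k in probe_partitions if lo <= k <= hi and bloom.might_contain(k)]
print(kept)  # [3, 7, 42] here; false positives are possible, false negatives are not
{code}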



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-16 Thread karl wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178477#comment-17178477
 ] 

karl wang commented on SPARK-32542:
---

ok

> Add an optimizer rule to split an Expand into multiple Expands for aggregates
> -
>
> Key: SPARK-32542
> URL: https://issues.apache.org/jira/browse/SPARK-32542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: karl wang
>Priority: Major
>
> Split an Expand into several small Expands, each containing the specified 
> number of projections.
> For instance, take this SQL: select a, b, c, d, count(1) from table1 group by 
> a, b, c, d with cube. It can expand the data to 2^4 times its original size.
> If we specify spark.sql.optimizer.projections.size=4, the Expand will be 
> split into 2^4/4 small Expands. This can reduce the shuffle pressure and 
> improve performance in multidimensional analysis when data is huge.
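For illustration, the kind of query this targets: a cube over four columns produces 2^4 grouping sets in a single Expand. Note that spark.sql.optimizer.projections.size is the knob proposed in this JIRA, not an existing Spark configuration, and table1 is assumed to exist:

{code:python}
# Proposed knob from this JIRA (hypothetical until the change is merged):
spark.conf.set("spark.sql.optimizer.projections.size", "4")

# A 4-column cube expands each input row into 2^4 = 16 grouping-set rows.
spark.sql("""
    SELECT a, b, c, d, count(1)
    FROM table1
    GROUP BY a, b, c, d WITH CUBE
""").explain()
{code}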



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-16 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178475#comment-17178475
 ] 

Takeshi Yamamuro commented on SPARK-32542:
--

I unset the target/fix version. Please do not set them because they are 
reserved for committers. 

> Add an optimizer rule to split an Expand into multiple Expands for aggregates
> -
>
> Key: SPARK-32542
> URL: https://issues.apache.org/jira/browse/SPARK-32542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: karl wang
>Priority: Major
>
> Split an Expand into several small Expands, each containing the specified 
> number of projections.
> For instance, take this SQL: select a, b, c, d, count(1) from table1 group by 
> a, b, c, d with cube. It can expand the data to 2^4 times its original size.
> If we specify spark.sql.optimizer.projections.size=4, the Expand will be 
> split into 2^4/4 small Expands. This can reduce the shuffle pressure and 
> improve performance in multidimensional analysis when data is huge.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-16 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32542:
-
Fix Version/s: (was: 3.0.0)

> Add an optimizer rule to split an Expand into multiple Expands for aggregates
> -
>
> Key: SPARK-32542
> URL: https://issues.apache.org/jira/browse/SPARK-32542
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.0.0
>Reporter: karl wang
>Priority: Major
>
> Split a single Expand into several smaller Expands, each containing the 
> specified number of projections.
> For instance, take this SQL: select a, b, c, d, count(1) from table1 group by 
> a, b, c, d with cube. It can expand the data to 2^4 times its original size.
> If we set spark.sql.optimizer.projections.size=4, the Expand will be split 
> into 2^4/4 = 4 smaller Expands. This can reduce the shuffle pressure and 
> improve performance in multidimensional analysis when the data is huge.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-16 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32542:
-
Component/s: (was: Optimizer)
 SQL

> Add an optimizer rule to split an Expand into multiple Expands for aggregates
> -
>
> Key: SPARK-32542
> URL: https://issues.apache.org/jira/browse/SPARK-32542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: karl wang
>Priority: Major
>
> Split a single Expand into several smaller Expands, each containing the 
> specified number of projections.
> For instance, take this SQL: select a, b, c, d, count(1) from table1 group by 
> a, b, c, d with cube. It can expand the data to 2^4 times its original size.
> If we set spark.sql.optimizer.projections.size=4, the Expand will be split 
> into 2^4/4 = 4 smaller Expands. This can reduce the shuffle pressure and 
> improve performance in multidimensional analysis when the data is huge.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-16 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32542:
-
Target Version/s:   (was: 3.0.0)

> Add an optimizer rule to split an Expand into multiple Expands for aggregates
> -
>
> Key: SPARK-32542
> URL: https://issues.apache.org/jira/browse/SPARK-32542
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.0.0
>Reporter: karl wang
>Priority: Major
>
> Split a single Expand into several smaller Expands, each containing the 
> specified number of projections.
> For instance, take this SQL: select a, b, c, d, count(1) from table1 group by 
> a, b, c, d with cube. It can expand the data to 2^4 times its original size.
> If we set spark.sql.optimizer.projections.size=4, the Expand will be split 
> into 2^4/4 = 4 smaller Expands. This can reduce the shuffle pressure and 
> improve performance in multidimensional analysis when the data is huge.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-16 Thread Jarred Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178445#comment-17178445
 ] 

Jarred Li commented on SPARK-32582:
---

??I am not sure it would be helpful since there is no API in Hadoop to list 
partial files in a folder.??



We don't need to list all the partitions in a table. The "sample" here means we 
sample some of the partitions, not all of them. Then, at the partition level, 
we can list all the files in that sampled folder. 
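
A rough sketch of that sampling idea (a hypothetical standalone helper with a 
made-up table path, assuming a single partition level; this is not the existing 
OrcUtils code): one listing call at the table root returns the partition 
directories, a few of them are sampled, and files are listed only inside those, 
so a file for schema inference is found without listing the whole table.
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.util.Random

object SamplePartitionFiles {
  def sampleFiles(tableDir: String, numPartitions: Int = 3): Seq[Path] = {
    val root = new Path(tableDir)
    val fs: FileSystem = root.getFileSystem(new Configuration())

    // One listing call at the table level: returns the partition
    // directories, not a full recursive file listing.
    val partitionDirs = fs.listStatus(root).filter(_.isDirectory).map(_.getPath)

    // Sample a handful of partitions and list files only inside them.
    val sampled = Random.shuffle(partitionDirs.toSeq).take(numPartitions)
    sampled.flatMap(dir => fs.listStatus(dir).filter(_.isFile).map(_.getPath).toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Schema inference would then read the footer of just one of these files.
    sampleFiles("hdfs:///warehouse/db.db/some_table").take(5).foreach(println)
  }
}
{code}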

 

> Spark SQL Infer Schema Performance
> --
>
> Key: SPARK-32582
> URL: https://issues.apache.org/jira/browse/SPARK-32582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Jarred Li
>Priority: Major
>
> When schema inference is enabled, Spark tries to list all the files in the 
> table, even though only one of the files is used to read the schema 
> information. Performance suffers from listing every file in the table when 
> the number of partitions is large.
>  
> See the code in 
> "[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L88|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]":
>  all the files in the table are passed in, but only one file's schema is used 
> to infer the schema.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32125) [UI] Support get taskList by status in Web UI and SHS Rest API

2020-08-16 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-32125:

Fix Version/s: 3.1.0

> [UI] Support get taskList by status in Web UI and SHS Rest API
> --
>
> Key: SPARK-32125
> URL: https://issues.apache.org/jira/browse/SPARK-32125
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
> Fix For: 3.1.0
>
>
> Support fetching the taskList filtered by status, as below:
> /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed
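
For reference, a quick way to exercise the endpoint once this is available 
(host, port, and application/stage IDs below are placeholders; /api/v1 is the 
monitoring REST API base, here against a History Server on its default port):
{code:scala}
import scala.io.Source

object TaskListByStatus {
  def main(args: Array[String]): Unit = {
    // Hypothetical invocation: fetch only the failed tasks of stage 3, attempt 0.
    val url = "http://localhost:18080/api/v1/applications/app-20200816-0001/" +
      "stages/3/0/taskList?status=failed"
    val json = Source.fromURL(url).mkString  // JSON array of matching tasks
    println(json)
  }
}
{code}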



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org