[jira] [Created] (SPARK-32615) AQE aggregateMetrics java.util.NoSuchElementException
Leanken.Lin created SPARK-32615: --- Summary: AQE aggregateMetrics java.util.NoSuchElementException Key: SPARK-32615 URL: https://issues.apache.org/jira/browse/SPARK-32615 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Leanken.Lin

Reproduce step:
{code:java}
// code placeholder
sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys"
{code}

{code:java}
// code placeholder
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker
java.util.NoSuchElementException: key not found: 12
 at scala.collection.immutable.Map$Map1.apply(Map.scala:114)
 at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
 at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
 at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
 at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
 at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
 at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
 at scala.collection.TraversableLike.map(TraversableLike.scala:238)
 at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
 at scala.collection.AbstractTraversable.map(Traversable.scala:108)
 at org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
 at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at org.apache.spark.util.Utils$.tryLog(Utils.scala:1971)
 at org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
[info] - SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 milliseconds)
{code}

This issue arises because, during AQE, the metrics update is overwritten when a sub-plan changes. In this UT, for example, a BroadcastHashJoinExec is replaced by a LocalTableScanExec, and the onExecutionEnd action then tries to aggregate all metrics recorded during the execution, which raises the NoSuchElementException.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
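Below is a minimal, self-contained sketch (not the actual Spark patch) of the failure mode and of a defensive lookup that avoids it. The map names are illustrative stand-ins for the per-metric structures used inside SQLAppStatusListener.aggregateMetrics, and the accumulator ids are made up to mirror the "key not found: 12" trace above.
{code:scala}
// Sketch only: reproduces the lookup pattern that throws when aggregated task
// metrics reference an accumulator id that the final (adaptively replanned)
// physical plan no longer declares, and shows a tolerant alternative.
object AggregateMetricsSketch {
  def main(args: Array[String]): Unit = {
    // Metrics declared by the final plan (e.g. after a BroadcastHashJoinExec was
    // replaced by a LocalTableScanExec during AQE replanning).
    val metricTypes: Map[Long, String] = Map(10L -> "sum", 11L -> "size")
    // Task updates may still carry id 12 from the stage that was planned away.
    val taskMetricValues: Map[Long, Seq[Long]] = Map(10L -> Seq(1L, 2L), 12L -> Seq(5L))

    // Unsafe: metricTypes(12L) throws java.util.NoSuchElementException.
    // val aggregated = taskMetricValues.map { case (id, vs) => id -> (metricTypes(id), vs.sum) }

    // Tolerant: silently drop metric ids that the final plan does not declare.
    val aggregated = taskMetricValues.flatMap { case (id, vs) =>
      metricTypes.get(id).map(metricType => id -> (metricType, vs.sum))
    }
    println(aggregated) // Map(10 -> (sum,3))
  }
}
{code}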
[jira] [Assigned] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32609: Assignee: Apache Spark > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Assignee: Apache Spark >Priority: Major > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- 
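As a stop-gap (not a fix for the underlying canonicalization problem), disabling exchange reuse keeps the planner from substituting the year-2002 shuffle for the year-2001 branch, at the cost of recomputing the second aggregate. The sketch below assumes the same DSv2 source and table as the repro; everything else is illustrative.
{code:scala}
import org.apache.spark.sql.SparkSession

object ExchangeReuseWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-32609-workaround").getOrCreate()

    // Turn off reuse so each branch of the cross join materializes its own shuffle.
    spark.conf.set("spark.sql.exchange.reuse", "false")

    val df = spark.read
      .format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2")
      .option("table", "tpcds_1G.date_dim")
      .load()
    df.createOrReplaceTempView("date_dim")

    // After this, re-running the repro query from the description should no
    // longer show a ReusedExchange node in explain(); quick sanity check:
    spark.sql("SELECT d_year, d_month_seq FROM date_dim LIMIT 5").show()
  }
}
{code}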
This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177528#comment-17177528 ] Apache Spark commented on SPARK-32609: -- User 'mingjialiu' has created a pull request for this issue: https://github.com/apache/spark/pull/29430 > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Major > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 
2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32609: Assignee: (was: Apache Spark) > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Major > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This message was 
sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32345) SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark client for Spark session
[ https://issues.apache.org/jira/browse/SPARK-32345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-32345. - Resolution: Invalid > SemanticException Failed to get a spark session: > org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark > client for Spark session > -- > > Key: SPARK-32345 > URL: https://issues.apache.org/jira/browse/SPARK-32345 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: 任建亭 >Priority: Blocker > > when using hive on spark engine: > FAILED: SemanticException Failed to get a spark session: > org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark > client for Spark session > hadoop version: 2.7.3 / hive version: 3.1.2 / spark version 3.0.0 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Description: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently for the below piece of code and the given testdata where first row starts with null \u character it will throw the below error. *eg: *val df = spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); df.show(false); *+TestData+* !screenshot-1.png! Internal state when error was thrown: line=1, column=0, record=0, charIndex=7 at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339) at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552) at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160) at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148) at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57) *Note:* Though its the limitation of the univocity parser and the workaround is to provide any other comment character by mentioning .option("comment","#"), but if my actual data starts with this character then the particular row will be discarded. Currently I pushed the code in univocity parser to handle this scenario as part of the below PR https://github.com/uniVocity/univocity-parsers/pull/412 please accept the jira so that we can enable this feature in spark-csv by adding a parameter in spark csvoptions. was: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently for the below piece of code and the given testdata where first row starts with null \u character it will throw the below error. Internal state when error was thrown: line=1, column=0, record=0, charIndex=7 at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339) at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552) at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160) at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148) at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57) eg: val df = spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); df.show(false); *+TestData+* !screenshot-1.png! 
> Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently for the below piece of code and the given testdata where first row > starts with null \u > character it will throw the below error. > *eg: *val df = > spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); > df.show(false); > *+TestData+* > > !screenshot-1.png! > Internal state when error was thrown: line=1, column=0, record=0, charIndex=7 > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339) > at > co
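For reference, the workaround mentioned in the description looks like the sketch below: choosing a comment character other than the default \u0000 lets lines that begin with a NUL byte be parsed as data, but any genuine record beginning with the chosen character would then be dropped. The path and options mirror the example above; the data file itself is not included here.
{code:scala}
import org.apache.spark.sql.SparkSession

object NulPrefixedRecordWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-32614-workaround").master("local[*]").getOrCreate()

    // With the default options, schema inference fails on a first row that starts
    // with \u0000 (see the univocity stack trace above). Overriding "comment"
    // sidesteps that, at the cost of treating '#'-prefixed rows as comments.
    val df = spark.read
      .option("delimiter", ",")
      .option("comment", "#")
      .csv("file:/E:/Data/Testdata.dat")
    df.show(false)
  }
}
{code}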
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Description: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently for the below piece of code and the given testdata where first row starts with null \u character it will throw the below error. Internal state when error was thrown: line=1, column=0, record=0, charIndex=7 at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339) at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552) at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160) at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148) at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57) eg: val df = spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); df.show(false); *+TestData+* !screenshot-1.png! was: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently for the below piece of code and the given testdata where first row starts with null \u character it will throw the eg: val df = spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); df.show(false); *+TestData+* !screenshot-1.png! > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently for the below piece of code and the given testdata where first row > starts with null \u > character it will throw the below error. 
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7 > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339) > at > com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552) > at > org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160) > at > org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148) > at > org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57) > eg: val df = > spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); > df.show(false); > *+TestData+* > > !screenshot-1.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Description: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently for the below piece of code and the given testdata where first row starts with null \u character it will throw the eg: val df = spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); df.show(false); *+TestData+* !screenshot-1.png! was: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently for the below piece of code and the given testdata where first row starts with null \u character eg: val df = spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); df.show(false); *+TestData+* !screenshot-1.png! > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently for the below piece of code and the given testdata where first row > starts with null \u > character it will throw the > eg: val df = > spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); > df.show(false); > *+TestData+* > > !screenshot-1.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Issue Type: Bug (was: New Feature) > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently for the below piece of code and the given testdata where first row > starts with null \u > character it will throw the > eg: val df = > spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); > df.show(false); > *+TestData+* > > !screenshot-1.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Affects Version/s: 2.4.5 > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently for the below piece of code and the given testdata where first row > starts with null \u > character it will throw the > eg: val df = > spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); > df.show(false); > *+TestData+* > > !screenshot-1.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Issue Type: New Feature (was: Improvement) > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently for the below piece of code and the given testdata where first row > starts with null \u > character > eg: val df = > spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); > df.show(false); > *+TestData+* > > !screenshot-1.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Description: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently for the below piece of code and the given testdata where first row starts with null \u character eg: val df = spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); df.show(false); *+TestData+* !screenshot-1.png! was: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently eg: Dataset df = spark.read().option("inferSchema", "true") .option("header", "false") .option("delimiter", ",") .csv("/tmp/delimitedfile.dat) *+TestData+* > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently for the below piece of code and the given testdata where first row > starts with null \u > character > eg: val df = > spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); > df.show(false); > *+TestData+* > > !screenshot-1.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Attachment: screenshot-1.png > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently > eg: Dataset df = spark.read().option("inferSchema", "true") > .option("header", > "false") > > .option("delimiter", ",") > > .csv("/tmp/delimitedfile.dat) > *+TestData+* > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Description: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently eg: Dataset df = spark.read().option("inferSchema", "true") .option("header", "false") .option("delimiter", ",") .csv("/tmp/delimitedfile.dat) *+TestData+* was: Currently, the delimiter option Spark 2.0 to read and split CSV files/data only support a single character delimiter. If we try to provide multiple delimiters, we observer the following error message. eg: Dataset df = spark.read().option("inferSchema", "true") .option("header", "false") .option("delimiter", ", ") .csv("C:\test.txt"); Exception in thread "main" java.lang.IllegalArgumentException: Delimiter cannot be more than one character: , at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) at org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83) at org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) at scala.Option.orElse(Option.scala:289) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) Generally, the data to be processed contains multiple character delimiters and presently we need to do a manual data clean up on the source/input file, which doesn't work well in large applications which consumes numerous files. There seems to be work-around like reading data as text and using the split option, but this in my opinion defeats the purpose, advantage and efficiency of a direct read from CSV file. > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. 
> Currently > eg: Dataset df = spark.read().option("inferSchema", "true") > .option("header", > "false") > > .option("delimiter", ",") > > .csv("/tmp/delimitedfile.dat) > *+TestData+* > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Fix Version/s: (was: 3.0.0) > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Minor > > Currently, the delimiter option Spark 2.0 to read and split CSV files/data > only support a single character delimiter. If we try to provide multiple > delimiters, we observer the following error message. > eg: Dataset df = spark.read().option("inferSchema", "true") > .option("header", > "false") > .option("delimiter", > ", ") > .csv("C:\test.txt"); > Exception in thread "main" java.lang.IllegalArgumentException: Delimiter > cannot be more than one character: , > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) > > Generally, the data to be processed contains multiple character delimiters > and presently we need to do a manual data clean up on the source/input file, > which doesn't work well in large applications which consumes numerous files. > There seems to be work-around like reading data as text and using the split > option, but this in my opinion defeats the purpose, advantage and efficiency > of a direct read from CSV file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Affects Version/s: (was: 2.3.1) 3.0.0 > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > > Currently, the delimiter option Spark 2.0 to read and split CSV files/data > only support a single character delimiter. If we try to provide multiple > delimiters, we observer the following error message. > eg: Dataset df = spark.read().option("inferSchema", "true") > .option("header", > "false") > .option("delimiter", > ", ") > .csv("C:\test.txt"); > Exception in thread "main" java.lang.IllegalArgumentException: Delimiter > cannot be more than one character: , > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) > > Generally, the data to be processed contains multiple character delimiters > and presently we need to do a manual data clean up on the source/input file, > which doesn't work well in large applications which consumes numerous files. > There seems to be work-around like reading data as text and using the split > option, but this in my opinion defeats the purpose, advantage and efficiency > of a direct read from CSV file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Priority: Major (was: Minor) > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > > Currently, the delimiter option Spark 2.0 to read and split CSV files/data > only support a single character delimiter. If we try to provide multiple > delimiters, we observer the following error message. > eg: Dataset df = spark.read().option("inferSchema", "true") > .option("header", > "false") > .option("delimiter", > ", ") > .csv("C:\test.txt"); > Exception in thread "main" java.lang.IllegalArgumentException: Delimiter > cannot be more than one character: , > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) > > Generally, the data to be processed contains multiple character delimiters > and presently we need to do a manual data clean up on the source/input file, > which doesn't work well in large applications which consumes numerous files. > There seems to be work-around like reading data as text and using the split > option, but this in my opinion defeats the purpose, advantage and efficiency > of a direct read from CSV file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Component/s: Spark Core > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > > Currently, the delimiter option Spark 2.0 to read and split CSV files/data > only support a single character delimiter. If we try to provide multiple > delimiters, we observer the following error message. > eg: Dataset df = spark.read().option("inferSchema", "true") > .option("header", > "false") > .option("delimiter", > ", ") > .csv("C:\test.txt"); > Exception in thread "main" java.lang.IllegalArgumentException: Delimiter > cannot be more than one character: , > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) > > Generally, the data to be processed contains multiple character delimiters > and presently we need to do a manual data clean up on the source/input file, > which doesn't work well in large applications which consumes numerous files. > There seems to be work-around like reading data as text and using the split > option, but this in my opinion defeats the purpose, advantage and efficiency > of a direct read from CSV file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
Chandan created SPARK-32614: --- Summary: Support for treating the line as valid record if it starts with \u or null character, or starts with any character mentioned as comment Key: SPARK-32614 URL: https://issues.apache.org/jira/browse/SPARK-32614 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Chandan Assignee: Jeff Evans Fix For: 3.0.0 Currently, the delimiter option Spark 2.0 to read and split CSV files/data only support a single character delimiter. If we try to provide multiple delimiters, we observer the following error message. eg: Dataset df = spark.read().option("inferSchema", "true") .option("header", "false") .option("delimiter", ", ") .csv("C:\test.txt"); Exception in thread "main" java.lang.IllegalArgumentException: Delimiter cannot be more than one character: , at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) at org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83) at org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) at scala.Option.orElse(Option.scala:289) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) Generally, the data to be processed contains multiple character delimiters and presently we need to do a manual data clean up on the source/input file, which doesn't work well in large applications which consumes numerous files. There seems to be work-around like reading data as text and using the split option, but this in my opinion defeats the purpose, advantage and efficiency of a direct read from CSV file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
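The "read data as text and split" workaround mentioned at the end of the description can be sketched as below. It assumes a three-column file separated by the two-character delimiter ", " and gives up CSV features such as quoting, escaping, and schema inference; the file path and column names are placeholders.
{code:scala}
import org.apache.spark.sql.SparkSession

object MultiCharDelimiterWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multi-char-delimiter").master("local[*]").getOrCreate()
    import spark.implicits._

    // Read each line as a plain string, then split on the multi-character delimiter.
    val parsed = spark.read.textFile("C:/test.txt")
      .map { line =>
        val fields = line.split(", ", -1).padTo(3, "") // -1 keeps trailing empty fields
        (fields(0), fields(1), fields(2))
      }
      .toDF("col1", "col2", "col3")

    parsed.show(false)
  }
}
{code}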
[jira] [Commented] (SPARK-32345) SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark client for Spark session
[ https://issues.apache.org/jira/browse/SPARK-32345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177468#comment-17177468 ] ZhouDaHong commented on SPARK-32345: If a version conflict has been ruled out as the cause, you can look at queue resources. If queue resource usage reaches 100% and no free task resources are released for creating the Spark session within a short time, the task will fail and this exception will be thrown. Solution: increase the Hive client connection timeout to 15 minutes; set hive.spark.client.server.connect.timeout=90; > SemanticException Failed to get a spark session: > org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark > client for Spark session > -- > > Key: SPARK-32345 > URL: https://issues.apache.org/jira/browse/SPARK-32345 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: 任建亭 >Priority: Blocker > > when using hive on spark engine: > FAILED: SemanticException Failed to get a spark session: > org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark > client for Spark session > hadoop version: 2.7.3 / hive version: 3.1.2 / spark version 3.0.0 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32357) Publish failed and succeeded test reports in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-32357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32357: -- Summary: Publish failed and succeeded test reports in GitHub Actions (was: Investigate test result reporter integration) > Publish failed and succeeded test reports in GitHub Actions > --- > > Key: SPARK-32357 > URL: https://issues.apache.org/jira/browse/SPARK-32357 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > Currently, the readability in the logs are not really good. For example, see > https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z&urlSigningMethod=HMACV1&urlSignature=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D > Maybe we should have a way to report the results in an easy way to read. For > example, Jenkins test report-like feature. > We should maybe also take a look for > https://github.com/check-run-reporter/action. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32357) Investigate test result reporter integration
[ https://issues.apache.org/jira/browse/SPARK-32357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32357: - Assignee: Hyukjin Kwon > Investigate test result reporter integration > > > Key: SPARK-32357 > URL: https://issues.apache.org/jira/browse/SPARK-32357 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Currently, the readability in the logs are not really good. For example, see > https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z&urlSigningMethod=HMACV1&urlSignature=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D > Maybe we should have a way to report the results in an easy way to read. For > example, Jenkins test report-like feature. > We should maybe also take a look for > https://github.com/check-run-reporter/action. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32357) Investigate test result reporter integration
[ https://issues.apache.org/jira/browse/SPARK-32357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32357. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29333 [https://github.com/apache/spark/pull/29333] > Investigate test result reporter integration > > > Key: SPARK-32357 > URL: https://issues.apache.org/jira/browse/SPARK-32357 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > Currently, the readability in the logs are not really good. For example, see > https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z&urlSigningMethod=HMACV1&urlSignature=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D > Maybe we should have a way to report the results in an easy way to read. For > example, Jenkins test report-like feature. > We should maybe also take a look for > https://github.com/check-run-reporter/action. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32587) SPARK SQL writing to JDBC target with bit datatype using Dataframe is writing NULL values
[ https://issues.apache.org/jira/browse/SPARK-32587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177465#comment-17177465 ] ZhouDaHong edited comment on SPARK-32587 at 8/14/20, 3:54 AM: -- Sorry, I don't really understand your problem. Do you mean that data cannot be written to the SQL Server database when there is a null value column? If this is the case, please check the structure of the table to see if the error reporting field is defined as "not null" in the database? was (Author: zdh): Sorry, I don't really understand your problem. Do you mean that data cannot be written to the SQL Server database when there is a null value column? If this is the case, please check the structure of the table to see if the error reporting field is defined as "not null" in the database? > SPARK SQL writing to JDBC target with bit datatype using Dataframe is writing > NULL values > - > > Key: SPARK-32587 > URL: https://issues.apache.org/jira/browse/SPARK-32587 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Mohit Dave >Priority: Major > > While writing to a target in SQL Server using Microsoft's SQL Server driver > using dataframe.write API the target is storing NULL values for BIT columns. > > Table definition > Azure SQL DB > 1)Create 2 tables with column type as bit > 2)Insert some record into 1 table > Create a SPARK job > 1)Create a Dataframe using spark.read with the following query > select from > 2)Write the dataframe to a target table with bit type as column. > > Observation : Bit type is getting converted to NULL at the target > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32587) SPARK SQL writing to JDBC target with bit datatype using Dataframe is writing NULL values
[ https://issues.apache.org/jira/browse/SPARK-32587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177465#comment-17177465 ] ZhouDaHong edited comment on SPARK-32587 at 8/14/20, 3:54 AM: -- Sorry, I don't really understand your problem. Do you mean that data cannot be written to the SQL Server database when there is a null value column? If this is the case, please check the structure of the table to see if the error reporting field is defined as "not null" in the database? was (Author: zdh): Sorry, I don't really understand your problem. Do you mean that data cannot be written to the SQL Server database when there is a null value column? If this is the case, please check the structure of the table to be written to and see whether the field in the error is defined as "not null" in the database? > SPARK SQL writing to JDBC target with bit datatype using Dataframe is writing > NULL values > - > > Key: SPARK-32587 > URL: https://issues.apache.org/jira/browse/SPARK-32587 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Mohit Dave >Priority: Major > > While writing to a target in SQL Server using Microsoft's SQL Server driver > using dataframe.write API the target is storing NULL values for BIT columns. > > Table definition > Azure SQL DB > 1)Create 2 tables with column type as bit > 2)Insert some record into 1 table > Create a SPARK job > 1)Create a Dataframe using spark.read with the following query > select from > 2)Write the dataframe to a target table with bit type as column. > > Observation : Bit type is getting converted to NULL at the target > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32587) SPARK SQL writing to JDBC target with bit datatype using Dataframe is writing NULL values
[ https://issues.apache.org/jira/browse/SPARK-32587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177465#comment-17177465 ] ZhouDaHong commented on SPARK-32587: Sorry, I don't really understand your problem. Do you mean that data cannot be written to the SQL Server database when there is a null value column? If this is the case, please check the structure of the table to be written to and see whether the field in the error is defined as "not null" in the database? > SPARK SQL writing to JDBC target with bit datatype using Dataframe is writing > NULL values > - > > Key: SPARK-32587 > URL: https://issues.apache.org/jira/browse/SPARK-32587 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Mohit Dave >Priority: Major > > While writing to a target in SQL Server using Microsoft's SQL Server driver > using dataframe.write API the target is storing NULL values for BIT columns. > > Table definition > Azure SQL DB > 1)Create 2 tables with column type as bit > 2)Insert some record into 1 table > Create a SPARK job > 1)Create a Dataframe using spark.read with the following query > select from > 2)Write the dataframe to a target table with bit type as column. > > Observation : Bit type is getting converted to NULL at the target > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type
[ https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177461#comment-17177461 ] ZhouDaHong commented on SPARK-21774: Hello, in your SQL you are comparing the value of a string-typed field with the literal 0. Because the data types differ (the 0 may be interpreted as a boolean or as an int), the SQL statement [ select a, B from TB where a = 0 ] cannot return the result you expect. It is suggested to change it to [ select a, B from TB where a = '0' ], as illustrated below. > The rule PromoteStrings cast string to a wrong data type > > > Key: SPARK-21774 > URL: https://issues.apache.org/jira/browse/SPARK-21774 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: StanZhai >Priority: Critical > Labels: correctness > > Data > {code} > create temporary view tb as select * from values > ("0", 1), > ("-0.1", 2), > ("1", 3) > as grouping(a, b) > {code} > SQL: > {code} > select a, b from tb where a=0 > {code} > The result which is wrong: > {code} > ++---+ > | a| b| > ++---+ > | 0| 1| > |-0.1| 2| > ++---+ > {code} > Logical Plan: > {code} > == Parsed Logical Plan == > 'Project ['a] > +- 'Filter ('a = 0) >+- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > a: string > Project [a#8528] > +- Filter (cast(a#8528 as int) = 0) >+- SubqueryAlias src > +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529] > +- LocalRelation [_1#8525, _2#8526] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
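The report's data and both forms of the predicate can be tried directly in spark-shell; a short sketch (behaviour depends on the Spark version):
{code:scala}
// Data from the report above.
spark.sql("""create temporary view tb as select * from values
  ("0", 1), ("-0.1", 2), ("1", 3) as grouping(a, b)""")

// Implicit promotion: the analyzed plan becomes Filter (cast(a as int) = 0),
// which is what produces the surprising extra row described in the ticket.
spark.sql("select a, b from tb where a = 0").show()

// Suggested rewrite: compare the string column against a string literal.
spark.sql("select a, b from tb where a = '0'").show()
{code}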
[jira] [Commented] (SPARK-9182) filter and groupBy on DataFrames are not passed through to jdbc source
[ https://issues.apache.org/jira/browse/SPARK-9182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177455#comment-17177455 ] ZhouDaHong commented on SPARK-9182: --- Hello, it seems the problem is that the "Sal" field is of a numeric type, but in the actual SQL processing it is impossible to match that numeric value with anything other than an equality comparison. Try changing the "Sal" field to int or double. > filter and groupBy on DataFrames are not passed through to jdbc source > -- > > Key: SPARK-9182 > URL: https://issues.apache.org/jira/browse/SPARK-9182 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Greg Rahn >Assignee: Yijie Shen >Priority: Critical > > When running all of these API calls, the only one that passes the filter > through to the backend jdbc source is equality. All filters in these > commands should be able to be passed through to the jdbc database source. > {code} > val url="jdbc:postgresql:grahn" > val prop = new java.util.Properties > val emp = sqlContext.read.jdbc(url, "emp", prop) > emp.filter(emp("sal") === 5000).show() > emp.filter(emp("sal") < 5000).show() > emp.filter("sal = 3000").show() > emp.filter("sal > 2500").show() > emp.filter("sal >= 2500").show() > emp.filter("sal < 2500").show() > emp.filter("sal <= 2500").show() > emp.filter("sal != 3000").show() > emp.filter("sal between 3000 and 5000").show() > emp.filter("ename in ('SCOTT','BLAKE')").show() > {code} > We see from the PostgreSQL query log the following is run, and see that only > equality predicates are passed through. > {code} > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp WHERE > sal = 5000 > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp WHERE > sal = 3000 > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
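On more recent Spark versions, which predicates actually reach the JDBC source can also be checked from the Spark side by looking at the physical plan instead of the database log; a sketch reusing the connection details from the report (the exact explain output differs between versions):
{code:scala}
val url = "jdbc:postgresql:grahn"            // connection details as in the report
val prop = new java.util.Properties
val emp = spark.read.jdbc(url, "emp", prop)

// The JDBC scan node in the physical plan lists the predicates it received,
// e.g. "PushedFilters: [IsNotNull(sal), GreaterThan(sal,2500)]".
emp.filter("sal > 2500").explain()
{code}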
[jira] [Updated] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32609: - Shepherd: (was: Reynold Xin) > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Major > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This message was 
sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32604) Bug in ALSModel Python Documentation
[ https://issues.apache.org/jira/browse/SPARK-32604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177439#comment-17177439 ] Hyukjin Kwon commented on SPARK-32604: -- Oh right. So I guess it will be fixed in Spark 3.1 correctly? > Bug in ALSModel Python Documentation > > > Key: SPARK-32604 > URL: https://issues.apache.org/jira/browse/SPARK-32604 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.0, 3.0.0 >Reporter: Zach Cahoone >Priority: Minor > > In the ALSModel documentation > ([https://spark.apache.org/docs/latest/ml-collaborative-filtering.html]), > there is a bug which causes data frame creation to fail with the following > error: > {code:java} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 > (TID 15, 10.0.0.133, executor 10): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, > in main > process() > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, > in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 390, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/usr/lib/spark/python/pyspark/rdd.py", line 1354, in takeUpToNumLeft > yield next(iterator) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in > wrapper > return f(*args, **kwargs) > File "", line 24, in > NameError: name 'long' is not defined > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) 
> at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.
[jira] [Commented] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177437#comment-17177437 ] Hyukjin Kwon commented on SPARK-32053: -- Seems like a Hadoop-on-Windows issue, judging from the log: {code} Caused by: java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\Administrator\AppData\Roaming\IBM Watson Studio\projects\librarydAYYEkp5bh6TSjn106hb\metadata_temporary\0_temporary\attempt_20200813063114_0049_m_00_0\part-0 at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773) at org.apache.hadoop.util.Shell.execCommand(Shell.java:869) at org.apache.hadoop.util.Shell.execCommand(Shell.java:852) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733) at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225) at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209) at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:3 {code} > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png, > image-2020-08-13-20-28-28-779.png, image-2020-08-13-20-29-40-555.png, > screenshot-1.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32053: - Affects Version/s: 3.0.0 > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0, 3.0.0 >Reporter: Kayal >Priority: Minor > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png, > image-2020-08-13-20-28-28-779.png, image-2020-08-13-20-29-40-555.png, > screenshot-1.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32053: - Component/s: Windows > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark, Windows >Affects Versions: 2.3.0, 3.0.0 >Reporter: Kayal >Priority: Minor > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png, > image-2020-08-13-20-28-28-779.png, image-2020-08-13-20-29-40-555.png, > screenshot-1.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32053: - Priority: Minor (was: Blocker) > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Minor > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png, > image-2020-08-13-20-28-28-779.png, image-2020-08-13-20-29-40-555.png, > screenshot-1.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31947) Solve string value error about Date/Timestamp in ScriptTransform
[ https://issues.apache.org/jira/browse/SPARK-31947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177436#comment-17177436 ] Hyukjin Kwon commented on SPARK-31947: -- [~angerszhuuu] can you update which PR fixed this JIRA? > Solve string value error about Date/Timestamp in ScriptTransform > > > Key: SPARK-31947 > URL: https://issues.apache.org/jira/browse/SPARK-31947 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major > > For test case > > {code:java} > test("SPARK-25990: TRANSFORM should handle different data types correctly") { > assume(TestUtils.testCommandAvailable("python")) > val scriptFilePath = getTestResourcePath("test_script.py") > withTempView("v") { > val df = Seq( > (1, "1", 1.0, BigDecimal(1.0), new Timestamp(1), > Date.valueOf("2015-05-21")), > (2, "2", 2.0, BigDecimal(2.0), new Timestamp(2), > Date.valueOf("2015-05-22")), > (3, "3", 3.0, BigDecimal(3.0), new Timestamp(3), > Date.valueOf("2015-05-23")) > ).toDF("a", "b", "c", "d", "e", "f") // Note column d's data type is > Decimal(38, 18) > df.createTempView("v") val query = sql( > s""" >|SELECT >|TRANSFORM(a, b, c, d, e, f) >|USING 'python $scriptFilePath' AS (a, b, c, d, e, f) >|FROM v > """.stripMargin) val decimalToString: Column => Column = c => > c.cast("string") checkAnswer(query, identity, df.select( > 'a.cast("string"), > 'b.cast("string"), > 'c.cast("string"), > decimalToString('d), > 'e.cast("string"), > 'f.cast("string")).collect()) > } > } > {code} > > > Get wrong result > {code:java} > [info] - SPARK-25990: TRANSFORM should handle different data types correctly > *** FAILED *** (4 seconds, 997 milliseconds) > [info] Results do not match for Spark plan: > [info]ScriptTransformation [a#19, b#20, c#21, d#22, e#23, f#24], python > /Users/angerszhu/Documents/project/AngersZhu/spark/sql/core/target/scala-2.12/test-classes/test_script.py, > [a#31, b#32, c#33, d#34, e#35, f#36], > org.apache.spark.sql.execution.script.ScriptTransformIOSchema@1ad5a29c > [info] +- Project [_1#6 AS a#19, _2#7 AS b#20, _3#8 AS c#21, _4#9 AS d#22, > _5#10 AS e#23, _6#11 AS f#24] > [info] +- LocalTableScan [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11] > [info] > [info] > [info]== Results == > [info]!== Expected Answer - 3 == > == Actual Answer - 3 == > [info] ![1,1,1.0,1.00,1970-01-01 08:00:00.001,2015-05-21] > [1,1,1.0,1.00,1000,16576] > [info] ![2,2,2.0,2.00,1970-01-01 08:00:00.002,2015-05-22] > [2,2,2.0,2.00,2000,16577] > [info] ![3,3,3.0,3.00,1970-01-01 08:00:00.003,2015-05-23] > [3,3,3.0,3.00,3000,16578] (SparkPlanTest.scala:95) > [ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors
[ https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-12312: Assignee: Gabor Somogyi > JDBC connection to Kerberos secured databases fails on remote executors > --- > > Key: SPARK-12312 > URL: https://issues.apache.org/jira/browse/SPARK-12312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 2.4.2 >Reporter: nabacg >Assignee: Gabor Somogyi >Priority: Minor > > When loading DataFrames from JDBC datasource with Kerberos authentication, > remote executors (yarn-client/cluster etc. modes) fail to establish a > connection due to lack of Kerberos ticket or ability to generate it. > This is a real issue when trying to ingest data from kerberized data sources > (SQL Server, Oracle) in enterprise environment where exposing simple > authentication access is not an option due to IT policy issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32598) Not able to see driver logs in spark history server in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-32598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177429#comment-17177429 ] Lantao Jin commented on SPARK-32598: [~sriramgr] A pull request is welcome. Please submit it against the master branch. Thanks. > Not able to see driver logs in spark history server in standalone mode > -- > > Key: SPARK-32598 > URL: https://issues.apache.org/jira/browse/SPARK-32598 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.4.4 >Reporter: Sriram Ganesh >Priority: Major > Attachments: image-2020-08-12-11-50-01-899.png > > > Driver logs are not coming in history server in spark standalone mode. > Checked in the spark events logs it is not there. Is this by design or can I > fix it by creating a patch?. Not able to see any proper documentation > regarding this. > > !image-2020-08-12-11-50-01-899.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32608) Script Transform DELIMIT value error
[ https://issues.apache.org/jira/browse/SPARK-32608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32608: Assignee: (was: Apache Spark) > Script Transform DELIMIT value error > - > > Key: SPARK-32608 > URL: https://issues.apache.org/jira/browse/SPARK-32608 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > For SQL > > {code:java} > SELECT TRANSFORM(a, b, c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LINES TERMINATED BY '\n' > NULL DEFINED AS 'null' > USING 'cat' AS (a, b, c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LINES TERMINATED BY '\n' > NULL DEFINED AS 'NULL' > FROM testData > {code} > The correct > TOK_TABLEROWFORMATFIELD should be , nut actually ',' > TOK_TABLEROWFORMATLINES should be \n but actually '\n' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32608) Script Transform DELIMIT value error
[ https://issues.apache.org/jira/browse/SPARK-32608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32608: Assignee: Apache Spark > Script Transform DELIMIT value error > - > > Key: SPARK-32608 > URL: https://issues.apache.org/jira/browse/SPARK-32608 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > For SQL > > {code:java} > SELECT TRANSFORM(a, b, c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LINES TERMINATED BY '\n' > NULL DEFINED AS 'null' > USING 'cat' AS (a, b, c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LINES TERMINATED BY '\n' > NULL DEFINED AS 'NULL' > FROM testData > {code} > The correct > TOK_TABLEROWFORMATFIELD should be , nut actually ',' > TOK_TABLEROWFORMATLINES should be \n but actually '\n' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32608) Script Transform DELIMIT value error
[ https://issues.apache.org/jira/browse/SPARK-32608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177420#comment-17177420 ] Apache Spark commented on SPARK-32608: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/29428 > Script Transform DELIMIT value error > - > > Key: SPARK-32608 > URL: https://issues.apache.org/jira/browse/SPARK-32608 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > For SQL > > {code:java} > SELECT TRANSFORM(a, b, c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LINES TERMINATED BY '\n' > NULL DEFINED AS 'null' > USING 'cat' AS (a, b, c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LINES TERMINATED BY '\n' > NULL DEFINED AS 'NULL' > FROM testData > {code} > The correct > TOK_TABLEROWFORMATFIELD should be , nut actually ',' > TOK_TABLEROWFORMATLINES should be \n but actually '\n' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177401#comment-17177401 ] Dustin Smith edited comment on SPARK-32046 at 8/14/20, 12:51 AM: - [~maropu] the definition of current timestamp is as follows: {code:java} current_timestamp() - Returns the current timestamp at the start of query evaluation. {code} The question when does a query evaluation start and stop? Do mutual exclusive dataframes being processed consist of the same query evaluation? If yes, then current timestamp's behavior in spark shell is correct; however, as user, that would be extremely undesirable behavior. I would rather cache the current timestamp and call it again for a new time. Now if a query evaluation stops once it is executed and starts anew when another dataframe or action is called, then the behavior in shell and notebooks are incorrect. The notebooks are only correct for a few runs and then default to not changing. [https://spark.apache.org/docs/2.3.0/api/sql/index.html#current_timestamp] Additionally, whatever behavior is correct or should be correct is not consistent and more robust testing should occur in my opinion. As an after thought, the name current timestamp doesn't make sense if the time is supposed to freeze after one call. Really it is current timestamp once and beyond that call it is no longer current. was (Author: dustin.smith.tdg): [~maropu] the definition of current timestamp is as follows: {code:java} current_timestamp() - Returns the current timestamp at the start of query evaluation. {code} The question when does a query evaluation start and stop? Do mutual exclusive dataframes being processed consist of the same query evaluation? If yes, then current timestamp's behavior in spark shell is correct; however, as user, that would be extremely undesirable behavior. I would rather cache the current timestamp and call it again for a new time. Now if a query evaluation stops once it is executed and starts anew when another dataframe or action is called, then the behavior in shell and notebooks are incorrect. The notebooks are only correct for a few runs and then default to not changing. [https://spark.apache.org/docs/2.3.0/api/sql/index.html#current_timestamp] Additionally, whatever behavior is correct or should be correct is not consistent and more robust testing should occur in my opinion. > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. 
> Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. > > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177401#comment-17177401 ] Dustin Smith edited comment on SPARK-32046 at 8/14/20, 12:49 AM: - [~maropu] the definition of current timestamp is as follows: {code:java} current_timestamp() - Returns the current timestamp at the start of query evaluation. {code} The question when does a query evaluation start and stop? Do mutual exclusive dataframes being processed consist of the same query evaluation? If yes, then current timestamp's behavior in spark shell is correct; however, as user, that would be extremely undesirable behavior. I would rather cache the current timestamp and call it again for a new time. Now if a query evaluation stops once it is executed and starts anew when another dataframe or action is called, then the behavior in shell and notebooks are incorrect. The notebooks are only correct for a few runs and then default to not changing. [https://spark.apache.org/docs/2.3.0/api/sql/index.html#current_timestamp] Additionally, whatever behavior is correct or should be correct is not consistent and more robust testing should occur in my opinion. was (Author: dustin.smith.tdg): [~maropu] the definition of current timestamp is as follows: {code:java} current_timestamp() - Returns the current timestamp at the start of query evaluation. {code} The question when does a query evaluation start and stop? Do mutual exclusive dataframes being processed consist of the same query evaluation? If yes, then current timestamp's behavior in spark shell is correct; however, as user, that would be extremely undesirable behavior. I would rather cache the current timestamp and call it again for a new time. Now if a query evaluation stops once it is executed and starts anew when another dataframe or action is called, then the behavior in shell and notebooks are incorrect. The notebooks are only correct for a few runs and then default to not changing. [https://spark.apache.org/docs/2.3.0/api/sql/index.html#current_timestamp] > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. > Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. 
> > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177401#comment-17177401 ] Dustin Smith commented on SPARK-32046: -- [~maropu] the definition of current timestamp is as follows: {code:java} current_timestamp() - Returns the current timestamp at the start of query evaluation. {code} The question when does a query evaluation start and stop? Do mutual exclusive dataframes being processed consist of the same query evaluation? If yes, then current timestamp's behavior in spark shell is correct; however, as user, that would be extremely undesirable behavior. I would rather cache the current timestamp and call it again for a new time. Now if a query evaluation stops once it is executed and starts anew when another dataframe or action is called, then the behavior in shell and notebooks are incorrect. The notebooks are only correct for a few runs and then default to not changing. [https://spark.apache.org/docs/2.3.0/api/sql/index.html#current_timestamp] > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. > Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. > > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
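If the requirement is simply a per-dataframe value that never moves, one option (a sketch only, not something proposed in the ticket) is to capture the timestamp once on the driver and embed it as a literal, so the value cannot change even if the cached plan is re-evaluated:
{code:scala}
import java.sql.Timestamp
import org.apache.spark.sql.functions.lit

// Taken once on the driver; the plan then carries a constant, not an expression.
val fixedTs = new Timestamp(System.currentTimeMillis())
val df1 = spark.range(1).select(lit(fixedTs) as "datetime")
df1.cache()
df1.count()
df1.show(false)
{code}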
[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177398#comment-17177398 ] Takeshi Yamamuro commented on SPARK-32046: -- This issue depends on ZP or Jupyter? If so, I think it is hard to reproduce this issue... Non-deterministic exprs can change output if cache broken in the case. So, I think it is not a good idea for applications to depend on those kinds of values, either way. > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. > Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. > > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32613) DecommissionWorkerSuite has started failing sporadically again
[ https://issues.apache.org/jira/browse/SPARK-32613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32613: Assignee: Apache Spark > DecommissionWorkerSuite has started failing sporadically again > -- > > Key: SPARK-32613 > URL: https://issues.apache.org/jira/browse/SPARK-32613 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Devesh Agrawal >Assignee: Apache Spark >Priority: Major > Fix For: 3.1.0 > > > Test "decommission workers ensure that fetch failures lead to rerun" is > failing: > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127357/testReport/org.apache.spark.deploy/DecommissionWorkerSuite/decommission_workers_ensure_that_fetch_failures_lead_to_rerun/] > https://github.com/apache/spark/pull/29367/checks?check_run_id=972990200#step:14:13579 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32613) DecommissionWorkerSuite has started failing sporadically again
[ https://issues.apache.org/jira/browse/SPARK-32613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32613: Assignee: (was: Apache Spark) > DecommissionWorkerSuite has started failing sporadically again > -- > > Key: SPARK-32613 > URL: https://issues.apache.org/jira/browse/SPARK-32613 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Devesh Agrawal >Priority: Major > Fix For: 3.1.0 > > > Test "decommission workers ensure that fetch failures lead to rerun" is > failing: > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127357/testReport/org.apache.spark.deploy/DecommissionWorkerSuite/decommission_workers_ensure_that_fetch_failures_lead_to_rerun/] > https://github.com/apache/spark/pull/29367/checks?check_run_id=972990200#step:14:13579 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32613) DecommissionWorkerSuite has started failing sporadically again
[ https://issues.apache.org/jira/browse/SPARK-32613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177368#comment-17177368 ] Apache Spark commented on SPARK-32613: -- User 'agrawaldevesh' has created a pull request for this issue: https://github.com/apache/spark/pull/29422 > DecommissionWorkerSuite has started failing sporadically again > -- > > Key: SPARK-32613 > URL: https://issues.apache.org/jira/browse/SPARK-32613 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Devesh Agrawal >Priority: Major > Fix For: 3.1.0 > > > Test "decommission workers ensure that fetch failures lead to rerun" is > failing: > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127357/testReport/org.apache.spark.deploy/DecommissionWorkerSuite/decommission_workers_ensure_that_fetch_failures_lead_to_rerun/] > https://github.com/apache/spark/pull/29367/checks?check_run_id=972990200#step:14:13579 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32613) DecommissionWorkerSuite has started failing sporadically again
Devesh Agrawal created SPARK-32613: -- Summary: DecommissionWorkerSuite has started failing sporadically again Key: SPARK-32613 URL: https://issues.apache.org/jira/browse/SPARK-32613 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.0 Reporter: Devesh Agrawal Fix For: 3.1.0 Test "decommission workers ensure that fetch failures lead to rerun" is failing: [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127357/testReport/org.apache.spark.deploy/DecommissionWorkerSuite/decommission_workers_ensure_that_fetch_failures_lead_to_rerun/] https://github.com/apache/spark/pull/29367/checks?check_run_id=972990200#step:14:13579 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32611) Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect results when timestamp is present in predicate
[ https://issues.apache.org/jira/browse/SPARK-32611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumeet updated SPARK-32611: --- Description: *How to reproduce this behavior?* * TZ="America/Los_Angeles" ./bin/spark-shell * sql("set spark.sql.hive.convertMetastoreOrc=true") * sql("set spark.sql.orc.impl=hive") * sql("create table t_spark(col timestamp) stored as orc;") * sql("insert into t_spark values (cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp));") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return empty results, which is incorrect.* * sql("set spark.sql.orc.impl=native") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return 1 row, which is the expected output.* The above query using (True, hive) returns *correct results if pushdown filters are turned off*. * sql("set spark.sql.orc.filterPushdown=false") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return 1 row, which is the expected output.* was: *How to reproduce this behavior?* * TZ="America/Los_Angeles" ./bin/spark-shell --conf spark.sql.catalogImplementation=hive * sql("set spark.sql.hive.convertMetastoreOrc=true") * sql("set spark.sql.orc.impl=hive") * sql("create table t_spark(col timestamp) stored as orc;") * sql("insert into t_spark values (cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp));") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return empty results, which is incorrect.* * sql("set spark.sql.orc.impl=native") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return 1 row, which is the expected output.* The above query using (True, hive) returns *correct results if pushdown filters are turned off*. 
* sql("set spark.sql.orc.filterPushdown=false") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return 1 row, which is the expected output.* > Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect > when timestamp is present in predicate > > > Key: SPARK-32611 > URL: https://issues.apache.org/jira/browse/SPARK-32611 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Sumeet >Priority: Major > > *How to reproduce this behavior?* > * TZ="America/Los_Angeles" ./bin/spark-shell > * sql("set spark.sql.hive.convertMetastoreOrc=true") > * sql("set spark.sql.orc.impl=hive") > * sql("create table t_spark(col timestamp) stored as orc;") > * sql("insert into t_spark values (cast('2100-01-01 > 01:33:33.123America/Los_Angeles' as timestamp));") > * sql("select col, date_format(col, 'DD') from t_spark where col = > cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) > *This will return empty results, which is incorrect.* > * sql("set spark.sql.orc.impl=native") > * sql("select col, date_format(col, 'DD') from t_spark where col = > cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) > *This will return 1 row, which is the expected output.* > > The above query using (True, hive) returns *correct results if pushdown > filters are turned off*. > * sql("set spark.sql.orc.filterPushdown=false") > * sql("select col, date_format(col, 'DD') from t_spark where col = > cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) > *This will return 1 row, which is the expected output.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32604) Bug in ALSModel Python Documentation
[ https://issues.apache.org/jira/browse/SPARK-32604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177315#comment-17177315 ] Zach Cahoone commented on SPARK-32604: -- The code here ([https://github.com/apache/spark/blob/master/examples/src/main/python/ml/als_example.py]) is correct. I believe the documentation webpage just needs to be updated. [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html#cold-start-strategy] > Bug in ALSModel Python Documentation > > > Key: SPARK-32604 > URL: https://issues.apache.org/jira/browse/SPARK-32604 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.0, 3.0.0 >Reporter: Zach Cahoone >Priority: Minor > > In the ALSModel documentation > ([https://spark.apache.org/docs/latest/ml-collaborative-filtering.html]), > there is a bug which causes data frame creation to fail with the following > error: > {code:java} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 > (TID 15, 10.0.0.133, executor 10): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, > in main > process() > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, > in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 390, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/usr/lib/spark/python/pyspark/rdd.py", line 1354, in takeUpToNumLeft > yield next(iterator) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in > wrapper > return f(*args, **kwargs) > File "", line 24, in > NameError: name 'long' is not defined > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878) > at > org.apache.spark.sched
[jira] [Created] (SPARK-32612) int columns produce inconsistent results on pandas UDFs
Robert Joseph Evans created SPARK-32612: --- Summary: int columns produce inconsistent results on pandas UDFs Key: SPARK-32612 URL: https://issues.apache.org/jira/browse/SPARK-32612 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.0.0 Reporter: Robert Joseph Evans This is similar to SPARK-30187 but I personally consider this data corruption. If I have a simple pandas UDF {code} >>> def add(a, b): return a + b >>> my_udf = pandas_udf(add, returnType=LongType()) {code} And I want to process some data with it, say 32 bit ints {code} >>> df = spark.createDataFrame([(1037694399, 1204615848), (3, 4)], StructType([StructField("a", IntegerType()), StructField("b", IntegerType())])) >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show() +--+--+---+ | a| b|add((a - 3), b)| +--+--+---+ |1037694399|1204615848|-2052657052| | 3| 4| 4| +--+--+---+ {code} I get an integer overflow for the data, as I would expect. But as soon as I add a {{None}} to the data, even on a different row, the result I get back is totally different. {code} >>> df = spark.createDataFrame([(1037694399, 1204615848), (3, None)], StructType([StructField("a", IntegerType()), StructField("b", IntegerType())])) >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show() +--+--+---+ | a| b|add((a - 3), b)| +--+--+---+ |1037694399|1204615848| 2242310244| | 3| null| null| +--+--+---+ {code} The integer overflow disappears. This is because arrow and/or pandas changes the data type to a float in order to be able to store the null value. Since the processing is then being done in floating point, there is no overflow. This in and of itself is annoying but understandable, because it is dealing with a limitation in pandas. Where it becomes a bug is that this happens on a per-batch basis. This means that I can have the same two rows in different parts of my data set and get different results depending on their proximity to a null value. {code} >>> df = spark.createDataFrame([(1037694399, 1204615848), (3, None), (1037694399, 1204615848), (3, 4)], StructType([StructField("a", IntegerType()), StructField("b", IntegerType())])) >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show() +--+--+---+ | a| b|add((a - 3), b)| +--+--+---+ |1037694399|1204615848| 2242310244| | 3| null| null| |1037694399|1204615848| 2242310244| | 3| 4| 4| +--+--+---+ >>> spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', '2') >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show() +--+--+---+ | a| b|add((a - 3), b)| +--+--+---+ |1037694399|1204615848| 2242310244| | 3| null| null| |1037694399|1204615848|-2052657052| | 3| 4| 4| +--+--+---+ {code} For me personally, I would prefer to have all nullable integer columns upgraded to float all the time; that way it is at least consistent. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
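Note on SPARK-32612 above: the per-batch type promotion can be sidestepped for this particular example by widening the inputs before they reach the pandas UDF, so every batch does the same arithmetic whether or not it happens to contain a null. This is only a hedged user-level sketch, not the fix the ticket asks for; the session and column names follow the report but the rest is illustrative.
{code}
# Minimal sketch, assuming a running SparkSession `spark` and PySpark 3.0+.
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

@pandas_udf("long")
def add(a: pd.Series, b: pd.Series) -> pd.Series:
    return a + b

df = spark.createDataFrame(
    [(1037694399, 1204615848), (3, None), (1037694399, 1204615848), (3, 4)],
    "a int, b int")

# Casting both inputs to 64-bit integers up front means no batch can overflow,
# so identical rows no longer give different answers depending on whether their
# batch contained a null.
df.select(
    col("a"), col("b"),
    add(col("a").cast("long") - 3, col("b").cast("long")).alias("sum_ab")
).show()
{code}
With the cast in place, both copies of the (1037694399, 1204615848) row should come back as 2242310244 regardless of the spark.sql.execution.arrow.maxRecordsPerBatch setting.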
[jira] [Commented] (SPARK-25557) ORC predicate pushdown for nested fields
[ https://issues.apache.org/jira/browse/SPARK-25557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177301#comment-17177301 ] Apache Spark commented on SPARK-25557: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/29427 > ORC predicate pushdown for nested fields > > > Key: SPARK-25557 > URL: https://issues.apache.org/jira/browse/SPARK-25557 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25557) ORC predicate pushdown for nested fields
[ https://issues.apache.org/jira/browse/SPARK-25557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177300#comment-17177300 ] Apache Spark commented on SPARK-25557: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/29427 > ORC predicate pushdown for nested fields > > > Key: SPARK-25557 > URL: https://issues.apache.org/jira/browse/SPARK-25557 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31776) Literal lit() supports lists and numpy arrays
[ https://issues.apache.org/jira/browse/SPARK-31776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177281#comment-17177281 ] Zirui Xu commented on SPARK-31776: -- Hi [~mengxr], is this ticket suggesting adding support for Python list and numpy arrays in both lit() and fillna()? I understand that fillna() does not support Array type so adding Python list support to fillna() might involve adding Array support from the SQL side first. > Literal lit() supports lists and numpy arrays > - > > Key: SPARK-31776 > URL: https://issues.apache.org/jira/browse/SPARK-31776 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xiangrui Meng >Priority: Major > > In ML workload, it is common to replace null feature vectors with some > default value. However, lit() does not support Python list and numpy arrays > at input. Users cannot simply use fillna() to get the job done. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
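Note on SPARK-31776 above: until lit() accepts Python lists and numpy arrays, the null-feature replacement the ticket describes is usually done by assembling an array column element by element. The snippet below is only an illustrative sketch of that workaround (the DataFrame, column, and variable names are made up for the example), not part of the proposed change.
{code}
# Hedged sketch: replace null array-valued columns with a default today,
# without lit(list) support. Assumes a running SparkSession named `spark`.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, [1.0, 2.0]), (2, None)],
    "id int, features array<double>")

default_features = [0.0, 0.0]
# lit() cannot take the list directly, so build an array column from per-element literals.
default_col = F.array(*[F.lit(v) for v in default_features])

df.withColumn("features", F.coalesce(F.col("features"), default_col)).show()
{code}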
[jira] [Created] (SPARK-32611) Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect results when timestamp is present in predicate
Sumeet created SPARK-32611: -- Summary: Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect when timestamp is present in predicate Key: SPARK-32611 URL: https://issues.apache.org/jira/browse/SPARK-32611 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.0.1 Reporter: Sumeet *How to reproduce this behavior?* * TZ="America/Los_Angeles" ./bin/spark-shell --conf spark.sql.catalogImplementation=hive * sql("set spark.sql.hive.convertMetastoreOrc=true") * sql("set spark.sql.orc.impl=hive") * sql("create table t_spark(col timestamp) stored as orc;") * sql("insert into t_spark values (cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp));") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return empty results, which is incorrect.* * sql("set spark.sql.orc.impl=native") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return 1 row, which is the expected output.* The above query using (True, hive) returns *correct results if pushdown filters are turned off*. * sql("set spark.sql.orc.filterPushdown=false") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return 1 row, which is the expected output.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177246#comment-17177246 ] Rohit Mishra edited comment on SPARK-32609 at 8/13/20, 6:28 PM: [~mingjial], Please refrain from adding Priority as "Critical". These are reserved for committers. Changing it to "Major". Also please don't populate Target and Fix version field. These are also set by committers. was (Author: rohitmishr1484): [~mingjial], Please refrain from adding Priority as "Critical". These are reserved for committers. Changing it to "Major". > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Major > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. 
> {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohit Mishra updated SPARK-32609: - Target Version/s: (was: 2.4.5, 2.4.7) > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Major > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This 
message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177246#comment-17177246 ] Rohit Mishra commented on SPARK-32609: -- [~mingjial], Please refrain from adding Priority as "Critical". These are reserved for committers. Changing it to "Major". > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Critical > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 
2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohit Mishra updated SPARK-32609: - Priority: Major (was: Critical) > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Major > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This message was 
sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177231#comment-17177231 ] Mingjia Liu commented on SPARK-32609: - I am currently working on a fix & unit test > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Critical > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > 
+-++---+ > only showing top 20 rows > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31198) Use graceful decommissioning as part of dynamic scaling
[ https://issues.apache.org/jira/browse/SPARK-31198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-31198. -- Fix Version/s: 3.1.0 Resolution: Fixed > Use graceful decommissioning as part of dynamic scaling > --- > > Key: SPARK-31198 > URL: https://issues.apache.org/jira/browse/SPARK-31198 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32610) Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper version
[ https://issues.apache.org/jira/browse/SPARK-32610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177204#comment-17177204 ] Apache Spark commented on SPARK-32610: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/29426 > Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper > version > -- > > Key: SPARK-32610 > URL: https://issues.apache.org/jira/browse/SPARK-32610 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > There are links to metrics.dropwizard.io in monitoring.md but the link > targets refer the version 3.1.0, while we use 4.1.1. > Now that users can create their own metrics using the dropwizard library, > it's better to fix the links to refer the proper version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32610) Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper version
[ https://issues.apache.org/jira/browse/SPARK-32610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32610: Assignee: Kousuke Saruta (was: Apache Spark) > Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper > version > -- > > Key: SPARK-32610 > URL: https://issues.apache.org/jira/browse/SPARK-32610 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > There are links to metrics.dropwizard.io in monitoring.md but the link > targets refer the version 3.1.0, while we use 4.1.1. > Now that users can create their own metrics using the dropwizard library, > it's better to fix the links to refer the proper version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32610) Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper version
[ https://issues.apache.org/jira/browse/SPARK-32610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32610: Assignee: Apache Spark (was: Kousuke Saruta) > Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper > version > -- > > Key: SPARK-32610 > URL: https://issues.apache.org/jira/browse/SPARK-32610 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > There are links to metrics.dropwizard.io in monitoring.md but the link > targets refer the version 3.1.0, while we use 4.1.1. > Now that users can create their own metrics using the dropwizard library, > it's better to fix the links to refer the proper version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32610) Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper version
[ https://issues.apache.org/jira/browse/SPARK-32610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177203#comment-17177203 ] Apache Spark commented on SPARK-32610: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/29426 > Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper > version > -- > > Key: SPARK-32610 > URL: https://issues.apache.org/jira/browse/SPARK-32610 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > There are links to metrics.dropwizard.io in monitoring.md but the link > targets refer the version 3.1.0, while we use 4.1.1. > Now that users can create their own metrics using the dropwizard library, > it's better to fix the links to refer the proper version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mingjia Liu updated SPARK-32609: Target Version/s: 2.4.5, 2.4.7 (was: 2.4.5) > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Critical > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This 
message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28367) Kafka connector infinite wait because metadata never updated
[ https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177202#comment-17177202 ] Ruslan Dautkhanov commented on SPARK-28367: --- [~gsomogyi] thanks! yep would be great to learn how this is done on the Flink side. > Kafka connector infinite wait because metadata never updated > > > Key: SPARK-28367 > URL: https://issues.apache.org/jira/browse/SPARK-28367 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0, 3.1.0 >Reporter: Gabor Somogyi >Priority: Critical > > Spark uses an old and deprecated API named poll(long) which never returns and > stays in live lock if metadata is not updated (for instance when broker > disappears at consumer creation). > I've created a small standalone application to test it and the alternatives: > https://github.com/gaborgsomogyi/kafka-get-assignment -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177199#comment-17177199 ] Mingjia Liu commented on SPARK-32609: - Mitigation: Turn off spark.sql.exchange.reuse. eg: spark.conf.set("spark.sql.exchange.reuse", "false") Root cause: bug at [https://github.com/apache/spark/blob/e5bef51826dc2ff4020879e35ae7eb9019aa7fcd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala#L48] Fix: Add pushedfilters comparison in equals function. verified that applying the fix brings right plan and result. > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Critical > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. 
> {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32610) Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper version
Kousuke Saruta created SPARK-32610: -- Summary: Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper version Key: SPARK-32610 URL: https://issues.apache.org/jira/browse/SPARK-32610 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.0.0, 3.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta There are links to metrics.dropwizard.io in monitoring.md but the link targets refer to version 3.1.0, while we use 4.1.1. Now that users can create their own metrics using the dropwizard library, it's better to fix the links to refer to the proper version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
Mingjia Liu created SPARK-32609: --- Summary: Incorrect exchange reuse with DataSourceV2 Key: SPARK-32609 URL: https://issues.apache.org/jira/browse/SPARK-32609 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5 Reporter: Mingjia Liu {code:java} spark.conf.set("spark.sql.exchange.reuse","true") spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() df.createOrReplaceTempView(table) df = spark.sql(""" WITH t1 AS ( SELECT d_year, d_month_seq FROM ( SELECT t1.d_year , t2.d_month_seq FROM date_dim t1 cross join date_dim t2 where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 ) GROUP BY d_year, d_month_seq) SELECT prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq FROM t1 curr_yr cross join t1 prev_yr WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 ORDER BY d_month_seq LIMIT 100 """) df.explain() df.show(){code} the repro query : A. defines a temp table t1 B. cross join t1 (year 2002) and t2 (year 2001) With reuse exchange enabled, the plan incorrectly "decides" to re-use persisted shuffle writes of A filtering on year 2002 , for year 2001. {code:java} == Physical Plan == TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS year#24367L, d_month_seq#24371L] +- CartesianProduct :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], functions=[]) : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], functions=[]) :+- BroadcastNestedLoopJoin BuildRight, Cross : :- *(1) Project [d_year#23551L] : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: [table=tpcds_1G.date_dim,paths=[]]) : +- BroadcastExchange IdentityBroadcastMode : +- *(2) Project [d_month_seq#24371L] : +- *(2) ScanV2 BigQueryDataSourceV2[d_month_seq#24371L] (Filters: [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: [table=tpcds_1G.date_dim,paths=[]]) +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], functions=[]) +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} And the result is obviously incorrect because prev_year should be 2001 {code:java} +-++---+ |prev_year|year|d_month_seq| +-++---+ | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| +-++---+ only showing top 20 rows {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32350) Add batch write support on LevelDB to improve performance of HybridStore
[ https://issues.apache.org/jira/browse/SPARK-32350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177112#comment-17177112 ] Apache Spark commented on SPARK-32350: -- User 'baohe-zhang' has created a pull request for this issue: https://github.com/apache/spark/pull/29425 > Add batch write support on LevelDB to improve performance of HybridStore > > > Key: SPARK-32350 > URL: https://issues.apache.org/jira/browse/SPARK-32350 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.1, 3.1.0 >Reporter: Baohe Zhang >Assignee: Baohe Zhang >Priority: Major > Fix For: 3.1.0 > > > The idea is to improve the performance of HybridStore by adding batch write > support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 > introduces HybridStore. HybridStore will write data to InMemoryStore at first > and use a background thread to dump data to LevelDB once the writing to > InMemoryStore is completed. In the comments section of > [https://github.com/apache/spark/pull/28412], Mridul Muralidharan mentioned > using batch writing can improve the performance of this dumping process and > he wrote the code of writeAll(). > I did the comparison of the HybridStore switching time between one-by-one > write and batch write on an HDD disk. When the disk is free, the batch-write > has around 25% improvement, and when the disk is 100% busy, the batch-write > has 7x - 10x improvement. > when the disk is at 0% utilization: > > ||log size, jobs and tasks per job||original switching time, with > write()||switching time with writeAll()|| > |133m, 400 jobs, 100 tasks per job|16s|13s| > |265m, 400 jobs, 200 tasks per job|30s|23s| > |1.3g, 1000 jobs, 400 tasks per job|136s|108s| > > when the disk is at 100% utilization: > ||log size, jobs and tasks per job||original switching time, with > write()||switching time with writeAll()|| > |133m, 400 jobs, 100 tasks per job|116s|17s| > |265m, 400 jobs, 200 tasks per job|251s|26s| > I also ran some write related benchmarking tests on LevelDBBenchmark.java and > measured the total time of writing 1024 objects. > when the disk is at 0% utilization: > > ||Benchmark test||with write(), ms||with writeAll(), ms || > |randomUpdatesIndexed|213.060|157.356| > |randomUpdatesNoIndex|57.869|35.439| > |randomWritesIndexed|298.854|229.274| > |randomWritesNoIndex|66.764|38.361| > |sequentialUpdatesIndexed|87.019|56.219| > |sequentialUpdatesNoIndex|61.851|41.942| > |sequentialWritesIndexed|94.044|56.534| > |sequentialWritesNoIndex|118.345|66.483| > > when the disk is at 50% utilization: > ||Benchmark test||with write(), ms||with writeAll(), ms|| > |randomUpdatesIndexed|230.386|180.817| > |randomUpdatesNoIndex|58.935|50.113| > |randomWritesIndexed|315.241|254.400| > |randomWritesNoIndex|96.709|41.164| > |sequentialUpdatesIndexed|89.971|70.387| > |sequentialUpdatesNoIndex|72.021|53.769| > |sequentialWritesIndexed|103.052|67.358| > |sequentialWritesNoIndex|76.194|99.037| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32350) Add batch write support on LevelDB to improve performance of HybridStore
[ https://issues.apache.org/jira/browse/SPARK-32350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177110#comment-17177110 ] Apache Spark commented on SPARK-32350: -- User 'baohe-zhang' has created a pull request for this issue: https://github.com/apache/spark/pull/29425 > Add batch write support on LevelDB to improve performance of HybridStore > > > Key: SPARK-32350 > URL: https://issues.apache.org/jira/browse/SPARK-32350 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.1, 3.1.0 >Reporter: Baohe Zhang >Assignee: Baohe Zhang >Priority: Major > Fix For: 3.1.0 > > > The idea is to improve the performance of HybridStore by adding batch write > support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 > introduces HybridStore. HybridStore will write data to InMemoryStore at first > and use a background thread to dump data to LevelDB once the writing to > InMemoryStore is completed. In the comments section of > [https://github.com/apache/spark/pull/28412], Mridul Muralidharan mentioned > using batch writing can improve the performance of this dumping process and > he wrote the code of writeAll(). > I did the comparison of the HybridStore switching time between one-by-one > write and batch write on an HDD disk. When the disk is free, the batch-write > has around 25% improvement, and when the disk is 100% busy, the batch-write > has 7x - 10x improvement. > when the disk is at 0% utilization: > > ||log size, jobs and tasks per job||original switching time, with > write()||switching time with writeAll()|| > |133m, 400 jobs, 100 tasks per job|16s|13s| > |265m, 400 jobs, 200 tasks per job|30s|23s| > |1.3g, 1000 jobs, 400 tasks per job|136s|108s| > > when the disk is at 100% utilization: > ||log size, jobs and tasks per job||original switching time, with > write()||switching time with writeAll()|| > |133m, 400 jobs, 100 tasks per job|116s|17s| > |265m, 400 jobs, 200 tasks per job|251s|26s| > I also ran some write related benchmarking tests on LevelDBBenchmark.java and > measured the total time of writing 1024 objects. > when the disk is at 0% utilization: > > ||Benchmark test||with write(), ms||with writeAll(), ms || > |randomUpdatesIndexed|213.060|157.356| > |randomUpdatesNoIndex|57.869|35.439| > |randomWritesIndexed|298.854|229.274| > |randomWritesNoIndex|66.764|38.361| > |sequentialUpdatesIndexed|87.019|56.219| > |sequentialUpdatesNoIndex|61.851|41.942| > |sequentialWritesIndexed|94.044|56.534| > |sequentialWritesNoIndex|118.345|66.483| > > when the disk is at 50% utilization: > ||Benchmark test||with write(), ms||with writeAll(), ms|| > |randomUpdatesIndexed|230.386|180.817| > |randomUpdatesNoIndex|58.935|50.113| > |randomWritesIndexed|315.241|254.400| > |randomWritesNoIndex|96.709|41.164| > |sequentialUpdatesIndexed|89.971|70.387| > |sequentialUpdatesNoIndex|72.021|53.769| > |sequentialWritesIndexed|103.052|67.358| > |sequentialWritesNoIndex|76.194|99.037| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32608) Script Transform DELIMIT value error
[ https://issues.apache.org/jira/browse/SPARK-32608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-32608: -- Description: For SQL {code:java} SELECT TRANSFORM(a, b, c) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' NULL DEFINED AS 'null' USING 'cat' AS (a, b, c) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' NULL DEFINED AS 'NULL' FROM testData {code} The correct TOK_TABLEROWFORMATFIELD should be , but is actually ','; TOK_TABLEROWFORMATLINES should be \n but is actually '\n' > Script Transform DELIMIT value error > - > > Key: SPARK-32608 > URL: https://issues.apache.org/jira/browse/SPARK-32608 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > For SQL > > {code:java} > SELECT TRANSFORM(a, b, c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LINES TERMINATED BY '\n' > NULL DEFINED AS 'null' > USING 'cat' AS (a, b, c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LINES TERMINATED BY '\n' > NULL DEFINED AS 'NULL' > FROM testData > {code} > The correct > TOK_TABLEROWFORMATFIELD should be , but is actually ',' > TOK_TABLEROWFORMATLINES should be \n but is actually '\n' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
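Put differently, the values parsed from FIELDS TERMINATED BY and LINES TERMINATED BY keep their surrounding single quotes (and the literal \n escape) instead of being unescaped. A small Python illustration of the expected unescaping (a hypothetical helper, not the actual parser change):
{code:python}
def unescape_delimiter(token):
    # drop the surrounding single quotes of the SQL string literal,
    # then turn escape sequences such as \n into the real character
    return token.strip("'").encode("utf-8").decode("unicode_escape")

assert unescape_delimiter("','") == ","      # TOK_TABLEROWFORMATFIELD
assert unescape_delimiter(r"'\n'") == "\n"   # TOK_TABLEROWFORMATLINES
{code}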
[jira] [Created] (SPARK-32608) Script Transform DELIMIT value error
angerszhu created SPARK-32608: - Summary: Script Transform DELIMIT value error Key: SPARK-32608 URL: https://issues.apache.org/jira/browse/SPARK-32608 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32607) Script Transformation no-serde read line should respect `TOK_TABLEROWFORMATLINES`
[ https://issues.apache.org/jira/browse/SPARK-32607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-32607: -- Description: reader.readLine() should respect `TOK_TABLEROWFORMATLINES` {code:java} protected def createOutputIteratorWithoutSerde( writerThread: BaseScriptTransformationWriterThread, inputStream: InputStream, proc: Process, stderrBuffer: CircularBuffer): Iterator[InternalRow] = { new Iterator[InternalRow] { var curLine: String = null val reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8)) val outputRowFormat = ioschema.outputRowFormatMap("TOK_TABLEROWFORMATFIELD") val processRowWithoutSerde = if (!ioschema.schemaLess) { prevLine: String => new GenericInternalRow( prevLine.split(outputRowFormat) .zip(outputFieldWriters) .map { case (data, writer) => writer(data) }) } else { // In schema less mode, hive default serde will choose first two output column as output // if output column size less then 2, it will throw ArrayIndexOutOfBoundsException. // Here we change spark's behavior same as hive's default serde. // But in hive, TRANSFORM with schema less behavior like origin spark, we will fix this // to keep spark and hive behavior same in SPARK-32388 val kvWriter = CatalystTypeConverters.createToCatalystConverter(StringType) prevLine: String => new GenericInternalRow( prevLine.split(outputRowFormat).slice(0, 2) .map(kvWriter)) } override def hasNext: Boolean = { try { if (curLine == null) { curLine = reader.readLine() if (curLine == null) { checkFailureAndPropagate(writerThread, null, proc, stderrBuffer) return false } } true } catch { case NonFatal(e) => // If this exception is due to abrupt / unclean termination of `proc`, // then detect it and propagate a better exception message for end users checkFailureAndPropagate(writerThread, e, proc, stderrBuffer) throw e } } override def next(): InternalRow = { if (!hasNext) { throw new NoSuchElementException } val prevLine = curLine curLine = reader.readLine() processRowWithoutSerde(prevLine) } } } {code} was: reader.readLine() should respect `` {code:java} protected def createOutputIteratorWithoutSerde( writerThread: BaseScriptTransformationWriterThread, inputStream: InputStream, proc: Process, stderrBuffer: CircularBuffer): Iterator[InternalRow] = { new Iterator[InternalRow] { var curLine: String = null val reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8)) val outputRowFormat = ioschema.outputRowFormatMap("TOK_TABLEROWFORMATFIELD") val processRowWithoutSerde = if (!ioschema.schemaLess) { prevLine: String => new GenericInternalRow( prevLine.split(outputRowFormat) .zip(outputFieldWriters) .map { case (data, writer) => writer(data) }) } else { // In schema less mode, hive default serde will choose first two output column as output // if output column size less then 2, it will throw ArrayIndexOutOfBoundsException. // Here we change spark's behavior same as hive's default serde. 
// But in hive, TRANSFORM with schema less behavior like origin spark, we will fix this // to keep spark and hive behavior same in SPARK-32388 val kvWriter = CatalystTypeConverters.createToCatalystConverter(StringType) prevLine: String => new GenericInternalRow( prevLine.split(outputRowFormat).slice(0, 2) .map(kvWriter)) } override def hasNext: Boolean = { try { if (curLine == null) { curLine = reader.readLine() if (curLine == null) { checkFailureAndPropagate(writerThread, null, proc, stderrBuffer) return false } } true } catch { case NonFatal(e) => // If this exception is due to abrupt / unclean termination of `proc`, // then detect it and propagate a better exception message for end users checkFailureAndPropagate(writerThread, e, proc, stderrBuffer) throw e } } override def next(): InternalRow = { if (!hasNext) { throw new NoSuchElementException } val prevLine = curLine curLine = reader.readLine() processRowWithoutSerde(prevLine) } } } {code} > Script Transformation no-serde read line should respect > `TOK_TABLEROWFORMATLINES` > - > > Key: SPARK-32607 > URL: https://issues.apache
[jira] [Updated] (SPARK-32607) Script Transformation no-serde read line should respect `TOK_TABLEROWFORMATLINES`
[ https://issues.apache.org/jira/browse/SPARK-32607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-32607: -- Description: reader.readLine() should respect `` {code:java} protected def createOutputIteratorWithoutSerde( writerThread: BaseScriptTransformationWriterThread, inputStream: InputStream, proc: Process, stderrBuffer: CircularBuffer): Iterator[InternalRow] = { new Iterator[InternalRow] { var curLine: String = null val reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8)) val outputRowFormat = ioschema.outputRowFormatMap("TOK_TABLEROWFORMATFIELD") val processRowWithoutSerde = if (!ioschema.schemaLess) { prevLine: String => new GenericInternalRow( prevLine.split(outputRowFormat) .zip(outputFieldWriters) .map { case (data, writer) => writer(data) }) } else { // In schema less mode, hive default serde will choose first two output column as output // if output column size less then 2, it will throw ArrayIndexOutOfBoundsException. // Here we change spark's behavior same as hive's default serde. // But in hive, TRANSFORM with schema less behavior like origin spark, we will fix this // to keep spark and hive behavior same in SPARK-32388 val kvWriter = CatalystTypeConverters.createToCatalystConverter(StringType) prevLine: String => new GenericInternalRow( prevLine.split(outputRowFormat).slice(0, 2) .map(kvWriter)) } override def hasNext: Boolean = { try { if (curLine == null) { curLine = reader.readLine() if (curLine == null) { checkFailureAndPropagate(writerThread, null, proc, stderrBuffer) return false } } true } catch { case NonFatal(e) => // If this exception is due to abrupt / unclean termination of `proc`, // then detect it and propagate a better exception message for end users checkFailureAndPropagate(writerThread, e, proc, stderrBuffer) throw e } } override def next(): InternalRow = { if (!hasNext) { throw new NoSuchElementException } val prevLine = curLine curLine = reader.readLine() processRowWithoutSerde(prevLine) } } } {code} > Script Transformation no-serde read line should respect > `TOK_TABLEROWFORMATLINES` > - > > Key: SPARK-32607 > URL: https://issues.apache.org/jira/browse/SPARK-32607 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > reader.readLine() should respect `` > {code:java} > protected def createOutputIteratorWithoutSerde( > writerThread: BaseScriptTransformationWriterThread, > inputStream: InputStream, > proc: Process, > stderrBuffer: CircularBuffer): Iterator[InternalRow] = { > new Iterator[InternalRow] { > var curLine: String = null > val reader = new BufferedReader(new InputStreamReader(inputStream, > StandardCharsets.UTF_8)) > val outputRowFormat = > ioschema.outputRowFormatMap("TOK_TABLEROWFORMATFIELD") > val processRowWithoutSerde = if (!ioschema.schemaLess) { > prevLine: String => > new GenericInternalRow( > prevLine.split(outputRowFormat) > .zip(outputFieldWriters) > .map { case (data, writer) => writer(data) }) > } else { > // In schema less mode, hive default serde will choose first two output > column as output > // if output column size less then 2, it will throw > ArrayIndexOutOfBoundsException. > // Here we change spark's behavior same as hive's default serde. 
> // But in hive, TRANSFORM with schema less behavior like origin spark, > we will fix this > // to keep spark and hive behavior same in SPARK-32388 > val kvWriter = > CatalystTypeConverters.createToCatalystConverter(StringType) > prevLine: String => > new GenericInternalRow( > prevLine.split(outputRowFormat).slice(0, 2) > .map(kvWriter)) > } > override def hasNext: Boolean = { > try { > if (curLine == null) { > curLine = reader.readLine() > if (curLine == null) { > checkFailureAndPropagate(writerThread, null, proc, stderrBuffer) > return false > } > } > true > } catch { > case NonFatal(e) => > // If this exception is due to abrupt / unclean termination of > `proc`, > // then detect it and propagate a better exception message for end > users > checkFailureAndPropagate(writerThread, e, proc,
[jira] [Created] (SPARK-32607) Script Transformation no-serde read line should respect `TOK_TABLEROWFORMATLINES`
angerszhu created SPARK-32607: - Summary: Script Transformation no-serde read line should respect `TOK_TABLEROWFORMATLINES` Key: SPARK-32607 URL: https://issues.apache.org/jira/browse/SPARK-32607 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
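The underlying problem is that BufferedReader.readLine() always splits on newlines, so any LINES TERMINATED BY value other than \n is ignored when reading the script's output. A conceptual Python sketch of splitting a stream on a configurable record terminator (illustration only, not the Scala fix):
{code:python}
def read_records(stream, terminator=b"\n", chunk_size=4096):
    """Yield records split on `terminator` instead of assuming newline."""
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            if buf:
                yield buf          # trailing record without a terminator
            return
        buf += chunk
        while terminator in buf:
            record, buf = buf.split(terminator, 1)
            yield record
{code}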
[jira] [Commented] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177084#comment-17177084 ] Kayal commented on SPARK-32053: --- ~\AppData\Local\IBMWS\miniconda3\envs\desktop\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name) 327 "An error occurred while calling \{0}{1}\{2}.\n". --> 328 format(target_id, ".", name), value) 329 else: Py4JJavaError: An error occurred while calling o289.save. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:100) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopDataset$1(PairRDDFunctions.scala:1090) at org.apache.spark.rdd.PairRDDFunctions$$Lambda$2417.1D19A7B0.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1088) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$4(PairRDDFunctions.scala:1061) at org.apache.spark.rdd.PairRDDFunctions$$Lambda$2415.0FE34B70.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1026) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$3(PairRDDFunctions.scala:1008) at org.apache.spark.rdd.PairRDDFunctions$$Lambda$2414.1CBB0D40.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1007) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$2(PairRDDFunctions.scala:964) at org.apache.spark.rdd.PairRDDFunctions$$Lambda$2413.1D196EA0.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:962) at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$2(RDD.scala:1552) at org.apache.spark.rdd.RDD$$Lambda$2411.18FEB4E0.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1552) at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$1(RDD.scala:1538) at 
org.apache.spark.rdd.RDD$$Lambda$2410.1CA30180.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1538) at org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:413) at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$1(Pipeline.scala:250) at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$1$adapted(Pipeline.scala:247) at org.apache.spark.ml.Pipeline$SharedReadWrite$$$Lambda$2397.190AB010.apply(Unknown Source) at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) at org.apache.spark.ml.util.Instrumentation$$$Lambda$1390.18680E40.apply(Unknown Source) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) at org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:247) at org.apache.spark.ml.Pipeline$PipelineWriter.saveImpl(Pipeline.scala:206) at org.apache.spark.ml.util.MLWriter.save(ReadWrite.
[jira] [Commented] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177082#comment-17177082 ] Kayal commented on SPARK-32053: --- !image-2020-08-13-20-29-40-555.png! > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png, > image-2020-08-13-20-28-28-779.png, image-2020-08-13-20-29-40-555.png, > screenshot-1.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kayal updated SPARK-32053: -- Attachment: image-2020-08-13-20-29-40-555.png > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png, > image-2020-08-13-20-28-28-779.png, image-2020-08-13-20-29-40-555.png, > screenshot-1.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kayal updated SPARK-32053: -- Attachment: image-2020-08-13-20-28-28-779.png > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png, > image-2020-08-13-20-28-28-779.png, screenshot-1.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kayal updated SPARK-32053: -- Attachment: screenshot-1.png > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png, > screenshot-1.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177079#comment-17177079 ] Kayal commented on SPARK-32053: --- !image-2020-08-13-20-25-57-585.png! > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kayal updated SPARK-32053: -- Attachment: image-2020-08-13-20-25-57-585.png > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kayal updated SPARK-32053: -- Attachment: image-2020-08-13-20-24-57-309.png > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177077#comment-17177077 ] Kayal commented on SPARK-32053: --- !image-2020-08-13-20-24-57-309.png! > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177073#comment-17177073 ] Kayal commented on SPARK-32053: --- The code to reproduce the issue in a Windows Jupyter notebook:
import pyspark
#from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext("local", "First App")
from pyspark.sql import SparkSession
sess = SparkSession(sc)
training = sess.createDataFrame([
    ("0L", "a b c d e WML", 1.0),
    ("1L", "b d", 0.0),
    ("2L", "WML f g h", 1.0),
    ("3L", "hadoop mapreduce", 0.0)], ["id", "text", "label"])
evaluation = sess.createDataFrame([
    ("4L", "a b c WML", 1.0),
    ("5L", "l m n o p", 0.0),
    ("6L", "WML g h i k", 1.0),
    ("7L", "apache hadoop zuzu", 0.0)], ["id", "text", "label"])
testing = sess.createDataFrame([
    ("4L", "a b c z WML"),
    ("5L", "l m n"),
    ("6L", "WML g h i j k"),
    ("7L", "apache hadoop")], ["id", "text"])
import traceback
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SQLContext as sql_context
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
stages = [tokenizer, hashingTF, lr]
pipeline = Pipeline(stages=stages)
model = pipeline.fit(training)
test_result = model.transform(testing)
pipeline.write().overwrite().save("tempfile")
The write operation fails with the error mentioned above. This is blocking our product delivery; please consider treating this as a high-priority blocker. Is there a workaround for this? Is Spark ML supported in PySpark on Windows? I also see the same error with the pipeline.save() method. > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Major > Attachments: image-2020-06-22-18-19-32-236.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
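All of the stack traces in this ticket fail inside saveAsTextFile / SparkHadoopWriter, which on Windows is commonly caused by the Hadoop native layer (winutils.exe, hadoop.dll) not being installed rather than by the ML writer itself. A hedged workaround sketch, assuming that is the cause here (the C:\hadoop path is a placeholder):
{code:python}
import os

# Assumption: winutils.exe and hadoop.dll are placed under C:\hadoop\bin
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = os.environ["PATH"] + os.pathsep + r"C:\hadoop\bin"

# Create the SparkContext/SparkSession only after setting the environment,
# then retry the failing call from the reproduction above:
# pipeline.write().overwrite().save("tempfile")
{code}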
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kayal updated SPARK-32053: -- Priority: Blocker (was: Major) > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kayal reopened SPARK-32053: --- Hi, I have verified the issue in spark latest version 3.0.0 , the issue seems to be still there on windows. The problem is on windows when we try to pipline.write().overwrite().save(temp_dir) is failing with ~\AppData\Local\IBMWS\miniconda3\envs\desktop\lib\site-packages\pyspark\ml\util.py in save(self, path) 173 if not isinstance(path, basestring): 174 raise TypeError("path should be a basestring, got type %s" % type(path)) --> 175 self._jwrite.save(path) 176 177 def overwrite(self): ~\AppData\Local\IBMWS\miniconda3\envs\desktop\lib\site-packages\py4j\java_gateway.py in __call__(self, *args) 1303 answer = self.gateway_client.send_command(command) 1304 return_value = get_return_value( -> 1305 answer, self.gateway_client, self.target_id, self.name) 1306 1307 for temp_arg in temp_args: ~\AppData\Local\IBMWS\miniconda3\envs\desktop\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw) 129 def deco(*a, **kw): 130 try: --> 131 return f(*a, **kw) 132 except py4j.protocol.Py4JJavaError as e: 133 converted = convert_exception(e.java_exception) ~\AppData\Local\IBMWS\miniconda3\envs\desktop\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name) 326 raise Py4JJavaError( 327 "An error occurred while calling \{0}{1}\{2}.\n". --> 328 format(target_id, ".", name), value) 329 else: 330 raise Py4JError( Py4JJavaError: An error occurred while calling o662.save. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:100) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopDataset$1(PairRDDFunctions.scala:1090) at org.apache.spark.rdd.PairRDDFunctions$$Lambda$2417.1D19A7B0.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1088) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$4(PairRDDFunctions.scala:1061) at org.apache.spark.rdd.PairRDDFunctions$$Lambda$2415.0FE34B70.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1026) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$3(PairRDDFunctions.scala:1008) at org.apache.spark.rdd.PairRDDFunctions$$Lambda$2414.1CBB0D40.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1007) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$2(PairRDDFunctions.scala:964) at 
org.apache.spark.rdd.PairRDDFunctions$$Lambda$2413.1D196EA0.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:962) at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$2(RDD.scala:1552) at org.apache.spark.rdd.RDD$$Lambda$2411.18FEB4E0.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1552) at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$1(RDD.scala:1538) at org.apache.spark.rdd.RDD$$Lambda$2410.1CA30180.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperat
[jira] [Commented] (SPARK-32587) SPARK SQL writing to JDBC target with bit datatype using Dataframe is writing NULL values
[ https://issues.apache.org/jira/browse/SPARK-32587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176996#comment-17176996 ] Hyukjin Kwon commented on SPARK-32587: -- Could you share the reproducible codes and the actual output? > SPARK SQL writing to JDBC target with bit datatype using Dataframe is writing > NULL values > - > > Key: SPARK-32587 > URL: https://issues.apache.org/jira/browse/SPARK-32587 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Mohit Dave >Priority: Major > > While writing to a target in SQL Server using Microsoft's SQL Server driver > using dataframe.write API the target is storing NULL values for BIT columns. > > Table definition > Azure SQL DB > 1)Create 2 tables with column type as bit > 2)Insert some record into 1 table > Create a SPARK job > 1)Create a Dataframe using spark.read with the following query > select from > 2)Write the dataframe to a target table with bit type as column. > > Observation : Bit type is getting converted to NULL at the target > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
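For reference, a minimal reproduction along the lines of the description would look roughly like this (connection details and table names are placeholders; the point to verify is whether the BIT column arrives as NULL in target_table):
{code:python}
jdbc_url = "jdbc:sqlserver://<host>:1433;databaseName=<db>"   # placeholder
props = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# source_table and target_table both define a BIT column
src = spark.read.jdbc(url=jdbc_url, table="source_table", properties=props)
src.printSchema()   # check how the BIT column was mapped (boolean is expected)

src.write.jdbc(url=jdbc_url, table="target_table", mode="append", properties=props)
{code}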
[jira] [Commented] (SPARK-32591) Add better api docs for stage level scheduling Resources
[ https://issues.apache.org/jira/browse/SPARK-32591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176994#comment-17176994 ] Hyukjin Kwon commented on SPARK-32591: -- +100!! > Add better api docs for stage level scheduling Resources > > > Key: SPARK-32591 > URL: https://issues.apache.org/jira/browse/SPARK-32591 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Priority: Major > > A question came up when we added offheap memory to be able to set in a > ResourceProfile executor resources. > [https://github.com/apache/spark/pull/28972/] > Based on that discussion we should add better api docs to explain what each > one does. Perhaps point to the corresponding configuration . -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
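For readers landing here, the API being documented looks roughly like the following in PySpark (a sketch against the Spark 3.1 pyspark.resource module; exact method names, and the offheap request discussed in the linked PR, should be checked against the final docs):
{code:python}
from pyspark.resource import (ExecutorResourceRequests, TaskResourceRequests,
                              ResourceProfileBuilder)

# Executor-side requests broadly mirror the spark.executor.* configs
ereqs = ExecutorResourceRequests().cores(4).memory("4g").memoryOverhead("1g")
# Task-side requests broadly mirror the spark.task.* configs
treqs = TaskResourceRequests().cpus(2)

# `build` is exposed as a property in the Python API (assumption)
rp = ResourceProfileBuilder().require(ereqs).require(treqs).build
rdd = sc.parallelize(range(100), 4).withResources(rp)
{code}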
[jira] [Commented] (SPARK-32601) Issue in converting an RDD of Arrow RecordBatches in v3.0.0
[ https://issues.apache.org/jira/browse/SPARK-32601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176993#comment-17176993 ] Hyukjin Kwon commented on SPARK-32601: -- {{ArrowSerializer}} isn't supposed to be an API. Can this be reproduced by using regular APIs in Spark? > Issue in converting an RDD of Arrow RecordBatches in v3.0.0 > --- > > Key: SPARK-32601 > URL: https://issues.apache.org/jira/browse/SPARK-32601 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Tanveer >Priority: Major > > The following simple code snippet for converting an RDD of Arrow > RecordBatches works perfectly in Spark v2.3.4. > > {code:java} > // code placeholder > from pyspark.sql import SparkSession > import pyspark > import pyarrow as pa > from pyspark.serializers import ArrowSerializer > def _arrow_record_batch_dumps(rb): > # Fix for interoperability between pyarrow version >=0.15 and Spark's > arrow version > # Streaming message protocol has changed, remove setting when upgrading > spark. > import os > os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1' > > return bytearray(ArrowSerializer().dumps(rb)) > def rb_return(ardd): > data = [ > pa.array(range(5), type='int16'), > pa.array([-10, -5, 0, None, 10], type='int32') > ] > schema = pa.schema([pa.field('c0', pa.int16()), > pa.field('c1', pa.int32())], >metadata={b'foo': b'bar'}) > return pa.RecordBatch.from_arrays(data, schema=schema) > if __name__ == '__main__': > spark = SparkSession \ > .builder \ > .appName("Python Arrow-in-Spark example") \ > .getOrCreate() > # Enable Arrow-based columnar data transfers > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > sc = spark.sparkContext > ardd = spark.sparkContext.parallelize([0,1,2], 3) > ardd = ardd.map(rb_return) > from pyspark.sql.types import from_arrow_schema > from pyspark.sql.dataframe import DataFrame > from pyspark.serializers import ArrowSerializer, PickleSerializer, > AutoBatchedSerializer > # Filter out and cache arrow record batches > ardd = ardd.filter(lambda x: isinstance(x, pa.RecordBatch)).cache() > ardd = ardd.map(_arrow_record_batch_dumps) > schema = pa.schema([pa.field('c0', pa.int16()), > pa.field('c1', pa.int32())], >metadata={b'foo': b'bar'}) > schema = from_arrow_schema(schema) > jrdd = ardd._to_java_object_rdd() > jdf = spark._jvm.PythonSQLUtils.arrowPayloadToDataFrame(jrdd, > schema.json(), spark._wrapped._jsqlContext) > df = DataFrame(jdf, spark._wrapped) > df._schema = schema > df.show() > {code} > > But after updating to Spark to v3.0.0, the same functionality with just > changing arrowPayloadToDataFrame() -> toDataFrame() doesn't work. > > {code:java} > // code placeholder > from pyspark.sql import SparkSession > import pyspark > import pyarrow as pa > #from pyspark.serializers import ArrowSerializerdef dumps(batch): > import pyarrow as pa > import io > sink = io.BytesIO() > writer = pa.RecordBatchFileWriter(sink, batch.schema) > writer.write_batch(batch) > writer.close() > return sink.getvalue()def _arrow_record_batch_dumps(rb): > # Fix for interoperability between pyarrow version >=0.15 and Spark's > arrow version > # Streaming message protocol has changed, remove setting when upgrading > spark. 
> #import os > #os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'#return > bytearray(ArrowSerializer().dumps(rb)) > return bytearray(dumps(rb)) > def rb_return(ardd): > data = [ > pa.array(range(5), type='int16'), > pa.array([-10, -5, 0, None, 10], type='int32') > ] > schema = pa.schema([pa.field('c0', pa.int16()), > pa.field('c1', pa.int32())], >metadata={b'foo': b'bar'}) > return pa.RecordBatch.from_arrays(data, schema=schema)if __name__ == > '__main__': > spark = SparkSession \ > .builder \ > .appName("Python Arrow-in-Spark example") \ > .getOrCreate()# Enable Arrow-based columnar data transfers > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > sc = spark.sparkContextardd = spark.sparkContext.parallelize([0,1,2], > 3) > ardd = ardd.map(rb_return)from pyspark.sql.pandas.types import > from_arrow_schema > from pyspark.sql.dataframe import DataFrame# Filter out and cache > arrow record batches > ardd = ardd.filter(lambda x: isinstance(x, pa.RecordBatch)).cache() > ardd = ardd.map(_arrow_record_batch_dumps)schema = > pa.schema([pa.field('c0', p
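One route that uses only regular APIs (a sketch, assuming each RecordBatch fits comfortably in memory) is to go through pandas instead of the removed ArrowSerializer: convert each batch to rows and let createDataFrame rebuild the schema.
{code:python}
from pyspark.sql import Row

def batch_to_rows(rb):
    # pyarrow.RecordBatch -> pandas.DataFrame -> list of Row objects
    return [Row(**rec) for rec in rb.to_pandas().to_dict("records")]

# `ardd` is the RDD of RecordBatches from the snippet above
rows_rdd = ardd.flatMap(batch_to_rows)
df = spark.createDataFrame(rows_rdd)
df.show()
{code}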
[jira] [Commented] (SPARK-32604) Bug in ALSModel Python Documentation
[ https://issues.apache.org/jira/browse/SPARK-32604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176992#comment-17176992 ] Hyukjin Kwon commented on SPARK-32604: -- Are you interested in submitting a PR to fix? > Bug in ALSModel Python Documentation > > > Key: SPARK-32604 > URL: https://issues.apache.org/jira/browse/SPARK-32604 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.0, 3.0.0 >Reporter: Zach Cahoone >Priority: Minor > > In the ALSModel documentation > ([https://spark.apache.org/docs/latest/ml-collaborative-filtering.html]), > there is a bug which causes data frame creation to fail with the following > error: > {code:java} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 > (TID 15, 10.0.0.133, executor 10): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, > in main > process() > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, > in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 390, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/usr/lib/spark/python/pyspark/rdd.py", line 1354, in takeUpToNumLeft > yield next(iterator) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in > wrapper > return f(*args, **kwargs) > File "", line 24, in > NameError: name 'long' is not defined > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at 
org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuf
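For reference, the NameError points at the ratings-parsing snippet on that documentation page, which still uses the Python 2 built-in long; under Python 3 the example fails at that line. Assuming that snippet, the fix is to use int:
{code:python}
from pyspark.sql import Row

lines = spark.read.text("data/mllib/als/sample_movielens_ratings.txt").rdd
parts = lines.map(lambda row: row.value.split("::"))
# Python 3 has no `long`; the documented timestamp=long(p[3]) should be int(p[3])
ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
                                     rating=float(p[2]), timestamp=int(p[3])))
ratings = spark.createDataFrame(ratingsRDD)
{code}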
[jira] [Updated] (SPARK-32604) Bug in ALSModel Python Documentation
[ https://issues.apache.org/jira/browse/SPARK-32604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32604: - Component/s: PySpark > Bug in ALSModel Python Documentation > > > Key: SPARK-32604 > URL: https://issues.apache.org/jira/browse/SPARK-32604 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.0, 3.0.0 >Reporter: Zach Cahoone >Priority: Minor > > In the ALSModel documentation > ([https://spark.apache.org/docs/latest/ml-collaborative-filtering.html]), > there is a bug which causes data frame creation to fail with the following > error: > {code:java} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 > (TID 15, 10.0.0.133, executor 10): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, > in main > process() > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, > in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 390, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/usr/lib/spark/python/pyspark/rdd.py", line 1354, in takeUpToNumLeft > yield next(iterator) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in > wrapper > return f(*args, **kwargs) > File "", line 24, in > NameError: name 'long' is not defined > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(
[jira] [Updated] (SPARK-32604) Bug in ALSModel Python Documentation
[ https://issues.apache.org/jira/browse/SPARK-32604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32604: - Target Version/s: (was: 3.0.0) > Bug in ALSModel Python Documentation > > > Key: SPARK-32604 > URL: https://issues.apache.org/jira/browse/SPARK-32604 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.4.0, 3.0.0 >Reporter: Zach Cahoone >Priority: Minor > > In the ALSModel documentation > ([https://spark.apache.org/docs/latest/ml-collaborative-filtering.html]), > there is a bug which causes data frame creation to fail with the following > error: > {code:java} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 > (TID 15, 10.0.0.133, executor 10): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, > in main > process() > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, > in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 390, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/usr/lib/spark/python/pyspark/rdd.py", line 1354, in takeUpToNumLeft > yield next(iterator) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in > wrapper > return f(*args, **kwargs) > File "", line 24, in > NameError: name 'long' is not defined > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortSta
[jira] [Commented] (SPARK-32604) Bug in ALSModel Python Documentation
[ https://issues.apache.org/jira/browse/SPARK-32604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176990#comment-17176990 ] Hyukjin Kwon commented on SPARK-32604: -- Please avoid setting target version which is reserved for committers in general. > Bug in ALSModel Python Documentation > > > Key: SPARK-32604 > URL: https://issues.apache.org/jira/browse/SPARK-32604 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.4.0, 3.0.0 >Reporter: Zach Cahoone >Priority: Minor > > In the ALSModel documentation > ([https://spark.apache.org/docs/latest/ml-collaborative-filtering.html]), > there is a bug which causes data frame creation to fail with the following > error: > {code:java} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 > (TID 15, 10.0.0.133, executor 10): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, > in main > process() > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, > in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 390, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/usr/lib/spark/python/pyspark/rdd.py", line 1354, in takeUpToNumLeft > yield next(iterator) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in > wrapper > return f(*args, **kwargs) > File "", line 24, in > NameError: name 'long' is not defined > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.Ar
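The NameError above comes from the example in the collaborative-filtering guide using Python 2's long() builtin, which does not exist in Python 3. A minimal sketch of the likely fix follows (the file path and column names mirror the documented example and are assumptions here; the only substantive change is replacing long() with int()):
{code:python}
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Parse the sample ratings file as in the ALS documentation example.
lines = spark.read.text("data/mllib/als/sample_movielens_ratings.txt").rdd
parts = lines.map(lambda row: row.value.split("::"))

# Python 3 fix: use int() instead of the removed Python 2 long() builtin.
ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
                                     rating=float(p[2]), timestamp=int(p[3])))
ratings = spark.createDataFrame(ratingsRDD)
{code}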
[jira] [Commented] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeachBatch in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176966#comment-17176966 ] JinxinTang commented on SPARK-32500: Kindly ping [~hyukjin.kwon] It's my pleasure : ) > Query and Batch Id not set for Structured Streaming Jobs in case of > ForeachBatch in PySpark > --- > > Key: SPARK-32500 > URL: https://issues.apache.org/jira/browse/SPARK-32500 > Project: Spark > Issue Type: Bug > Components: PySpark, Structured Streaming > Affects Versions: 2.4.6 > Reporter: Abhishek Dixit > Priority: Major > Attachments: Screen Shot 2020-07-26 at 6.50.39 PM.png, Screen Shot 2020-07-30 at 9.04.21 PM.png, image-2020-08-01-10-21-51-246.png > > > Query Id and Batch Id information is not available for jobs started by a structured streaming query when the _foreachBatch_ API is used in PySpark. > This happens only with foreachBatch in PySpark. ForeachBatch in Scala works fine, and other structured streaming sinks in PySpark also work fine. I am attaching a screenshot of the Jobs pages. > I think the job group is not set properly when _foreachBatch_ is used via PySpark. I have a framework that depends on the _queryId_ and _batchId_ information available in the job properties, so my framework doesn't work for the PySpark foreachBatch use case. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
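For reference, a minimal PySpark sketch of the reported setup (the rate source and the trivial batch function are placeholders, not taken from the ticket): jobs triggered inside the foreachBatch function are the ones that show up without the streaming query's job group, so queryId/batchId are missing from their job properties.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder streaming source; any source should reproduce the behaviour.
stream_df = spark.readStream.format("rate").load()

def handle_batch(batch_df, batch_id):
    # Jobs started here run without the query/batch id job group
    # when foreachBatch is used from PySpark, per this ticket.
    batch_df.count()

query = stream_df.writeStream.foreachBatch(handle_batch).start()
query.awaitTermination(10)
query.stop()
{code}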
[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal
[ https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176955#comment-17176955 ] Wenchen Fan commented on SPARK-32018: - [~Gengliang.Wang] we should create a new JIRA ticket for the new fix. The new fix is not applicable to 2.4 as 2.4 does not have ANSI mode. > Fix UnsafeRow set overflowed decimal > > > Key: SPARK-32018 > URL: https://issues.apache.org/jira/browse/SPARK-32018 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.6, 3.0.0 > Reporter: Allison Wang > Assignee: Wenchen Fan > Priority: Major > Fix For: 2.4.7, 3.0.1, 3.1.0 > > > There is a bug where writing an overflowed decimal into UnsafeRow succeeds, but reading it back throws ArithmeticException. The exception is thrown when {{getDecimal}} is called on UnsafeRow and the stored decimal's precision is greater than the precision requested by the caller. Setting the value of the overflowed decimal to null when writing into UnsafeRow should fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
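As a rough SQL-level illustration of the overflow handling being discussed (a behaviour sketch only, not the internal UnsafeRow code path the ticket fixes): a decimal value that does not fit its declared precision is represented as null under Spark's default settings, which is the same treatment the proposed fix applies when writing an overflowed decimal into UnsafeRow; ANSI mode in 3.0+ is what can turn such overflows into errors instead.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 12345.678 does not fit DECIMAL(5,2); under the default (non-ANSI) settings
# the cast is expected to yield null rather than an out-of-range decimal.
spark.sql("SELECT CAST('12345.678' AS DECIMAL(5,2)) AS d").show()

# Multiplying two large decimals overflows the maximum precision (38) and
# also yields null here by default; with spark.sql.ansi.enabled=true (3.0+)
# such overflow is reported as an error, which is why the new fix is said
# not to apply to 2.4.
spark.sql("""SELECT CAST('99999999999999999999' AS DECIMAL(38,0))
                  * CAST('99999999999999999999' AS DECIMAL(38,0)) AS p""").show()
{code}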
[jira] [Assigned] (SPARK-32244) Build and run the Spark with test cases in Github Actions
[ https://issues.apache.org/jira/browse/SPARK-32244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32244: Assignee: Hyukjin Kwon > Build and run the Spark with test cases in Github Actions > - > > Key: SPARK-32244 > URL: https://issues.apache.org/jira/browse/SPARK-32244 > Project: Spark > Issue Type: Umbrella > Components: Project Infra > Affects Versions: 2.4.6, 3.0.0, 3.1.0 > Reporter: Hyukjin Kwon > Assignee: Hyukjin Kwon > Priority: Critical > > From last week onwards, the Jenkins machines have become very unstable for several reasons: > - Apparently, the machines became extremely slow; almost no tests can pass. > - One machine (worker 4) started to have a corrupt .m2 repository, which fails the build. > - The documentation build fails from time to time for an unknown reason, specifically on the Jenkins machines. > Almost all PRs are currently blocked by this instability. > This JIRA aims to run the tests in Github Actions: > - To avoid depending on the few people who can access the cluster. > - To reduce the elapsed time of the build - we could split the tests (e.g., SQL, ML, CORE) and run them in parallel so the total build time is significantly reduced. > - To control the environment more flexibly. > - To let other contributors test and propose fixes to the Github Actions configurations, so we can distribute this build management cost. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32606) Remove the fork of action-download-artifact in test_report.yml
Hyukjin Kwon created SPARK-32606: Summary: Remove the fork of action-download-artifact in test_report.yml Key: SPARK-32606 URL: https://issues.apache.org/jira/browse/SPARK-32606 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 3.1.0 Reporter: Hyukjin Kwon The fork of action-download-artifact was changed in https://github.com/HyukjinKwon/action-download-artifact/commit/750b71af351aba467757d7be6924199bb08db4ed in order to add support for downloading all artifacts. That change should be contributed back to the original plugin so we can avoid using the fork. Alternatively, we can use the official actions/download-artifact once it supports downloading artifacts between different workflows, see also https://github.com/actions/download-artifact/issues/3 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org