[jira] [Created] (SPARK-32615) AQE aggregateMetrics java.util.NoSuchElementException
Leanken.Lin created SPARK-32615: --- Summary: AQE aggregateMetrics java.util.NoSuchElementException Key: SPARK-32615 URL: https://issues.apache.org/jira/browse/SPARK-32615 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Leanken.Lin

Reproduce step:
{code:java}
// code placeholder
sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys"
{code}

{code:java}
// code placeholder
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker
java.util.NoSuchElementException: key not found: 12
 at scala.collection.immutable.Map$Map1.apply(Map.scala:114)
 at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
 at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
 at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
 at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
 at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
 at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
 at scala.collection.TraversableLike.map(TraversableLike.scala:238)
 at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
 at scala.collection.AbstractTraversable.map(Traversable.scala:108)
 at org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
 at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at org.apache.spark.util.Utils$.tryLog(Utils.scala:1971)
 at org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
[info] - SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 milliseconds)
{code}

This issue arises because, during AQE, the metrics update is overwritten when a sub-plan changes. In this UT, for example, a BroadcastHashJoinExec is replaced by a LocalTableScanExec, and the onExecutionEnd action then tries to aggregate all metrics recorded during the execution, which raises the NoSuchElementException.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
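Below is a minimal, self-contained sketch (not the actual Spark patch) of the failure mode and of a defensive lookup that avoids it. The map names are illustrative stand-ins for the per-metric structures used inside SQLAppStatusListener.aggregateMetrics, and the accumulator ids are made up to mirror the "key not found: 12" trace above.
{code:scala}
// Sketch only: reproduces the lookup pattern that throws when aggregated task
// metrics reference an accumulator id that the final (adaptively replanned)
// physical plan no longer declares, and shows a tolerant alternative.
object AggregateMetricsSketch {
  def main(args: Array[String]): Unit = {
    // Metrics declared by the final plan (e.g. after a BroadcastHashJoinExec was
    // replaced by a LocalTableScanExec during AQE replanning).
    val metricTypes: Map[Long, String] = Map(10L -> "sum", 11L -> "size")
    // Task updates may still carry id 12 from the stage that was planned away.
    val taskMetricValues: Map[Long, Seq[Long]] = Map(10L -> Seq(1L, 2L), 12L -> Seq(5L))

    // Unsafe: metricTypes(12L) throws java.util.NoSuchElementException.
    // val aggregated = taskMetricValues.map { case (id, vs) => id -> (metricTypes(id), vs.sum) }

    // Tolerant: silently drop metric ids that the final plan does not declare.
    val aggregated = taskMetricValues.flatMap { case (id, vs) =>
      metricTypes.get(id).map(metricType => id -> (metricType, vs.sum))
    }
    println(aggregated) // Map(10 -> (sum,3))
  }
}
{code}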
[jira] [Assigned] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32609: Assignee: Apache Spark > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Assignee: Apache Spark >Priority: Major > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- 
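As a stop-gap (not a fix for the underlying canonicalization problem), disabling exchange reuse keeps the planner from substituting the year-2002 shuffle for the year-2001 branch, at the cost of recomputing the second aggregate. The sketch below assumes the same DSv2 source and table as the repro; everything else is illustrative.
{code:scala}
import org.apache.spark.sql.SparkSession

object ExchangeReuseWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-32609-workaround").getOrCreate()

    // Turn off reuse so each branch of the cross join materializes its own shuffle.
    spark.conf.set("spark.sql.exchange.reuse", "false")

    val df = spark.read
      .format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2")
      .option("table", "tpcds_1G.date_dim")
      .load()
    df.createOrReplaceTempView("date_dim")

    // After this, re-running the repro query from the description should no
    // longer show a ReusedExchange node in explain(); quick sanity check:
    spark.sql("SELECT d_year, d_month_seq FROM date_dim LIMIT 5").show()
  }
}
{code}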
This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177528#comment-17177528 ] Apache Spark commented on SPARK-32609: -- User 'mingjialiu' has created a pull request for this issue: https://github.com/apache/spark/pull/29430 > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Major > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 
2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32609: Assignee: (was: Apache Spark) > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Major > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This message was 
sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32345) SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark client for Spark session
[ https://issues.apache.org/jira/browse/SPARK-32345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-32345. - Resolution: Invalid > SemanticException Failed to get a spark session: > org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark > client for Spark session > -- > > Key: SPARK-32345 > URL: https://issues.apache.org/jira/browse/SPARK-32345 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: 任建亭 >Priority: Blocker > > when using hive on spark engine: > FAILED: SemanticException Failed to get a spark session: > org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark > client for Spark session > hadoop version: 2.7.3 / hive version: 3.1.2 / spark version 3.0.0 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Description: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently for the below piece of code and the given testdata where first row starts with null \u character it will throw the below error. *eg: *val df = spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); df.show(false); *+TestData+* !screenshot-1.png! Internal state when error was thrown: line=1, column=0, record=0, charIndex=7 at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339) at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552) at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160) at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148) at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57) *Note:* Though its the limitation of the univocity parser and the workaround is to provide any other comment character by mentioning .option("comment","#"), but if my actual data starts with this character then the particular row will be discarded. Currently I pushed the code in univocity parser to handle this scenario as part of the below PR https://github.com/uniVocity/univocity-parsers/pull/412 please accept the jira so that we can enable this feature in spark-csv by adding a parameter in spark csvoptions. was: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently for the below piece of code and the given testdata where first row starts with null \u character it will throw the below error. Internal state when error was thrown: line=1, column=0, record=0, charIndex=7 at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339) at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552) at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160) at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148) at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57) eg: val df = spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); df.show(false); *+TestData+* !screenshot-1.png! 
> Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently for the below piece of code and the given testdata where first row > starts with null \u > character it will throw the below error. > *eg: *val df = > spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); > df.show(false); > *+TestData+* > > !screenshot-1.png! > Internal state when error was thrown: line=1, column=0, record=0, charIndex=7 > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339) > at > co
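For reference, the workaround mentioned in the description looks like the sketch below: choosing a comment character other than the default \u0000 lets lines that begin with a NUL byte be parsed as data, but any genuine record beginning with the chosen character would then be dropped. The path and options mirror the example above; the data file itself is not included here.
{code:scala}
import org.apache.spark.sql.SparkSession

object NulPrefixedRecordWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-32614-workaround").master("local[*]").getOrCreate()

    // With the default options, schema inference fails on a first row that starts
    // with \u0000 (see the univocity stack trace above). Overriding "comment"
    // sidesteps that, at the cost of treating '#'-prefixed rows as comments.
    val df = spark.read
      .option("delimiter", ",")
      .option("comment", "#")
      .csv("file:/E:/Data/Testdata.dat")
    df.show(false)
  }
}
{code}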
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Description: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently for the below piece of code and the given testdata where first row starts with null \u character it will throw the below error. Internal state when error was thrown: line=1, column=0, record=0, charIndex=7 at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339) at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552) at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160) at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148) at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57) eg: val df = spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); df.show(false); *+TestData+* !screenshot-1.png! was: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently for the below piece of code and the given testdata where first row starts with null \u character it will throw the eg: val df = spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); df.show(false); *+TestData+* !screenshot-1.png! > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently for the below piece of code and the given testdata where first row > starts with null \u > character it will throw the below error. 
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7 > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339) > at > com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552) > at > org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160) > at > org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148) > at > org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57) > eg: val df = > spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); > df.show(false); > *+TestData+* > > !screenshot-1.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Description: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently for the below piece of code and the given testdata where first row starts with null \u character it will throw the eg: val df = spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); df.show(false); *+TestData+* !screenshot-1.png! was: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently for the below piece of code and the given testdata where first row starts with null \u character eg: val df = spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); df.show(false); *+TestData+* !screenshot-1.png! > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently for the below piece of code and the given testdata where first row > starts with null \u > character it will throw the > eg: val df = > spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); > df.show(false); > *+TestData+* > > !screenshot-1.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Issue Type: Bug (was: New Feature) > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently for the below piece of code and the given testdata where first row > starts with null \u > character it will throw the > eg: val df = > spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); > df.show(false); > *+TestData+* > > !screenshot-1.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Affects Version/s: 2.4.5 > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently for the below piece of code and the given testdata where first row > starts with null \u > character it will throw the > eg: val df = > spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); > df.show(false); > *+TestData+* > > !screenshot-1.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Issue Type: New Feature (was: Improvement) > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently for the below piece of code and the given testdata where first row > starts with null \u > character > eg: val df = > spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); > df.show(false); > *+TestData+* > > !screenshot-1.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Description: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently for the below piece of code and the given testdata where first row starts with null \u character eg: val df = spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); df.show(false); *+TestData+* !screenshot-1.png! was: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently eg: Dataset df = spark.read().option("inferSchema", "true") .option("header", "false") .option("delimiter", ",") .csv("/tmp/delimitedfile.dat) *+TestData+* > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently for the below piece of code and the given testdata where first row > starts with null \u > character > eg: val df = > spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat"); > df.show(false); > *+TestData+* > > !screenshot-1.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Attachment: screenshot-1.png > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > Attachments: screenshot-1.png > > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. > Currently > eg: Dataset df = spark.read().option("inferSchema", "true") > .option("header", > "false") > > .option("delimiter", ",") > > .csv("/tmp/delimitedfile.dat) > *+TestData+* > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Description: In most of the data ware housing scenarios files does not have comment records and every line needs to be treated as a valid record even though it starts with default comment character as \u or null character.Though user can set any comment character other than \u, but there is a chance the actual record can start with those characters. Currently eg: Dataset df = spark.read().option("inferSchema", "true") .option("header", "false") .option("delimiter", ",") .csv("/tmp/delimitedfile.dat) *+TestData+* was: Currently, the delimiter option Spark 2.0 to read and split CSV files/data only support a single character delimiter. If we try to provide multiple delimiters, we observer the following error message. eg: Dataset df = spark.read().option("inferSchema", "true") .option("header", "false") .option("delimiter", ", ") .csv("C:\test.txt"); Exception in thread "main" java.lang.IllegalArgumentException: Delimiter cannot be more than one character: , at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) at org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83) at org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) at scala.Option.orElse(Option.scala:289) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) Generally, the data to be processed contains multiple character delimiters and presently we need to do a manual data clean up on the source/input file, which doesn't work well in large applications which consumes numerous files. There seems to be work-around like reading data as text and using the split option, but this in my opinion defeats the purpose, advantage and efficiency of a direct read from CSV file. > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > > In most of the data ware housing scenarios files does not have comment > records and every line needs to be treated as a valid record even though it > starts with default comment character as \u or null character.Though user > can set any comment character other than \u, but there is a chance the > actual record can start with those characters. 
> Currently > eg: Dataset df = spark.read().option("inferSchema", "true") > .option("header", > "false") > > .option("delimiter", ",") > > .csv("/tmp/delimitedfile.dat) > *+TestData+* > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Fix Version/s: (was: 3.0.0) > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Minor > > Currently, the delimiter option Spark 2.0 to read and split CSV files/data > only support a single character delimiter. If we try to provide multiple > delimiters, we observer the following error message. > eg: Dataset df = spark.read().option("inferSchema", "true") > .option("header", > "false") > .option("delimiter", > ", ") > .csv("C:\test.txt"); > Exception in thread "main" java.lang.IllegalArgumentException: Delimiter > cannot be more than one character: , > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) > > Generally, the data to be processed contains multiple character delimiters > and presently we need to do a manual data clean up on the source/input file, > which doesn't work well in large applications which consumes numerous files. > There seems to be work-around like reading data as text and using the split > option, but this in my opinion defeats the purpose, advantage and efficiency > of a direct read from CSV file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Affects Version/s: (was: 2.3.1) 3.0.0 > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > > Currently, the delimiter option Spark 2.0 to read and split CSV files/data > only support a single character delimiter. If we try to provide multiple > delimiters, we observer the following error message. > eg: Dataset df = spark.read().option("inferSchema", "true") > .option("header", > "false") > .option("delimiter", > ", ") > .csv("C:\test.txt"); > Exception in thread "main" java.lang.IllegalArgumentException: Delimiter > cannot be more than one character: , > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) > > Generally, the data to be processed contains multiple character delimiters > and presently we need to do a manual data clean up on the source/input file, > which doesn't work well in large applications which consumes numerous files. > There seems to be work-around like reading data as text and using the split > option, but this in my opinion defeats the purpose, advantage and efficiency > of a direct read from CSV file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Priority: Major (was: Minor) > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > > Currently, the delimiter option Spark 2.0 to read and split CSV files/data > only support a single character delimiter. If we try to provide multiple > delimiters, we observer the following error message. > eg: Dataset df = spark.read().option("inferSchema", "true") > .option("header", > "false") > .option("delimiter", > ", ") > .csv("C:\test.txt"); > Exception in thread "main" java.lang.IllegalArgumentException: Delimiter > cannot be more than one character: , > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) > > Generally, the data to be processed contains multiple character delimiters > and presently we need to do a manual data clean up on the source/input file, > which doesn't work well in large applications which consumes numerous files. > There seems to be work-around like reading data as text and using the split > option, but this in my opinion defeats the purpose, advantage and efficiency > of a direct read from CSV file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
[ https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan updated SPARK-32614: Component/s: Spark Core > Support for treating the line as valid record if it starts with \u or > null character, or starts with any character mentioned as comment > --- > > Key: SPARK-32614 > URL: https://issues.apache.org/jira/browse/SPARK-32614 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Chandan >Assignee: Jeff Evans >Priority: Major > > Currently, the delimiter option Spark 2.0 to read and split CSV files/data > only support a single character delimiter. If we try to provide multiple > delimiters, we observer the following error message. > eg: Dataset df = spark.read().option("inferSchema", "true") > .option("header", > "false") > .option("delimiter", > ", ") > .csv("C:\test.txt"); > Exception in thread "main" java.lang.IllegalArgumentException: Delimiter > cannot be more than one character: , > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) > > Generally, the data to be processed contains multiple character delimiters > and presently we need to do a manual data clean up on the source/input file, > which doesn't work well in large applications which consumes numerous files. > There seems to be work-around like reading data as text and using the split > option, but this in my opinion defeats the purpose, advantage and efficiency > of a direct read from CSV file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment
Chandan created SPARK-32614: --- Summary: Support for treating the line as valid record if it starts with \u or null character, or starts with any character mentioned as comment Key: SPARK-32614 URL: https://issues.apache.org/jira/browse/SPARK-32614 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Chandan Assignee: Jeff Evans Fix For: 3.0.0 Currently, the delimiter option Spark 2.0 to read and split CSV files/data only support a single character delimiter. If we try to provide multiple delimiters, we observer the following error message. eg: Dataset df = spark.read().option("inferSchema", "true") .option("header", "false") .option("delimiter", ", ") .csv("C:\test.txt"); Exception in thread "main" java.lang.IllegalArgumentException: Delimiter cannot be more than one character: , at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) at org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83) at org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) at scala.Option.orElse(Option.scala:289) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) Generally, the data to be processed contains multiple character delimiters and presently we need to do a manual data clean up on the source/input file, which doesn't work well in large applications which consumes numerous files. There seems to be work-around like reading data as text and using the split option, but this in my opinion defeats the purpose, advantage and efficiency of a direct read from CSV file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
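The "read data as text and split" workaround mentioned at the end of the description can be sketched as below. It assumes a three-column file separated by the two-character delimiter ", " and gives up CSV features such as quoting, escaping, and schema inference; the file path and column names are placeholders.
{code:scala}
import org.apache.spark.sql.SparkSession

object MultiCharDelimiterWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multi-char-delimiter").master("local[*]").getOrCreate()
    import spark.implicits._

    // Read each line as a plain string, then split on the multi-character delimiter.
    val parsed = spark.read.textFile("C:/test.txt")
      .map { line =>
        val fields = line.split(", ", -1).padTo(3, "") // -1 keeps trailing empty fields
        (fields(0), fields(1), fields(2))
      }
      .toDF("col1", "col2", "col3")

    parsed.show(false)
  }
}
{code}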
[jira] [Commented] (SPARK-32345) SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark client for Spark session
[ https://issues.apache.org/jira/browse/SPARK-32345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177468#comment-17177468 ] ZhouDaHong commented on SPARK-32345: If a version conflict has been ruled out as the cause, you can look at queue resources. If queue resource usage reaches 100% and no free task resources are released for creating the Spark session within a short time, the task will fail and this exception will be thrown. Solution: increase the Hive client connection timeout to 15 minutes; set hive.spark.client.server.connect.timeout=90; > SemanticException Failed to get a spark session: > org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark > client for Spark session > -- > > Key: SPARK-32345 > URL: https://issues.apache.org/jira/browse/SPARK-32345 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: 任建亭 >Priority: Blocker > > when using hive on spark engine: > FAILED: SemanticException Failed to get a spark session: > org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark > client for Spark session > hadoop version: 2.7.3 / hive version: 3.1.2 / spark version 3.0.0 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32357) Publish failed and succeeded test reports in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-32357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32357: -- Summary: Publish failed and succeeded test reports in GitHub Actions (was: Investigate test result reporter integration) > Publish failed and succeeded test reports in GitHub Actions > --- > > Key: SPARK-32357 > URL: https://issues.apache.org/jira/browse/SPARK-32357 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > Currently, the readability in the logs are not really good. For example, see > https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z&urlSigningMethod=HMACV1&urlSignature=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D > Maybe we should have a way to report the results in an easy way to read. For > example, Jenkins test report-like feature. > We should maybe also take a look for > https://github.com/check-run-reporter/action. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32357) Investigate test result reporter integration
[ https://issues.apache.org/jira/browse/SPARK-32357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32357: - Assignee: Hyukjin Kwon > Investigate test result reporter integration > > > Key: SPARK-32357 > URL: https://issues.apache.org/jira/browse/SPARK-32357 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Currently, the readability in the logs are not really good. For example, see > https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z&urlSigningMethod=HMACV1&urlSignature=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D > Maybe we should have a way to report the results in an easy way to read. For > example, Jenkins test report-like feature. > We should maybe also take a look for > https://github.com/check-run-reporter/action. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32357) Investigate test result reporter integration
[ https://issues.apache.org/jira/browse/SPARK-32357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32357. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29333 [https://github.com/apache/spark/pull/29333] > Investigate test result reporter integration > > > Key: SPARK-32357 > URL: https://issues.apache.org/jira/browse/SPARK-32357 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > Currently, the readability in the logs are not really good. For example, see > https://pipelines.actions.githubusercontent.com/gik0C3if0ep5i8iNpgFlcJRQk9UyifmoD6XvJANMVttkEP5xje/_apis/pipelines/1/runs/564/signedlogcontent/4?urlExpires=2020-07-09T14%3A05%3A52.5110439Z&urlSigningMethod=HMACV1&urlSignature=gMGczJ8vtNPeQFE0GpjMxSS1BGq14RJLXUfjsLnaX7s%3D > Maybe we should have a way to report the results in an easy way to read. For > example, Jenkins test report-like feature. > We should maybe also take a look for > https://github.com/check-run-reporter/action. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32587) SPARK SQL writing to JDBC target with bit datatype using Dataframe is writing NULL values
[ https://issues.apache.org/jira/browse/SPARK-32587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177465#comment-17177465 ] ZhouDaHong edited comment on SPARK-32587 at 8/14/20, 3:54 AM: -- Sorry, I don't really understand your problem. Do you mean that data cannot be written to the SQL Server database when there is a null value column? If this is the case, please check the structure of the table to see if the error reporting field is defined as "not null" in the database? was (Author: zdh): Sorry, I don't really understand your problem. Do you mean that data cannot be written to the SQL Server database when there is a null value column? If this is the case, please check the structure of the table to see if the error reporting field is defined as "not null" in the database? > SPARK SQL writing to JDBC target with bit datatype using Dataframe is writing > NULL values > - > > Key: SPARK-32587 > URL: https://issues.apache.org/jira/browse/SPARK-32587 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Mohit Dave >Priority: Major > > While writing to a target in SQL Server using Microsoft's SQL Server driver > using dataframe.write API the target is storing NULL values for BIT columns. > > Table definition > Azure SQL DB > 1)Create 2 tables with column type as bit > 2)Insert some record into 1 table > Create a SPARK job > 1)Create a Dataframe using spark.read with the following query > select from > 2)Write the dataframe to a target table with bit type as column. > > Observation : Bit type is getting converted to NULL at the target > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32587) SPARK SQL writing to JDBC target with bit datatype using Dataframe is writing NULL values
[ https://issues.apache.org/jira/browse/SPARK-32587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177465#comment-17177465 ] ZhouDaHong edited comment on SPARK-32587 at 8/14/20, 3:54 AM: -- Sorry, I don't really understand your problem. Do you mean that data cannot be written to the SQL Server database when there is a null value column? If this is the case, please check the structure of the table to see if the error reporting field is defined as "not null" in the database? was (Author: zdh): Sorry, I don't really understand your problem. Do you mean that data cannot be written to the SQL Server database when there is a null value column? If this is the case, please check the structure of the table to be written to and see whether the field in the error is defined as "not null" in the database? > SPARK SQL writing to JDBC target with bit datatype using Dataframe is writing > NULL values > - > > Key: SPARK-32587 > URL: https://issues.apache.org/jira/browse/SPARK-32587 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Mohit Dave >Priority: Major > > While writing to a target in SQL Server using Microsoft's SQL Server driver > using dataframe.write API the target is storing NULL values for BIT columns. > > Table definition > Azure SQL DB > 1)Create 2 tables with column type as bit > 2)Insert some record into 1 table > Create a SPARK job > 1)Create a Dataframe using spark.read with the following query > select from > 2)Write the dataframe to a target table with bit type as column. > > Observation : Bit type is getting converted to NULL at the target > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32587) SPARK SQL writing to JDBC target with bit datatype using Dataframe is writing NULL values
[ https://issues.apache.org/jira/browse/SPARK-32587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177465#comment-17177465 ] ZhouDaHong commented on SPARK-32587: Sorry, I don't really understand your problem. Do you mean that data cannot be written to the SQL Server database when there is a null value column? If this is the case, please check the structure of the table to be written to and see whether the field in the error is defined as "not null" in the database? > SPARK SQL writing to JDBC target with bit datatype using Dataframe is writing > NULL values > - > > Key: SPARK-32587 > URL: https://issues.apache.org/jira/browse/SPARK-32587 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Mohit Dave >Priority: Major > > While writing to a target in SQL Server using Microsoft's SQL Server driver > using dataframe.write API the target is storing NULL values for BIT columns. > > Table definition > Azure SQL DB > 1)Create 2 tables with column type as bit > 2)Insert some record into 1 table > Create a SPARK job > 1)Create a Dataframe using spark.read with the following query > select from > 2)Write the dataframe to a target table with bit type as column. > > Observation : Bit type is getting converted to NULL at the target > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type
[ https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177461#comment-17177461 ] ZhouDaHong commented on SPARK-21774: Hello, in your SQL you are comparing the value of a string-typed field with the literal 0. Because the data types differ (the 0 may be interpreted as a boolean or as an int), the SQL statement [ select a, B from TB where a = 0 ] cannot return the result you expect. It is suggested to change it to [ select a, B from TB where a = '0' ], as illustrated below. > The rule PromoteStrings cast string to a wrong data type > > > Key: SPARK-21774 > URL: https://issues.apache.org/jira/browse/SPARK-21774 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: StanZhai >Priority: Critical > Labels: correctness > > Data > {code} > create temporary view tb as select * from values > ("0", 1), > ("-0.1", 2), > ("1", 3) > as grouping(a, b) > {code} > SQL: > {code} > select a, b from tb where a=0 > {code} > The result which is wrong: > {code} > ++---+ > | a| b| > ++---+ > | 0| 1| > |-0.1| 2| > ++---+ > {code} > Logical Plan: > {code} > == Parsed Logical Plan == > 'Project ['a] > +- 'Filter ('a = 0) >+- 'UnresolvedRelation `src` > == Analyzed Logical Plan == > a: string > Project [a#8528] > +- Filter (cast(a#8528 as int) = 0) >+- SubqueryAlias src > +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529] > +- LocalRelation [_1#8525, _2#8526] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
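The report's data and both forms of the predicate can be tried directly in spark-shell; a short sketch (behaviour depends on the Spark version):
{code:scala}
// Data from the report above.
spark.sql("""create temporary view tb as select * from values
  ("0", 1), ("-0.1", 2), ("1", 3) as grouping(a, b)""")

// Implicit promotion: the analyzed plan becomes Filter (cast(a as int) = 0),
// which is what produces the surprising extra row described in the ticket.
spark.sql("select a, b from tb where a = 0").show()

// Suggested rewrite: compare the string column against a string literal.
spark.sql("select a, b from tb where a = '0'").show()
{code}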
[jira] [Commented] (SPARK-9182) filter and groupBy on DataFrames are not passed through to jdbc source
[ https://issues.apache.org/jira/browse/SPARK-9182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177455#comment-17177455 ] ZhouDaHong commented on SPARK-9182: --- Hello, it seems the problem is that the "Sal" field is of a numeric type, but in the actual SQL processing it is impossible to match that numeric value with anything other than an equality comparison. Try changing the "Sal" field to int or double. > filter and groupBy on DataFrames are not passed through to jdbc source > -- > > Key: SPARK-9182 > URL: https://issues.apache.org/jira/browse/SPARK-9182 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Greg Rahn >Assignee: Yijie Shen >Priority: Critical > > When running all of these API calls, the only one that passes the filter > through to the backend jdbc source is equality. All filters in these > commands should be able to be passed through to the jdbc database source. > {code} > val url="jdbc:postgresql:grahn" > val prop = new java.util.Properties > val emp = sqlContext.read.jdbc(url, "emp", prop) > emp.filter(emp("sal") === 5000).show() > emp.filter(emp("sal") < 5000).show() > emp.filter("sal = 3000").show() > emp.filter("sal > 2500").show() > emp.filter("sal >= 2500").show() > emp.filter("sal < 2500").show() > emp.filter("sal <= 2500").show() > emp.filter("sal != 3000").show() > emp.filter("sal between 3000 and 5000").show() > emp.filter("ename in ('SCOTT','BLAKE')").show() > {code} > We see from the PostgreSQL query log the following is run, and see that only > equality predicates are passed through. > {code} > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp WHERE > sal = 5000 > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp WHERE > sal = 3000 > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
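On more recent Spark versions, which predicates actually reach the JDBC source can also be checked from the Spark side by looking at the physical plan instead of the database log; a sketch reusing the connection details from the report (the exact explain output differs between versions):
{code:scala}
val url = "jdbc:postgresql:grahn"            // connection details as in the report
val prop = new java.util.Properties
val emp = spark.read.jdbc(url, "emp", prop)

// The JDBC scan node in the physical plan lists the predicates it received,
// e.g. "PushedFilters: [IsNotNull(sal), GreaterThan(sal,2500)]".
emp.filter("sal > 2500").explain()
{code}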
[jira] [Updated] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32609: - Shepherd: (was: Reynold Xin) > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Major > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This message was 
sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32604) Bug in ALSModel Python Documentation
[ https://issues.apache.org/jira/browse/SPARK-32604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177439#comment-17177439 ] Hyukjin Kwon commented on SPARK-32604: -- Oh right. So I guess it will be fixed in Spark 3.1 correctly? > Bug in ALSModel Python Documentation > > > Key: SPARK-32604 > URL: https://issues.apache.org/jira/browse/SPARK-32604 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.0, 3.0.0 >Reporter: Zach Cahoone >Priority: Minor > > In the ALSModel documentation > ([https://spark.apache.org/docs/latest/ml-collaborative-filtering.html]), > there is a bug which causes data frame creation to fail with the following > error: > {code:java} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 > (TID 15, 10.0.0.133, executor 10): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, > in main > process() > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, > in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 390, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/usr/lib/spark/python/pyspark/rdd.py", line 1354, in takeUpToNumLeft > yield next(iterator) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in > wrapper > return f(*args, **kwargs) > File "", line 24, in > NameError: name 'long' is not defined > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) 
> at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.
[jira] [Commented] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177437#comment-17177437 ] Hyukjin Kwon commented on SPARK-32053: -- Seems like a Hadoop-on-Windows issue, judging from the log: {code} Caused by: java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\Administrator\AppData\Roaming\IBM Watson Studio\projects\librarydAYYEkp5bh6TSjn106hb\metadata_temporary\0_temporary\attempt_20200813063114_0049_m_00_0\part-0 at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773) at org.apache.hadoop.util.Shell.execCommand(Shell.java:869) at org.apache.hadoop.util.Shell.execCommand(Shell.java:852) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733) at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225) at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209) at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:3 {code} > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png, > image-2020-08-13-20-28-28-779.png, image-2020-08-13-20-29-40-555.png, > screenshot-1.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32053: - Affects Version/s: 3.0.0 > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0, 3.0.0 >Reporter: Kayal >Priority: Minor > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png, > image-2020-08-13-20-28-28-779.png, image-2020-08-13-20-29-40-555.png, > screenshot-1.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32053: - Component/s: Windows > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark, Windows >Affects Versions: 2.3.0, 3.0.0 >Reporter: Kayal >Priority: Minor > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png, > image-2020-08-13-20-28-28-779.png, image-2020-08-13-20-29-40-555.png, > screenshot-1.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32053: - Priority: Minor (was: Blocker) > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Minor > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png, > image-2020-08-13-20-28-28-779.png, image-2020-08-13-20-29-40-555.png, > screenshot-1.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31947) Solve string value error about Date/Timestamp in ScriptTransform
[ https://issues.apache.org/jira/browse/SPARK-31947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177436#comment-17177436 ] Hyukjin Kwon commented on SPARK-31947: -- [~angerszhuuu] can you update which PR fixed this JIRA? > Solve string value error about Date/Timestamp in ScriptTransform > > > Key: SPARK-31947 > URL: https://issues.apache.org/jira/browse/SPARK-31947 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major > > For test case > > {code:java} > test("SPARK-25990: TRANSFORM should handle different data types correctly") { > assume(TestUtils.testCommandAvailable("python")) > val scriptFilePath = getTestResourcePath("test_script.py") > withTempView("v") { > val df = Seq( > (1, "1", 1.0, BigDecimal(1.0), new Timestamp(1), > Date.valueOf("2015-05-21")), > (2, "2", 2.0, BigDecimal(2.0), new Timestamp(2), > Date.valueOf("2015-05-22")), > (3, "3", 3.0, BigDecimal(3.0), new Timestamp(3), > Date.valueOf("2015-05-23")) > ).toDF("a", "b", "c", "d", "e", "f") // Note column d's data type is > Decimal(38, 18) > df.createTempView("v") val query = sql( > s""" >|SELECT >|TRANSFORM(a, b, c, d, e, f) >|USING 'python $scriptFilePath' AS (a, b, c, d, e, f) >|FROM v > """.stripMargin) val decimalToString: Column => Column = c => > c.cast("string") checkAnswer(query, identity, df.select( > 'a.cast("string"), > 'b.cast("string"), > 'c.cast("string"), > decimalToString('d), > 'e.cast("string"), > 'f.cast("string")).collect()) > } > } > {code} > > > Get wrong result > {code:java} > [info] - SPARK-25990: TRANSFORM should handle different data types correctly > *** FAILED *** (4 seconds, 997 milliseconds) > [info] Results do not match for Spark plan: > [info]ScriptTransformation [a#19, b#20, c#21, d#22, e#23, f#24], python > /Users/angerszhu/Documents/project/AngersZhu/spark/sql/core/target/scala-2.12/test-classes/test_script.py, > [a#31, b#32, c#33, d#34, e#35, f#36], > org.apache.spark.sql.execution.script.ScriptTransformIOSchema@1ad5a29c > [info] +- Project [_1#6 AS a#19, _2#7 AS b#20, _3#8 AS c#21, _4#9 AS d#22, > _5#10 AS e#23, _6#11 AS f#24] > [info] +- LocalTableScan [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11] > [info] > [info] > [info]== Results == > [info]!== Expected Answer - 3 == > == Actual Answer - 3 == > [info] ![1,1,1.0,1.00,1970-01-01 08:00:00.001,2015-05-21] > [1,1,1.0,1.00,1000,16576] > [info] ![2,2,2.0,2.00,1970-01-01 08:00:00.002,2015-05-22] > [2,2,2.0,2.00,2000,16577] > [info] ![3,3,3.0,3.00,1970-01-01 08:00:00.003,2015-05-23] > [3,3,3.0,3.00,3000,16578] (SparkPlanTest.scala:95) > [ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors
[ https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-12312: Assignee: Gabor Somogyi > JDBC connection to Kerberos secured databases fails on remote executors > --- > > Key: SPARK-12312 > URL: https://issues.apache.org/jira/browse/SPARK-12312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 2.4.2 >Reporter: nabacg >Assignee: Gabor Somogyi >Priority: Minor > > When loading DataFrames from JDBC datasource with Kerberos authentication, > remote executors (yarn-client/cluster etc. modes) fail to establish a > connection due to lack of Kerberos ticket or ability to generate it. > This is a real issue when trying to ingest data from kerberized data sources > (SQL Server, Oracle) in enterprise environment where exposing simple > authentication access is not an option due to IT policy issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32598) Not able to see driver logs in spark history server in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-32598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177429#comment-17177429 ] Lantao Jin commented on SPARK-32598: [~sriramgr] A pull request is welcome. Please submit it against the master branch. Thanks. > Not able to see driver logs in spark history server in standalone mode > -- > > Key: SPARK-32598 > URL: https://issues.apache.org/jira/browse/SPARK-32598 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.4.4 >Reporter: Sriram Ganesh >Priority: Major > Attachments: image-2020-08-12-11-50-01-899.png > > > Driver logs are not coming in history server in spark standalone mode. > Checked in the spark events logs it is not there. Is this by design or can I > fix it by creating a patch?. Not able to see any proper documentation > regarding this. > > !image-2020-08-12-11-50-01-899.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32608) Script Transform DELIMIT value error
[ https://issues.apache.org/jira/browse/SPARK-32608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32608: Assignee: (was: Apache Spark) > Script Transform DELIMIT value error > - > > Key: SPARK-32608 > URL: https://issues.apache.org/jira/browse/SPARK-32608 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > For SQL > > {code:java} > SELECT TRANSFORM(a, b, c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LINES TERMINATED BY '\n' > NULL DEFINED AS 'null' > USING 'cat' AS (a, b, c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LINES TERMINATED BY '\n' > NULL DEFINED AS 'NULL' > FROM testData > {code} > The correct > TOK_TABLEROWFORMATFIELD should be , nut actually ',' > TOK_TABLEROWFORMATLINES should be \n but actually '\n' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32608) Script Transform DELIMIT value error
[ https://issues.apache.org/jira/browse/SPARK-32608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32608: Assignee: Apache Spark > Script Transform DELIMIT value error > - > > Key: SPARK-32608 > URL: https://issues.apache.org/jira/browse/SPARK-32608 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > For SQL > > {code:java} > SELECT TRANSFORM(a, b, c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LINES TERMINATED BY '\n' > NULL DEFINED AS 'null' > USING 'cat' AS (a, b, c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LINES TERMINATED BY '\n' > NULL DEFINED AS 'NULL' > FROM testData > {code} > The correct > TOK_TABLEROWFORMATFIELD should be , nut actually ',' > TOK_TABLEROWFORMATLINES should be \n but actually '\n' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32608) Script Transform DELIMIT value error
[ https://issues.apache.org/jira/browse/SPARK-32608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177420#comment-17177420 ] Apache Spark commented on SPARK-32608: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/29428 > Script Transform DELIMIT value error > - > > Key: SPARK-32608 > URL: https://issues.apache.org/jira/browse/SPARK-32608 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > For SQL > > {code:java} > SELECT TRANSFORM(a, b, c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LINES TERMINATED BY '\n' > NULL DEFINED AS 'null' > USING 'cat' AS (a, b, c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LINES TERMINATED BY '\n' > NULL DEFINED AS 'NULL' > FROM testData > {code} > The correct > TOK_TABLEROWFORMATFIELD should be , nut actually ',' > TOK_TABLEROWFORMATLINES should be \n but actually '\n' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177401#comment-17177401 ] Dustin Smith edited comment on SPARK-32046 at 8/14/20, 12:51 AM: - [~maropu] the definition of current timestamp is as follows: {code:java} current_timestamp() - Returns the current timestamp at the start of query evaluation. {code} The question when does a query evaluation start and stop? Do mutual exclusive dataframes being processed consist of the same query evaluation? If yes, then current timestamp's behavior in spark shell is correct; however, as user, that would be extremely undesirable behavior. I would rather cache the current timestamp and call it again for a new time. Now if a query evaluation stops once it is executed and starts anew when another dataframe or action is called, then the behavior in shell and notebooks are incorrect. The notebooks are only correct for a few runs and then default to not changing. [https://spark.apache.org/docs/2.3.0/api/sql/index.html#current_timestamp] Additionally, whatever behavior is correct or should be correct is not consistent and more robust testing should occur in my opinion. As an after thought, the name current timestamp doesn't make sense if the time is supposed to freeze after one call. Really it is current timestamp once and beyond that call it is no longer current. was (Author: dustin.smith.tdg): [~maropu] the definition of current timestamp is as follows: {code:java} current_timestamp() - Returns the current timestamp at the start of query evaluation. {code} The question when does a query evaluation start and stop? Do mutual exclusive dataframes being processed consist of the same query evaluation? If yes, then current timestamp's behavior in spark shell is correct; however, as user, that would be extremely undesirable behavior. I would rather cache the current timestamp and call it again for a new time. Now if a query evaluation stops once it is executed and starts anew when another dataframe or action is called, then the behavior in shell and notebooks are incorrect. The notebooks are only correct for a few runs and then default to not changing. [https://spark.apache.org/docs/2.3.0/api/sql/index.html#current_timestamp] Additionally, whatever behavior is correct or should be correct is not consistent and more robust testing should occur in my opinion. > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. 
> Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. > > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177401#comment-17177401 ] Dustin Smith edited comment on SPARK-32046 at 8/14/20, 12:49 AM: - [~maropu] the definition of current timestamp is as follows: {code:java} current_timestamp() - Returns the current timestamp at the start of query evaluation. {code} The question when does a query evaluation start and stop? Do mutual exclusive dataframes being processed consist of the same query evaluation? If yes, then current timestamp's behavior in spark shell is correct; however, as user, that would be extremely undesirable behavior. I would rather cache the current timestamp and call it again for a new time. Now if a query evaluation stops once it is executed and starts anew when another dataframe or action is called, then the behavior in shell and notebooks are incorrect. The notebooks are only correct for a few runs and then default to not changing. [https://spark.apache.org/docs/2.3.0/api/sql/index.html#current_timestamp] Additionally, whatever behavior is correct or should be correct is not consistent and more robust testing should occur in my opinion. was (Author: dustin.smith.tdg): [~maropu] the definition of current timestamp is as follows: {code:java} current_timestamp() - Returns the current timestamp at the start of query evaluation. {code} The question when does a query evaluation start and stop? Do mutual exclusive dataframes being processed consist of the same query evaluation? If yes, then current timestamp's behavior in spark shell is correct; however, as user, that would be extremely undesirable behavior. I would rather cache the current timestamp and call it again for a new time. Now if a query evaluation stops once it is executed and starts anew when another dataframe or action is called, then the behavior in shell and notebooks are incorrect. The notebooks are only correct for a few runs and then default to not changing. [https://spark.apache.org/docs/2.3.0/api/sql/index.html#current_timestamp] > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. > Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. 
> > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177401#comment-17177401 ] Dustin Smith commented on SPARK-32046: -- [~maropu] the definition of current timestamp is as follows: {code:java} current_timestamp() - Returns the current timestamp at the start of query evaluation. {code} The question when does a query evaluation start and stop? Do mutual exclusive dataframes being processed consist of the same query evaluation? If yes, then current timestamp's behavior in spark shell is correct; however, as user, that would be extremely undesirable behavior. I would rather cache the current timestamp and call it again for a new time. Now if a query evaluation stops once it is executed and starts anew when another dataframe or action is called, then the behavior in shell and notebooks are incorrect. The notebooks are only correct for a few runs and then default to not changing. [https://spark.apache.org/docs/2.3.0/api/sql/index.html#current_timestamp] > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. > Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. > > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
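If the requirement is simply a per-dataframe value that never moves, one option (a sketch only, not something proposed in the ticket) is to capture the timestamp once on the driver and embed it as a literal, so the value cannot change even if the cached plan is re-evaluated:
{code:scala}
import java.sql.Timestamp
import org.apache.spark.sql.functions.lit

// Taken once on the driver; the plan then carries a constant, not an expression.
val fixedTs = new Timestamp(System.currentTimeMillis())
val df1 = spark.range(1).select(lit(fixedTs) as "datetime")
df1.cache()
df1.count()
df1.show(false)
{code}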
[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls
[ https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177398#comment-17177398 ] Takeshi Yamamuro commented on SPARK-32046: -- This issue depends on ZP or Jupyter? If so, I think it is hard to reproduce this issue... Non-deterministic exprs can change output if cache broken in the case. So, I think it is not a good idea for applications to depend on those kinds of values, either way. > current_timestamp called in a cache dataframe freezes the time for all future > calls > --- > > Key: SPARK-32046 > URL: https://issues.apache.org/jira/browse/SPARK-32046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.4, 3.0.0 >Reporter: Dustin Smith >Priority: Minor > Labels: caching, sql, time > > If I call current_timestamp 3 times while caching the dataframe variable in > order to freeze that dataframe's time, the 3rd dataframe time and beyond > (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe > and the 2nd will differ in time but will become static on the 3rd usage and > beyond (when running on Zeppelin or Jupyter). > Additionally, caching only caused 2 dataframes to cache skipping the 3rd. > However, > {code:java} > val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache > df.count > // this can be run 3 times no issue. > // then later cast to TimestampType{code} > doesn't have this problem and all 3 dataframes cache with correct times > displaying. > Running the code in shell and Jupyter or Zeppelin (ZP) also produces > different results. In the shell, you only get 1 unique time no matter how > many times you run it, current_timestamp. However, in ZP or Jupyter I have > always received 2 unique times before it froze. > > {code:java} > val df1 = spark.range(1).select(current_timestamp as "datetime").cache > df1.count > df1.show(false) > Thread.sleep(9500) > val df2 = spark.range(1).select(current_timestamp as "datetime").cache > df2.count > df2.show(false) > Thread.sleep(9500) > val df3 = spark.range(1).select(current_timestamp as "datetime").cache > df3.count > df3.show(false){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32613) DecommissionWorkerSuite has started failing sporadically again
[ https://issues.apache.org/jira/browse/SPARK-32613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32613: Assignee: Apache Spark > DecommissionWorkerSuite has started failing sporadically again > -- > > Key: SPARK-32613 > URL: https://issues.apache.org/jira/browse/SPARK-32613 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Devesh Agrawal >Assignee: Apache Spark >Priority: Major > Fix For: 3.1.0 > > > Test "decommission workers ensure that fetch failures lead to rerun" is > failing: > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127357/testReport/org.apache.spark.deploy/DecommissionWorkerSuite/decommission_workers_ensure_that_fetch_failures_lead_to_rerun/] > https://github.com/apache/spark/pull/29367/checks?check_run_id=972990200#step:14:13579 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32613) DecommissionWorkerSuite has started failing sporadically again
[ https://issues.apache.org/jira/browse/SPARK-32613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32613: Assignee: (was: Apache Spark) > DecommissionWorkerSuite has started failing sporadically again > -- > > Key: SPARK-32613 > URL: https://issues.apache.org/jira/browse/SPARK-32613 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Devesh Agrawal >Priority: Major > Fix For: 3.1.0 > > > Test "decommission workers ensure that fetch failures lead to rerun" is > failing: > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127357/testReport/org.apache.spark.deploy/DecommissionWorkerSuite/decommission_workers_ensure_that_fetch_failures_lead_to_rerun/] > https://github.com/apache/spark/pull/29367/checks?check_run_id=972990200#step:14:13579 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32613) DecommissionWorkerSuite has started failing sporadically again
[ https://issues.apache.org/jira/browse/SPARK-32613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177368#comment-17177368 ] Apache Spark commented on SPARK-32613: -- User 'agrawaldevesh' has created a pull request for this issue: https://github.com/apache/spark/pull/29422 > DecommissionWorkerSuite has started failing sporadically again > -- > > Key: SPARK-32613 > URL: https://issues.apache.org/jira/browse/SPARK-32613 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Devesh Agrawal >Priority: Major > Fix For: 3.1.0 > > > Test "decommission workers ensure that fetch failures lead to rerun" is > failing: > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127357/testReport/org.apache.spark.deploy/DecommissionWorkerSuite/decommission_workers_ensure_that_fetch_failures_lead_to_rerun/] > https://github.com/apache/spark/pull/29367/checks?check_run_id=972990200#step:14:13579 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32613) DecommissionWorkerSuite has started failing sporadically again
Devesh Agrawal created SPARK-32613: -- Summary: DecommissionWorkerSuite has started failing sporadically again Key: SPARK-32613 URL: https://issues.apache.org/jira/browse/SPARK-32613 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.0 Reporter: Devesh Agrawal Fix For: 3.1.0 Test "decommission workers ensure that fetch failures lead to rerun" is failing: [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127357/testReport/org.apache.spark.deploy/DecommissionWorkerSuite/decommission_workers_ensure_that_fetch_failures_lead_to_rerun/] https://github.com/apache/spark/pull/29367/checks?check_run_id=972990200#step:14:13579 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32611) Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect results when timestamp is present in predicate
[ https://issues.apache.org/jira/browse/SPARK-32611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sumeet updated SPARK-32611: --- Description: *How to reproduce this behavior?* * TZ="America/Los_Angeles" ./bin/spark-shell * sql("set spark.sql.hive.convertMetastoreOrc=true") * sql("set spark.sql.orc.impl=hive") * sql("create table t_spark(col timestamp) stored as orc;") * sql("insert into t_spark values (cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp));") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return empty results, which is incorrect.* * sql("set spark.sql.orc.impl=native") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return 1 row, which is the expected output.* The above query using (True, hive) returns *correct results if pushdown filters are turned off*. * sql("set spark.sql.orc.filterPushdown=false") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return 1 row, which is the expected output.* was: *How to reproduce this behavior?* * TZ="America/Los_Angeles" ./bin/spark-shell --conf spark.sql.catalogImplementation=hive * sql("set spark.sql.hive.convertMetastoreOrc=true") * sql("set spark.sql.orc.impl=hive") * sql("create table t_spark(col timestamp) stored as orc;") * sql("insert into t_spark values (cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp));") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return empty results, which is incorrect.* * sql("set spark.sql.orc.impl=native") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return 1 row, which is the expected output.* The above query using (True, hive) returns *correct results if pushdown filters are turned off*. 
* sql("set spark.sql.orc.filterPushdown=false") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return 1 row, which is the expected output.* > Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect > when timestamp is present in predicate > > > Key: SPARK-32611 > URL: https://issues.apache.org/jira/browse/SPARK-32611 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Sumeet >Priority: Major > > *How to reproduce this behavior?* > * TZ="America/Los_Angeles" ./bin/spark-shell > * sql("set spark.sql.hive.convertMetastoreOrc=true") > * sql("set spark.sql.orc.impl=hive") > * sql("create table t_spark(col timestamp) stored as orc;") > * sql("insert into t_spark values (cast('2100-01-01 > 01:33:33.123America/Los_Angeles' as timestamp));") > * sql("select col, date_format(col, 'DD') from t_spark where col = > cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) > *This will return empty results, which is incorrect.* > * sql("set spark.sql.orc.impl=native") > * sql("select col, date_format(col, 'DD') from t_spark where col = > cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) > *This will return 1 row, which is the expected output.* > > The above query using (True, hive) returns *correct results if pushdown > filters are turned off*. > * sql("set spark.sql.orc.filterPushdown=false") > * sql("select col, date_format(col, 'DD') from t_spark where col = > cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) > *This will return 1 row, which is the expected output.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32604) Bug in ALSModel Python Documentation
[ https://issues.apache.org/jira/browse/SPARK-32604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177315#comment-17177315 ] Zach Cahoone commented on SPARK-32604: -- The code here ([https://github.com/apache/spark/blob/master/examples/src/main/python/ml/als_example.py]) is correct. I believe the documentation webpage just needs to be updated. [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html#cold-start-strategy] > Bug in ALSModel Python Documentation > > > Key: SPARK-32604 > URL: https://issues.apache.org/jira/browse/SPARK-32604 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.0, 3.0.0 >Reporter: Zach Cahoone >Priority: Minor > > In the ALSModel documentation > ([https://spark.apache.org/docs/latest/ml-collaborative-filtering.html]), > there is a bug which causes data frame creation to fail with the following > error: > {code:java} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 > (TID 15, 10.0.0.133, executor 10): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, > in main > process() > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, > in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 390, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/usr/lib/spark/python/pyspark/rdd.py", line 1354, in takeUpToNumLeft > yield next(iterator) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in > wrapper > return f(*args, **kwargs) > File "", line 24, in > NameError: name 'long' is not defined > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878) > at > org.apache.spark.sched
[jira] [Created] (SPARK-32612) int columns produce inconsistent results on pandas UDFs
Robert Joseph Evans created SPARK-32612: --- Summary: int columns produce inconsistent results on pandas UDFs Key: SPARK-32612 URL: https://issues.apache.org/jira/browse/SPARK-32612 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.0.0 Reporter: Robert Joseph Evans This is similar to SPARK-30187 but I personally consider this data corruption. If I have a simple pandas UDF {code} >>> def add(a, b): return a + b >>> my_udf = pandas_udf(add, returnType=LongType()) {code} And I want to process some data with it, say 32 bit ints {code} >>> df = spark.createDataFrame([(1037694399, 1204615848), (3, 4)], StructType([StructField("a", IntegerType()), StructField("b", IntegerType())])) >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show() +--+--+---+ | a| b|add((a - 3), b)| +--+--+---+ |1037694399|1204615848|-2052657052| | 3| 4| 4| +--+--+---+ {code} I get an integer overflow for the data, as I would expect. But as soon as I add a {{None}} to the data, even on a different row, the result I get back is totally different. {code} >>> df = spark.createDataFrame([(1037694399, 1204615848), (3, None)], StructType([StructField("a", IntegerType()), StructField("b", IntegerType())])) >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show() +--+--+---+ | a| b|add((a - 3), b)| +--+--+---+ |1037694399|1204615848| 2242310244| | 3| null| null| +--+--+---+ {code} The integer overflow disappears. This is because arrow and/or pandas changes the data type to a float in order to be able to store the null value. Since the processing is then being done in floating point, there is no overflow. This in and of itself is annoying but understandable, because it is dealing with a limitation in pandas. Where it becomes a bug is that this happens on a per-batch basis. This means that I can have the same two rows in different parts of my data set and get different results depending on their proximity to a null value. {code} >>> df = spark.createDataFrame([(1037694399, 1204615848), (3, None), (1037694399, 1204615848), (3, 4)], StructType([StructField("a", IntegerType()), StructField("b", IntegerType())])) >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show() +--+--+---+ | a| b|add((a - 3), b)| +--+--+---+ |1037694399|1204615848| 2242310244| | 3| null| null| |1037694399|1204615848| 2242310244| | 3| 4| 4| +--+--+---+ >>> spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', '2') >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show() +--+--+---+ | a| b|add((a - 3), b)| +--+--+---+ |1037694399|1204615848| 2242310244| | 3| null| null| |1037694399|1204615848|-2052657052| | 3| 4| 4| +--+--+---+ {code} For me personally, I would prefer to have all nullable integer columns upgraded to float all the time; that way it is at least consistent. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
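Note on SPARK-32612 above: the per-batch type promotion can be sidestepped for this particular example by widening the inputs before they reach the pandas UDF, so every batch does the same arithmetic whether or not it happens to contain a null. This is only a hedged user-level sketch, not the fix the ticket asks for; the session and column names follow the report but the rest is illustrative.
{code}
# Minimal sketch, assuming a running SparkSession `spark` and PySpark 3.0+.
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

@pandas_udf("long")
def add(a: pd.Series, b: pd.Series) -> pd.Series:
    return a + b

df = spark.createDataFrame(
    [(1037694399, 1204615848), (3, None), (1037694399, 1204615848), (3, 4)],
    "a int, b int")

# Casting both inputs to 64-bit integers up front means no batch can overflow,
# so identical rows no longer give different answers depending on whether their
# batch contained a null.
df.select(
    col("a"), col("b"),
    add(col("a").cast("long") - 3, col("b").cast("long")).alias("sum_ab")
).show()
{code}
With the cast in place, both copies of the (1037694399, 1204615848) row should come back as 2242310244 regardless of the spark.sql.execution.arrow.maxRecordsPerBatch setting.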
[jira] [Commented] (SPARK-25557) ORC predicate pushdown for nested fields
[ https://issues.apache.org/jira/browse/SPARK-25557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177301#comment-17177301 ] Apache Spark commented on SPARK-25557: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/29427 > ORC predicate pushdown for nested fields > > > Key: SPARK-25557 > URL: https://issues.apache.org/jira/browse/SPARK-25557 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25557) ORC predicate pushdown for nested fields
[ https://issues.apache.org/jira/browse/SPARK-25557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177300#comment-17177300 ] Apache Spark commented on SPARK-25557: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/29427 > ORC predicate pushdown for nested fields > > > Key: SPARK-25557 > URL: https://issues.apache.org/jira/browse/SPARK-25557 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31776) Literal lit() supports lists and numpy arrays
[ https://issues.apache.org/jira/browse/SPARK-31776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177281#comment-17177281 ] Zirui Xu commented on SPARK-31776: -- Hi [~mengxr], is this ticket suggesting adding support for Python list and numpy arrays in both lit() and fillna()? I understand that fillna() does not support Array type so adding Python list support to fillna() might involve adding Array support from the SQL side first. > Literal lit() supports lists and numpy arrays > - > > Key: SPARK-31776 > URL: https://issues.apache.org/jira/browse/SPARK-31776 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xiangrui Meng >Priority: Major > > In ML workload, it is common to replace null feature vectors with some > default value. However, lit() does not support Python list and numpy arrays > at input. Users cannot simply use fillna() to get the job done. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
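Note on SPARK-31776 above: until lit() accepts Python lists and numpy arrays, the null-feature replacement the ticket describes is usually done by assembling an array column element by element. The snippet below is only an illustrative sketch of that workaround (the DataFrame, column, and variable names are made up for the example), not part of the proposed change.
{code}
# Hedged sketch: replace null array-valued columns with a default today,
# without lit(list) support. Assumes a running SparkSession named `spark`.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, [1.0, 2.0]), (2, None)],
    "id int, features array<double>")

default_features = [0.0, 0.0]
# lit() cannot take the list directly, so build an array column from per-element literals.
default_col = F.array(*[F.lit(v) for v in default_features])

df.withColumn("features", F.coalesce(F.col("features"), default_col)).show()
{code}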
[jira] [Created] (SPARK-32611) Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect results when timestamp is present in predicate
Sumeet created SPARK-32611: -- Summary: Querying ORC table in Spark3 using spark.sql.orc.impl=hive produces incorrect when timestamp is present in predicate Key: SPARK-32611 URL: https://issues.apache.org/jira/browse/SPARK-32611 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.0.1 Reporter: Sumeet *How to reproduce this behavior?* * TZ="America/Los_Angeles" ./bin/spark-shell --conf spark.sql.catalogImplementation=hive * sql("set spark.sql.hive.convertMetastoreOrc=true") * sql("set spark.sql.orc.impl=hive") * sql("create table t_spark(col timestamp) stored as orc;") * sql("insert into t_spark values (cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp));") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return empty results, which is incorrect.* * sql("set spark.sql.orc.impl=native") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return 1 row, which is the expected output.* The above query using (True, hive) returns *correct results if pushdown filters are turned off*. * sql("set spark.sql.orc.filterPushdown=false") * sql("select col, date_format(col, 'DD') from t_spark where col = cast('2100-01-01 01:33:33.123America/Los_Angeles' as timestamp);").show(false) *This will return 1 row, which is the expected output.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177246#comment-17177246 ] Rohit Mishra edited comment on SPARK-32609 at 8/13/20, 6:28 PM: [~mingjial], Please refrain from adding Priority as "Critical". These are reserved for committers. Changing it to "Major". Also please don't populate Target and Fix version field. These are also set by committers. was (Author: rohitmishr1484): [~mingjial], Please refrain from adding Priority as "Critical". These are reserved for committers. Changing it to "Major". > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Major > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. 
> {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohit Mishra updated SPARK-32609: - Target Version/s: (was: 2.4.5, 2.4.7) > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Major > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This 
message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177246#comment-17177246 ] Rohit Mishra commented on SPARK-32609: -- [~mingjial], Please refrain from adding Priority as "Critical". These are reserved for committers. Changing it to "Major". > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Critical > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 
2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohit Mishra updated SPARK-32609: - Priority: Major (was: Critical) > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Major > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This message was 
sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177231#comment-17177231 ] Mingjia Liu commented on SPARK-32609: - I am currently working on a fix & unit test > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Critical > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > 
+-++---+ > only showing top 20 rows > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31198) Use graceful decommissioning as part of dynamic scaling
[ https://issues.apache.org/jira/browse/SPARK-31198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-31198. -- Fix Version/s: 3.1.0 Resolution: Fixed > Use graceful decommissioning as part of dynamic scaling > --- > > Key: SPARK-31198 > URL: https://issues.apache.org/jira/browse/SPARK-31198 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32610) Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper version
[ https://issues.apache.org/jira/browse/SPARK-32610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177204#comment-17177204 ] Apache Spark commented on SPARK-32610: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/29426 > Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper > version > -- > > Key: SPARK-32610 > URL: https://issues.apache.org/jira/browse/SPARK-32610 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > There are links to metrics.dropwizard.io in monitoring.md but the link > targets refer the version 3.1.0, while we use 4.1.1. > Now that users can create their own metrics using the dropwizard library, > it's better to fix the links to refer the proper version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32610) Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper version
[ https://issues.apache.org/jira/browse/SPARK-32610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32610: Assignee: Kousuke Saruta (was: Apache Spark) > Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper > version > -- > > Key: SPARK-32610 > URL: https://issues.apache.org/jira/browse/SPARK-32610 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > There are links to metrics.dropwizard.io in monitoring.md but the link > targets refer the version 3.1.0, while we use 4.1.1. > Now that users can create their own metrics using the dropwizard library, > it's better to fix the links to refer the proper version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32610) Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper version
[ https://issues.apache.org/jira/browse/SPARK-32610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32610: Assignee: Apache Spark (was: Kousuke Saruta) > Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper > version > -- > > Key: SPARK-32610 > URL: https://issues.apache.org/jira/browse/SPARK-32610 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > There are links to metrics.dropwizard.io in monitoring.md but the link > targets refer the version 3.1.0, while we use 4.1.1. > Now that users can create their own metrics using the dropwizard library, > it's better to fix the links to refer the proper version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32610) Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper version
[ https://issues.apache.org/jira/browse/SPARK-32610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177203#comment-17177203 ] Apache Spark commented on SPARK-32610: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/29426 > Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper > version > -- > > Key: SPARK-32610 > URL: https://issues.apache.org/jira/browse/SPARK-32610 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > There are links to metrics.dropwizard.io in monitoring.md but the link > targets refer the version 3.1.0, while we use 4.1.1. > Now that users can create their own metrics using the dropwizard library, > it's better to fix the links to refer the proper version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mingjia Liu updated SPARK-32609: Target Version/s: 2.4.5, 2.4.7 (was: 2.4.5) > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Critical > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. > {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This 
message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28367) Kafka connector infinite wait because metadata never updated
[ https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177202#comment-17177202 ] Ruslan Dautkhanov commented on SPARK-28367: --- [~gsomogyi] thanks! yep would be great to learn how this is done on the Flink side. > Kafka connector infinite wait because metadata never updated > > > Key: SPARK-28367 > URL: https://issues.apache.org/jira/browse/SPARK-28367 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0, 3.1.0 >Reporter: Gabor Somogyi >Priority: Critical > > Spark uses an old and deprecated API named poll(long) which never returns and > stays in live lock if metadata is not updated (for instance when broker > disappears at consumer creation). > I've created a small standalone application to test it and the alternatives: > https://github.com/gaborgsomogyi/kafka-get-assignment -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177199#comment-17177199 ] Mingjia Liu commented on SPARK-32609: - Mitigation: Turn off spark.sql.exchange.reuse. eg: spark.conf.set("spark.sql.exchange.reuse", "false") Root cause: bug at [https://github.com/apache/spark/blob/e5bef51826dc2ff4020879e35ae7eb9019aa7fcd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala#L48] Fix: Add pushedfilters comparison in equals function. verified that applying the fix brings right plan and result. > Incorrect exchange reuse with DataSourceV2 > -- > > Key: SPARK-32609 > URL: https://issues.apache.org/jira/browse/SPARK-32609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Mingjia Liu >Priority: Critical > Labels: correctness > Original Estimate: 48h > Remaining Estimate: 48h > > > {code:java} > spark.conf.set("spark.sql.exchange.reuse","true") > spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() > df.createOrReplaceTempView(table) > > df = spark.sql(""" > WITH t1 AS ( > SELECT > d_year, d_month_seq > FROM ( > SELECT t1.d_year , t2.d_month_seq > FROM > date_dim t1 > cross join > date_dim t2 > where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 > and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 > ) > GROUP BY d_year, d_month_seq) > > SELECT > prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq > FROM t1 curr_yr cross join t1 prev_yr > WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 > ORDER BY d_month_seq > LIMIT 100 > > """) > df.explain() > df.show(){code} > > the repro query : > A. defines a temp table t1 > B. cross join t1 (year 2002) and t2 (year 2001) > With reuse exchange enabled, the plan incorrectly "decides" to re-use > persisted shuffle writes of A filtering on year 2002 , for year 2001. 
> {code:java} > == Physical Plan == > TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS > FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) > +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS > year#24367L, d_month_seq#24371L] >+- CartesianProduct > :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) > : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], > functions=[]) > :+- BroadcastNestedLoopJoin BuildRight, Cross > : :- *(1) Project [d_year#23551L] > : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] > (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), > isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > : +- BroadcastExchange IdentityBroadcastMode > : +- *(2) Project [d_month_seq#24371L] > : +- *(2) ScanV2 > BigQueryDataSourceV2[d_month_seq#24371L] (Filters: > [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), > isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: > [table=tpcds_1G.date_dim,paths=[]]) > +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], > functions=[]) > +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange > hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} > > > And the result is obviously incorrect because prev_year should be 2001 > {code:java} > +-++---+ > |prev_year|year|d_month_seq| > +-++---+ > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > | 2002|2002| 1212| > +-++---+ > only showing top 20 rows > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32610) Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper version
Kousuke Saruta created SPARK-32610: -- Summary: Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper version Key: SPARK-32610 URL: https://issues.apache.org/jira/browse/SPARK-32610 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.0.0, 3.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta There are links to metrics.dropwizard.io in monitoring.md but the link targets refer to version 3.1.0, while we use 4.1.1. Now that users can create their own metrics using the dropwizard library, it's better to fix the links to refer to the proper version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32609) Incorrect exchange reuse with DataSourceV2
Mingjia Liu created SPARK-32609: --- Summary: Incorrect exchange reuse with DataSourceV2 Key: SPARK-32609 URL: https://issues.apache.org/jira/browse/SPARK-32609 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5 Reporter: Mingjia Liu {code:java} spark.conf.set("spark.sql.exchange.reuse","true") spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load() df.createOrReplaceTempView(table) df = spark.sql(""" WITH t1 AS ( SELECT d_year, d_month_seq FROM ( SELECT t1.d_year , t2.d_month_seq FROM date_dim t1 cross join date_dim t2 where t1.d_day_name = "Monday" and t1.d_fy_year > 2000 and t2.d_day_name = "Monday" and t2.d_fy_year > 2000 ) GROUP BY d_year, d_month_seq) SELECT prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq FROM t1 curr_yr cross join t1 prev_yr WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1 ORDER BY d_month_seq LIMIT 100 """) df.explain() df.show(){code} the repro query : A. defines a temp table t1 B. cross join t1 (year 2002) and t2 (year 2001) With reuse exchange enabled, the plan incorrectly "decides" to re-use persisted shuffle writes of A filtering on year 2002 , for year 2001. {code:java} == Physical Plan == TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L]) +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS year#24367L, d_month_seq#24371L] +- CartesianProduct :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], functions=[]) : +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200) : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], functions=[]) :+- BroadcastNestedLoopJoin BuildRight, Cross : :- *(1) Project [d_year#23551L] : : +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: [table=tpcds_1G.date_dim,paths=[]]) : +- BroadcastExchange IdentityBroadcastMode : +- *(2) Project [d_month_seq#24371L] : +- *(2) ScanV2 BigQueryDataSourceV2[d_month_seq#24371L] (Filters: [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: [table=tpcds_1G.date_dim,paths=[]]) +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], functions=[]) +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code} And the result is obviously incorrect because prev_year should be 2001 {code:java} +-++---+ |prev_year|year|d_month_seq| +-++---+ | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| | 2002|2002| 1212| +-++---+ only showing top 20 rows {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32350) Add batch write support on LevelDB to improve performance of HybridStore
[ https://issues.apache.org/jira/browse/SPARK-32350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177112#comment-17177112 ] Apache Spark commented on SPARK-32350: -- User 'baohe-zhang' has created a pull request for this issue: https://github.com/apache/spark/pull/29425 > Add batch write support on LevelDB to improve performance of HybridStore > > > Key: SPARK-32350 > URL: https://issues.apache.org/jira/browse/SPARK-32350 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.1, 3.1.0 >Reporter: Baohe Zhang >Assignee: Baohe Zhang >Priority: Major > Fix For: 3.1.0 > > > The idea is to improve the performance of HybridStore by adding batch write > support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 > introduces HybridStore. HybridStore will write data to InMemoryStore at first > and use a background thread to dump data to LevelDB once the writing to > InMemoryStore is completed. In the comments section of > [https://github.com/apache/spark/pull/28412], Mridul Muralidharan mentioned > using batch writing can improve the performance of this dumping process and > he wrote the code of writeAll(). > I did the comparison of the HybridStore switching time between one-by-one > write and batch write on an HDD disk. When the disk is free, the batch-write > has around 25% improvement, and when the disk is 100% busy, the batch-write > has 7x - 10x improvement. > when the disk is at 0% utilization: > > ||log size, jobs and tasks per job||original switching time, with > write()||switching time with writeAll()|| > |133m, 400 jobs, 100 tasks per job|16s|13s| > |265m, 400 jobs, 200 tasks per job|30s|23s| > |1.3g, 1000 jobs, 400 tasks per job|136s|108s| > > when the disk is at 100% utilization: > ||log size, jobs and tasks per job||original switching time, with > write()||switching time with writeAll()|| > |133m, 400 jobs, 100 tasks per job|116s|17s| > |265m, 400 jobs, 200 tasks per job|251s|26s| > I also ran some write related benchmarking tests on LevelDBBenchmark.java and > measured the total time of writing 1024 objects. > when the disk is at 0% utilization: > > ||Benchmark test||with write(), ms||with writeAll(), ms || > |randomUpdatesIndexed|213.060|157.356| > |randomUpdatesNoIndex|57.869|35.439| > |randomWritesIndexed|298.854|229.274| > |randomWritesNoIndex|66.764|38.361| > |sequentialUpdatesIndexed|87.019|56.219| > |sequentialUpdatesNoIndex|61.851|41.942| > |sequentialWritesIndexed|94.044|56.534| > |sequentialWritesNoIndex|118.345|66.483| > > when the disk is at 50% utilization: > ||Benchmark test||with write(), ms||with writeAll(), ms|| > |randomUpdatesIndexed|230.386|180.817| > |randomUpdatesNoIndex|58.935|50.113| > |randomWritesIndexed|315.241|254.400| > |randomWritesNoIndex|96.709|41.164| > |sequentialUpdatesIndexed|89.971|70.387| > |sequentialUpdatesNoIndex|72.021|53.769| > |sequentialWritesIndexed|103.052|67.358| > |sequentialWritesNoIndex|76.194|99.037| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32350) Add batch write support on LevelDB to improve performance of HybridStore
[ https://issues.apache.org/jira/browse/SPARK-32350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177110#comment-17177110 ] Apache Spark commented on SPARK-32350: -- User 'baohe-zhang' has created a pull request for this issue: https://github.com/apache/spark/pull/29425 > Add batch write support on LevelDB to improve performance of HybridStore > > > Key: SPARK-32350 > URL: https://issues.apache.org/jira/browse/SPARK-32350 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.1, 3.1.0 >Reporter: Baohe Zhang >Assignee: Baohe Zhang >Priority: Major > Fix For: 3.1.0 > > > The idea is to improve the performance of HybridStore by adding batch write > support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 > introduces HybridStore. HybridStore will write data to InMemoryStore at first > and use a background thread to dump data to LevelDB once the writing to > InMemoryStore is completed. In the comments section of > [https://github.com/apache/spark/pull/28412], Mridul Muralidharan mentioned > using batch writing can improve the performance of this dumping process and > he wrote the code of writeAll(). > I did the comparison of the HybridStore switching time between one-by-one > write and batch write on an HDD disk. When the disk is free, the batch-write > has around 25% improvement, and when the disk is 100% busy, the batch-write > has 7x - 10x improvement. > when the disk is at 0% utilization: > > ||log size, jobs and tasks per job||original switching time, with > write()||switching time with writeAll()|| > |133m, 400 jobs, 100 tasks per job|16s|13s| > |265m, 400 jobs, 200 tasks per job|30s|23s| > |1.3g, 1000 jobs, 400 tasks per job|136s|108s| > > when the disk is at 100% utilization: > ||log size, jobs and tasks per job||original switching time, with > write()||switching time with writeAll()|| > |133m, 400 jobs, 100 tasks per job|116s|17s| > |265m, 400 jobs, 200 tasks per job|251s|26s| > I also ran some write related benchmarking tests on LevelDBBenchmark.java and > measured the total time of writing 1024 objects. > when the disk is at 0% utilization: > > ||Benchmark test||with write(), ms||with writeAll(), ms || > |randomUpdatesIndexed|213.060|157.356| > |randomUpdatesNoIndex|57.869|35.439| > |randomWritesIndexed|298.854|229.274| > |randomWritesNoIndex|66.764|38.361| > |sequentialUpdatesIndexed|87.019|56.219| > |sequentialUpdatesNoIndex|61.851|41.942| > |sequentialWritesIndexed|94.044|56.534| > |sequentialWritesNoIndex|118.345|66.483| > > when the disk is at 50% utilization: > ||Benchmark test||with write(), ms||with writeAll(), ms|| > |randomUpdatesIndexed|230.386|180.817| > |randomUpdatesNoIndex|58.935|50.113| > |randomWritesIndexed|315.241|254.400| > |randomWritesNoIndex|96.709|41.164| > |sequentialUpdatesIndexed|89.971|70.387| > |sequentialUpdatesNoIndex|72.021|53.769| > |sequentialWritesIndexed|103.052|67.358| > |sequentialWritesNoIndex|76.194|99.037| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32608) Script Transform DELIMIT value error
[ https://issues.apache.org/jira/browse/SPARK-32608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-32608: -- Description: For SQL {code:java} SELECT TRANSFORM(a, b, c) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' NULL DEFINED AS 'null' USING 'cat' AS (a, b, c) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' NULL DEFINED AS 'NULL' FROM testData {code} The correct TOK_TABLEROWFORMATFIELD should be , but is actually ','; TOK_TABLEROWFORMATLINES should be \n but is actually '\n' > Script Transform DELIMIT value error > - > > Key: SPARK-32608 > URL: https://issues.apache.org/jira/browse/SPARK-32608 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > For SQL > > {code:java} > SELECT TRANSFORM(a, b, c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LINES TERMINATED BY '\n' > NULL DEFINED AS 'null' > USING 'cat' AS (a, b, c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LINES TERMINATED BY '\n' > NULL DEFINED AS 'NULL' > FROM testData > {code} > The correct > TOK_TABLEROWFORMATFIELD should be , but is actually ',' > TOK_TABLEROWFORMATLINES should be \n but is actually '\n' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
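Put differently, the values parsed from FIELDS TERMINATED BY and LINES TERMINATED BY keep their surrounding single quotes (and the literal \n escape) instead of being unescaped. A small Python illustration of the expected unescaping (a hypothetical helper, not the actual parser change):
{code:python}
def unescape_delimiter(token):
    # drop the surrounding single quotes of the SQL string literal,
    # then turn escape sequences such as \n into the real character
    return token.strip("'").encode("utf-8").decode("unicode_escape")

assert unescape_delimiter("','") == ","      # TOK_TABLEROWFORMATFIELD
assert unescape_delimiter(r"'\n'") == "\n"   # TOK_TABLEROWFORMATLINES
{code}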
[jira] [Created] (SPARK-32608) Script Transform DELIMIT value error
angerszhu created SPARK-32608: - Summary: Script Transform DELIMIT value error Key: SPARK-32608 URL: https://issues.apache.org/jira/browse/SPARK-32608 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32607) Script Transformation no-serde read line should respect `TOK_TABLEROWFORMATLINES`
[ https://issues.apache.org/jira/browse/SPARK-32607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-32607: -- Description: reader.readLine() should respect `TOK_TABLEROWFORMATLINES` {code:java} protected def createOutputIteratorWithoutSerde( writerThread: BaseScriptTransformationWriterThread, inputStream: InputStream, proc: Process, stderrBuffer: CircularBuffer): Iterator[InternalRow] = { new Iterator[InternalRow] { var curLine: String = null val reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8)) val outputRowFormat = ioschema.outputRowFormatMap("TOK_TABLEROWFORMATFIELD") val processRowWithoutSerde = if (!ioschema.schemaLess) { prevLine: String => new GenericInternalRow( prevLine.split(outputRowFormat) .zip(outputFieldWriters) .map { case (data, writer) => writer(data) }) } else { // In schema less mode, hive default serde will choose first two output column as output // if output column size less then 2, it will throw ArrayIndexOutOfBoundsException. // Here we change spark's behavior same as hive's default serde. // But in hive, TRANSFORM with schema less behavior like origin spark, we will fix this // to keep spark and hive behavior same in SPARK-32388 val kvWriter = CatalystTypeConverters.createToCatalystConverter(StringType) prevLine: String => new GenericInternalRow( prevLine.split(outputRowFormat).slice(0, 2) .map(kvWriter)) } override def hasNext: Boolean = { try { if (curLine == null) { curLine = reader.readLine() if (curLine == null) { checkFailureAndPropagate(writerThread, null, proc, stderrBuffer) return false } } true } catch { case NonFatal(e) => // If this exception is due to abrupt / unclean termination of `proc`, // then detect it and propagate a better exception message for end users checkFailureAndPropagate(writerThread, e, proc, stderrBuffer) throw e } } override def next(): InternalRow = { if (!hasNext) { throw new NoSuchElementException } val prevLine = curLine curLine = reader.readLine() processRowWithoutSerde(prevLine) } } } {code} was: reader.readLine() should respect `` {code:java} protected def createOutputIteratorWithoutSerde( writerThread: BaseScriptTransformationWriterThread, inputStream: InputStream, proc: Process, stderrBuffer: CircularBuffer): Iterator[InternalRow] = { new Iterator[InternalRow] { var curLine: String = null val reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8)) val outputRowFormat = ioschema.outputRowFormatMap("TOK_TABLEROWFORMATFIELD") val processRowWithoutSerde = if (!ioschema.schemaLess) { prevLine: String => new GenericInternalRow( prevLine.split(outputRowFormat) .zip(outputFieldWriters) .map { case (data, writer) => writer(data) }) } else { // In schema less mode, hive default serde will choose first two output column as output // if output column size less then 2, it will throw ArrayIndexOutOfBoundsException. // Here we change spark's behavior same as hive's default serde. 
// But in hive, TRANSFORM with schema less behavior like origin spark, we will fix this // to keep spark and hive behavior same in SPARK-32388 val kvWriter = CatalystTypeConverters.createToCatalystConverter(StringType) prevLine: String => new GenericInternalRow( prevLine.split(outputRowFormat).slice(0, 2) .map(kvWriter)) } override def hasNext: Boolean = { try { if (curLine == null) { curLine = reader.readLine() if (curLine == null) { checkFailureAndPropagate(writerThread, null, proc, stderrBuffer) return false } } true } catch { case NonFatal(e) => // If this exception is due to abrupt / unclean termination of `proc`, // then detect it and propagate a better exception message for end users checkFailureAndPropagate(writerThread, e, proc, stderrBuffer) throw e } } override def next(): InternalRow = { if (!hasNext) { throw new NoSuchElementException } val prevLine = curLine curLine = reader.readLine() processRowWithoutSerde(prevLine) } } } {code} > Script Transformation no-serde read line should respect > `TOK_TABLEROWFORMATLINES` > - > > Key: SPARK-32607 > URL: https://issues.apache
[jira] [Updated] (SPARK-32607) Script Transformation no-serde read line should respect `TOK_TABLEROWFORMATLINES`
[ https://issues.apache.org/jira/browse/SPARK-32607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-32607: -- Description: reader.readLine() should respect `` {code:java} protected def createOutputIteratorWithoutSerde( writerThread: BaseScriptTransformationWriterThread, inputStream: InputStream, proc: Process, stderrBuffer: CircularBuffer): Iterator[InternalRow] = { new Iterator[InternalRow] { var curLine: String = null val reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8)) val outputRowFormat = ioschema.outputRowFormatMap("TOK_TABLEROWFORMATFIELD") val processRowWithoutSerde = if (!ioschema.schemaLess) { prevLine: String => new GenericInternalRow( prevLine.split(outputRowFormat) .zip(outputFieldWriters) .map { case (data, writer) => writer(data) }) } else { // In schema less mode, hive default serde will choose first two output column as output // if output column size less then 2, it will throw ArrayIndexOutOfBoundsException. // Here we change spark's behavior same as hive's default serde. // But in hive, TRANSFORM with schema less behavior like origin spark, we will fix this // to keep spark and hive behavior same in SPARK-32388 val kvWriter = CatalystTypeConverters.createToCatalystConverter(StringType) prevLine: String => new GenericInternalRow( prevLine.split(outputRowFormat).slice(0, 2) .map(kvWriter)) } override def hasNext: Boolean = { try { if (curLine == null) { curLine = reader.readLine() if (curLine == null) { checkFailureAndPropagate(writerThread, null, proc, stderrBuffer) return false } } true } catch { case NonFatal(e) => // If this exception is due to abrupt / unclean termination of `proc`, // then detect it and propagate a better exception message for end users checkFailureAndPropagate(writerThread, e, proc, stderrBuffer) throw e } } override def next(): InternalRow = { if (!hasNext) { throw new NoSuchElementException } val prevLine = curLine curLine = reader.readLine() processRowWithoutSerde(prevLine) } } } {code} > Script Transformation no-serde read line should respect > `TOK_TABLEROWFORMATLINES` > - > > Key: SPARK-32607 > URL: https://issues.apache.org/jira/browse/SPARK-32607 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > reader.readLine() should respect `` > {code:java} > protected def createOutputIteratorWithoutSerde( > writerThread: BaseScriptTransformationWriterThread, > inputStream: InputStream, > proc: Process, > stderrBuffer: CircularBuffer): Iterator[InternalRow] = { > new Iterator[InternalRow] { > var curLine: String = null > val reader = new BufferedReader(new InputStreamReader(inputStream, > StandardCharsets.UTF_8)) > val outputRowFormat = > ioschema.outputRowFormatMap("TOK_TABLEROWFORMATFIELD") > val processRowWithoutSerde = if (!ioschema.schemaLess) { > prevLine: String => > new GenericInternalRow( > prevLine.split(outputRowFormat) > .zip(outputFieldWriters) > .map { case (data, writer) => writer(data) }) > } else { > // In schema less mode, hive default serde will choose first two output > column as output > // if output column size less then 2, it will throw > ArrayIndexOutOfBoundsException. > // Here we change spark's behavior same as hive's default serde. 
> // But in hive, TRANSFORM with schema less behavior like origin spark, > we will fix this > // to keep spark and hive behavior same in SPARK-32388 > val kvWriter = > CatalystTypeConverters.createToCatalystConverter(StringType) > prevLine: String => > new GenericInternalRow( > prevLine.split(outputRowFormat).slice(0, 2) > .map(kvWriter)) > } > override def hasNext: Boolean = { > try { > if (curLine == null) { > curLine = reader.readLine() > if (curLine == null) { > checkFailureAndPropagate(writerThread, null, proc, stderrBuffer) > return false > } > } > true > } catch { > case NonFatal(e) => > // If this exception is due to abrupt / unclean termination of > `proc`, > // then detect it and propagate a better exception message for end > users > checkFailureAndPropagate(writerThread, e, proc,
[jira] [Created] (SPARK-32607) Script Transformation no-serde read line should respect `TOK_TABLEROWFORMATLINES`
angerszhu created SPARK-32607: - Summary: Script Transformation no-serde read line should respect `TOK_TABLEROWFORMATLINES` Key: SPARK-32607 URL: https://issues.apache.org/jira/browse/SPARK-32607 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
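The underlying problem is that BufferedReader.readLine() always splits on newlines, so any LINES TERMINATED BY value other than \n is ignored when reading the script's output. A conceptual Python sketch of splitting a stream on a configurable record terminator (illustration only, not the Scala fix):
{code:python}
def read_records(stream, terminator=b"\n", chunk_size=4096):
    """Yield records split on `terminator` instead of assuming newline."""
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            if buf:
                yield buf          # trailing record without a terminator
            return
        buf += chunk
        while terminator in buf:
            record, buf = buf.split(terminator, 1)
            yield record
{code}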
[jira] [Commented] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177084#comment-17177084 ] Kayal commented on SPARK-32053: --- ~\AppData\Local\IBMWS\miniconda3\envs\desktop\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name) 327 "An error occurred while calling \{0}{1}\{2}.\n". --> 328 format(target_id, ".", name), value) 329 else: Py4JJavaError: An error occurred while calling o289.save. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:100) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopDataset$1(PairRDDFunctions.scala:1090) at org.apache.spark.rdd.PairRDDFunctions$$Lambda$2417.1D19A7B0.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1088) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$4(PairRDDFunctions.scala:1061) at org.apache.spark.rdd.PairRDDFunctions$$Lambda$2415.0FE34B70.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1026) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$3(PairRDDFunctions.scala:1008) at org.apache.spark.rdd.PairRDDFunctions$$Lambda$2414.1CBB0D40.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1007) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$2(PairRDDFunctions.scala:964) at org.apache.spark.rdd.PairRDDFunctions$$Lambda$2413.1D196EA0.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:962) at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$2(RDD.scala:1552) at org.apache.spark.rdd.RDD$$Lambda$2411.18FEB4E0.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1552) at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$1(RDD.scala:1538) at 
org.apache.spark.rdd.RDD$$Lambda$2410.1CA30180.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1538) at org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:413) at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$1(Pipeline.scala:250) at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$saveImpl$1$adapted(Pipeline.scala:247) at org.apache.spark.ml.Pipeline$SharedReadWrite$$$Lambda$2397.190AB010.apply(Unknown Source) at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) at org.apache.spark.ml.util.Instrumentation$$$Lambda$1390.18680E40.apply(Unknown Source) at scala.util.Try$.apply(Try.scala:213) at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) at org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:247) at org.apache.spark.ml.Pipeline$PipelineWriter.saveImpl(Pipeline.scala:206) at org.apache.spark.ml.util.MLWriter.save(ReadWrite.
[jira] [Commented] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177082#comment-17177082 ] Kayal commented on SPARK-32053: --- !image-2020-08-13-20-29-40-555.png! > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png, > image-2020-08-13-20-28-28-779.png, image-2020-08-13-20-29-40-555.png, > screenshot-1.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kayal updated SPARK-32053: -- Attachment: image-2020-08-13-20-29-40-555.png > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png, > image-2020-08-13-20-28-28-779.png, image-2020-08-13-20-29-40-555.png, > screenshot-1.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kayal updated SPARK-32053: -- Attachment: image-2020-08-13-20-28-28-779.png > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png, > image-2020-08-13-20-28-28-779.png, screenshot-1.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kayal updated SPARK-32053: -- Attachment: screenshot-1.png > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png, > screenshot-1.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177079#comment-17177079 ] Kayal commented on SPARK-32053: --- !image-2020-08-13-20-25-57-585.png! > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kayal updated SPARK-32053: -- Attachment: image-2020-08-13-20-25-57-585.png > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png, image-2020-08-13-20-25-57-585.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kayal updated SPARK-32053: -- Attachment: image-2020-08-13-20-24-57-309.png > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png, > image-2020-08-13-20-24-57-309.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177077#comment-17177077 ] Kayal commented on SPARK-32053: --- !image-2020-08-13-20-24-57-309.png! > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177073#comment-17177073 ] Kayal commented on SPARK-32053: --- The code to reproduce the issue in a Windows Jupyter notebook:
import pyspark
#from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext("local", "First App")
from pyspark.sql import SparkSession
sess = SparkSession(sc)
training = sess.createDataFrame([
    ("0L", "a b c d e WML", 1.0),
    ("1L", "b d", 0.0),
    ("2L", "WML f g h", 1.0),
    ("3L", "hadoop mapreduce", 0.0)], ["id", "text", "label"])
evaluation = sess.createDataFrame([
    ("4L", "a b c WML", 1.0),
    ("5L", "l m n o p", 0.0),
    ("6L", "WML g h i k", 1.0),
    ("7L", "apache hadoop zuzu", 0.0)], ["id", "text", "label"])
testing = sess.createDataFrame([
    ("4L", "a b c z WML"),
    ("5L", "l m n"),
    ("6L", "WML g h i j k"),
    ("7L", "apache hadoop")], ["id", "text"])
import traceback
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SQLContext as sql_context
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
stages = [tokenizer, hashingTF, lr]
pipeline = Pipeline(stages=stages)
model = pipeline.fit(training)
test_result = model.transform(testing)
pipeline.write().overwrite().save("tempfile")
The write operation fails with the error mentioned above. This is blocking our product delivery; please consider treating this as a high-priority blocker. Is there a workaround for this? Is Spark ML supported in PySpark on Windows? I also see the same error with the pipeline.save() method. > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Major > Attachments: image-2020-06-22-18-19-32-236.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
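All of the stack traces in this ticket fail inside saveAsTextFile / SparkHadoopWriter, which on Windows is commonly caused by the Hadoop native layer (winutils.exe, hadoop.dll) not being installed rather than by the ML writer itself. A hedged workaround sketch, assuming that is the cause here (the C:\hadoop path is a placeholder):
{code:python}
import os

# Assumption: winutils.exe and hadoop.dll are placed under C:\hadoop\bin
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = os.environ["PATH"] + os.pathsep + r"C:\hadoop\bin"

# Create the SparkContext/SparkSession only after setting the environment,
# then retry the failing call from the reproduction above:
# pipeline.write().overwrite().save("tempfile")
{code}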
[jira] [Updated] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kayal updated SPARK-32053: -- Priority: Blocker (was: Major) > pyspark save of serialized model is failing for windows. > > > Key: SPARK-32053 > URL: https://issues.apache.org/jira/browse/SPARK-32053 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Kayal >Priority: Blocker > Attachments: image-2020-06-22-18-19-32-236.png > > > {color:#172b4d}Hi, {color} > {color:#172b4d}We are using spark functionality to save the serialized model > to disk . On windows platform we are seeing save of the serialized model is > failing with the error: o288.save() failed. {color} > > > > !image-2020-06-22-18-19-32-236.png! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-32053) pyspark save of serialized model is failing for windows.
[ https://issues.apache.org/jira/browse/SPARK-32053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kayal reopened SPARK-32053: --- Hi, I have verified the issue in spark latest version 3.0.0 , the issue seems to be still there on windows. The problem is on windows when we try to pipline.write().overwrite().save(temp_dir) is failing with ~\AppData\Local\IBMWS\miniconda3\envs\desktop\lib\site-packages\pyspark\ml\util.py in save(self, path) 173 if not isinstance(path, basestring): 174 raise TypeError("path should be a basestring, got type %s" % type(path)) --> 175 self._jwrite.save(path) 176 177 def overwrite(self): ~\AppData\Local\IBMWS\miniconda3\envs\desktop\lib\site-packages\py4j\java_gateway.py in __call__(self, *args) 1303 answer = self.gateway_client.send_command(command) 1304 return_value = get_return_value( -> 1305 answer, self.gateway_client, self.target_id, self.name) 1306 1307 for temp_arg in temp_args: ~\AppData\Local\IBMWS\miniconda3\envs\desktop\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw) 129 def deco(*a, **kw): 130 try: --> 131 return f(*a, **kw) 132 except py4j.protocol.Py4JJavaError as e: 133 converted = convert_exception(e.java_exception) ~\AppData\Local\IBMWS\miniconda3\envs\desktop\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name) 326 raise Py4JJavaError( 327 "An error occurred while calling \{0}{1}\{2}.\n". --> 328 format(target_id, ".", name), value) 329 else: 330 raise Py4JError( Py4JJavaError: An error occurred while calling o662.save. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:100) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopDataset$1(PairRDDFunctions.scala:1090) at org.apache.spark.rdd.PairRDDFunctions$$Lambda$2417.1D19A7B0.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1088) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$4(PairRDDFunctions.scala:1061) at org.apache.spark.rdd.PairRDDFunctions$$Lambda$2415.0FE34B70.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1026) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$3(PairRDDFunctions.scala:1008) at org.apache.spark.rdd.PairRDDFunctions$$Lambda$2414.1CBB0D40.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1007) at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$2(PairRDDFunctions.scala:964) at 
org.apache.spark.rdd.PairRDDFunctions$$Lambda$2413.1D196EA0.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:962) at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$2(RDD.scala:1552) at org.apache.spark.rdd.RDD$$Lambda$2411.18FEB4E0.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1552) at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$1(RDD.scala:1538) at org.apache.spark.rdd.RDD$$Lambda$2410.1CA30180.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperat
[jira] [Commented] (SPARK-32587) SPARK SQL writing to JDBC target with bit datatype using Dataframe is writing NULL values
[ https://issues.apache.org/jira/browse/SPARK-32587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176996#comment-17176996 ] Hyukjin Kwon commented on SPARK-32587: -- Could you share the reproducible codes and the actual output? > SPARK SQL writing to JDBC target with bit datatype using Dataframe is writing > NULL values > - > > Key: SPARK-32587 > URL: https://issues.apache.org/jira/browse/SPARK-32587 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Mohit Dave >Priority: Major > > While writing to a target in SQL Server using Microsoft's SQL Server driver > using dataframe.write API the target is storing NULL values for BIT columns. > > Table definition > Azure SQL DB > 1)Create 2 tables with column type as bit > 2)Insert some record into 1 table > Create a SPARK job > 1)Create a Dataframe using spark.read with the following query > select from > 2)Write the dataframe to a target table with bit type as column. > > Observation : Bit type is getting converted to NULL at the target > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
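For reference, a minimal reproduction along the lines of the description would look roughly like this (connection details and table names are placeholders; the point to verify is whether the BIT column arrives as NULL in target_table):
{code:python}
jdbc_url = "jdbc:sqlserver://<host>:1433;databaseName=<db>"   # placeholder
props = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# source_table and target_table both define a BIT column
src = spark.read.jdbc(url=jdbc_url, table="source_table", properties=props)
src.printSchema()   # check how the BIT column was mapped (boolean is expected)

src.write.jdbc(url=jdbc_url, table="target_table", mode="append", properties=props)
{code}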
[jira] [Commented] (SPARK-32591) Add better api docs for stage level scheduling Resources
[ https://issues.apache.org/jira/browse/SPARK-32591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176994#comment-17176994 ] Hyukjin Kwon commented on SPARK-32591: -- +100!! > Add better api docs for stage level scheduling Resources > > > Key: SPARK-32591 > URL: https://issues.apache.org/jira/browse/SPARK-32591 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Priority: Major > > A question came up when we added offheap memory to be able to set in a > ResourceProfile executor resources. > [https://github.com/apache/spark/pull/28972/] > Based on that discussion we should add better api docs to explain what each > one does. Perhaps point to the corresponding configuration . -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
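For readers landing here, the API being documented looks roughly like the following in PySpark (a sketch against the Spark 3.1 pyspark.resource module; exact method names, and the offheap request discussed in the linked PR, should be checked against the final docs):
{code:python}
from pyspark.resource import (ExecutorResourceRequests, TaskResourceRequests,
                              ResourceProfileBuilder)

# Executor-side requests broadly mirror the spark.executor.* configs
ereqs = ExecutorResourceRequests().cores(4).memory("4g").memoryOverhead("1g")
# Task-side requests broadly mirror the spark.task.* configs
treqs = TaskResourceRequests().cpus(2)

# `build` is exposed as a property in the Python API (assumption)
rp = ResourceProfileBuilder().require(ereqs).require(treqs).build
rdd = sc.parallelize(range(100), 4).withResources(rp)
{code}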
[jira] [Commented] (SPARK-32601) Issue in converting an RDD of Arrow RecordBatches in v3.0.0
[ https://issues.apache.org/jira/browse/SPARK-32601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176993#comment-17176993 ] Hyukjin Kwon commented on SPARK-32601: -- {{ArrowSerializer}} isn't supposed to be an API. Can this be reproduced by using regular APIs in Spark? > Issue in converting an RDD of Arrow RecordBatches in v3.0.0 > --- > > Key: SPARK-32601 > URL: https://issues.apache.org/jira/browse/SPARK-32601 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Tanveer >Priority: Major > > The following simple code snippet for converting an RDD of Arrow > RecordBatches works perfectly in Spark v2.3.4. > > {code:java} > // code placeholder > from pyspark.sql import SparkSession > import pyspark > import pyarrow as pa > from pyspark.serializers import ArrowSerializer > def _arrow_record_batch_dumps(rb): > # Fix for interoperability between pyarrow version >=0.15 and Spark's > arrow version > # Streaming message protocol has changed, remove setting when upgrading > spark. > import os > os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1' > > return bytearray(ArrowSerializer().dumps(rb)) > def rb_return(ardd): > data = [ > pa.array(range(5), type='int16'), > pa.array([-10, -5, 0, None, 10], type='int32') > ] > schema = pa.schema([pa.field('c0', pa.int16()), > pa.field('c1', pa.int32())], >metadata={b'foo': b'bar'}) > return pa.RecordBatch.from_arrays(data, schema=schema) > if __name__ == '__main__': > spark = SparkSession \ > .builder \ > .appName("Python Arrow-in-Spark example") \ > .getOrCreate() > # Enable Arrow-based columnar data transfers > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > sc = spark.sparkContext > ardd = spark.sparkContext.parallelize([0,1,2], 3) > ardd = ardd.map(rb_return) > from pyspark.sql.types import from_arrow_schema > from pyspark.sql.dataframe import DataFrame > from pyspark.serializers import ArrowSerializer, PickleSerializer, > AutoBatchedSerializer > # Filter out and cache arrow record batches > ardd = ardd.filter(lambda x: isinstance(x, pa.RecordBatch)).cache() > ardd = ardd.map(_arrow_record_batch_dumps) > schema = pa.schema([pa.field('c0', pa.int16()), > pa.field('c1', pa.int32())], >metadata={b'foo': b'bar'}) > schema = from_arrow_schema(schema) > jrdd = ardd._to_java_object_rdd() > jdf = spark._jvm.PythonSQLUtils.arrowPayloadToDataFrame(jrdd, > schema.json(), spark._wrapped._jsqlContext) > df = DataFrame(jdf, spark._wrapped) > df._schema = schema > df.show() > {code} > > But after updating to Spark to v3.0.0, the same functionality with just > changing arrowPayloadToDataFrame() -> toDataFrame() doesn't work. > > {code:java} > // code placeholder > from pyspark.sql import SparkSession > import pyspark > import pyarrow as pa > #from pyspark.serializers import ArrowSerializerdef dumps(batch): > import pyarrow as pa > import io > sink = io.BytesIO() > writer = pa.RecordBatchFileWriter(sink, batch.schema) > writer.write_batch(batch) > writer.close() > return sink.getvalue()def _arrow_record_batch_dumps(rb): > # Fix for interoperability between pyarrow version >=0.15 and Spark's > arrow version > # Streaming message protocol has changed, remove setting when upgrading > spark. 
> #import os > #os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'#return > bytearray(ArrowSerializer().dumps(rb)) > return bytearray(dumps(rb)) > def rb_return(ardd): > data = [ > pa.array(range(5), type='int16'), > pa.array([-10, -5, 0, None, 10], type='int32') > ] > schema = pa.schema([pa.field('c0', pa.int16()), > pa.field('c1', pa.int32())], >metadata={b'foo': b'bar'}) > return pa.RecordBatch.from_arrays(data, schema=schema)if __name__ == > '__main__': > spark = SparkSession \ > .builder \ > .appName("Python Arrow-in-Spark example") \ > .getOrCreate()# Enable Arrow-based columnar data transfers > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > sc = spark.sparkContextardd = spark.sparkContext.parallelize([0,1,2], > 3) > ardd = ardd.map(rb_return)from pyspark.sql.pandas.types import > from_arrow_schema > from pyspark.sql.dataframe import DataFrame# Filter out and cache > arrow record batches > ardd = ardd.filter(lambda x: isinstance(x, pa.RecordBatch)).cache() > ardd = ardd.map(_arrow_record_batch_dumps)schema = > pa.schema([pa.field('c0', p
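One route that uses only regular APIs (a sketch, assuming each RecordBatch fits comfortably in memory) is to go through pandas instead of the removed ArrowSerializer: convert each batch to rows and let createDataFrame rebuild the schema.
{code:python}
from pyspark.sql import Row

def batch_to_rows(rb):
    # pyarrow.RecordBatch -> pandas.DataFrame -> list of Row objects
    return [Row(**rec) for rec in rb.to_pandas().to_dict("records")]

# `ardd` is the RDD of RecordBatches from the snippet above
rows_rdd = ardd.flatMap(batch_to_rows)
df = spark.createDataFrame(rows_rdd)
df.show()
{code}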
[jira] [Commented] (SPARK-32604) Bug in ALSModel Python Documentation
[ https://issues.apache.org/jira/browse/SPARK-32604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176992#comment-17176992 ] Hyukjin Kwon commented on SPARK-32604: -- Are you interested in submitting a PR to fix? > Bug in ALSModel Python Documentation > > > Key: SPARK-32604 > URL: https://issues.apache.org/jira/browse/SPARK-32604 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.0, 3.0.0 >Reporter: Zach Cahoone >Priority: Minor > > In the ALSModel documentation > ([https://spark.apache.org/docs/latest/ml-collaborative-filtering.html]), > there is a bug which causes data frame creation to fail with the following > error: > {code:java} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 > (TID 15, 10.0.0.133, executor 10): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, > in main > process() > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, > in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 390, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/usr/lib/spark/python/pyspark/rdd.py", line 1354, in takeUpToNumLeft > yield next(iterator) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in > wrapper > return f(*args, **kwargs) > File "", line 24, in > NameError: name 'long' is not defined > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at 
org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuf
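For reference, the NameError points at the ratings-parsing snippet on that documentation page, which still uses the Python 2 built-in long; under Python 3 the example fails at that line. Assuming that snippet, the fix is to use int:
{code:python}
from pyspark.sql import Row

lines = spark.read.text("data/mllib/als/sample_movielens_ratings.txt").rdd
parts = lines.map(lambda row: row.value.split("::"))
# Python 3 has no `long`; the documented timestamp=long(p[3]) should be int(p[3])
ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
                                     rating=float(p[2]), timestamp=int(p[3])))
ratings = spark.createDataFrame(ratingsRDD)
{code}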
[jira] [Updated] (SPARK-32604) Bug in ALSModel Python Documentation
[ https://issues.apache.org/jira/browse/SPARK-32604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32604: - Component/s: PySpark > Bug in ALSModel Python Documentation > > > Key: SPARK-32604 > URL: https://issues.apache.org/jira/browse/SPARK-32604 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.0, 3.0.0 >Reporter: Zach Cahoone >Priority: Minor > > In the ALSModel documentation > ([https://spark.apache.org/docs/latest/ml-collaborative-filtering.html]), > there is a bug which causes data frame creation to fail with the following > error: > {code:java} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 > (TID 15, 10.0.0.133, executor 10): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, > in main > process() > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, > in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 390, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/usr/lib/spark/python/pyspark/rdd.py", line 1354, in takeUpToNumLeft > yield next(iterator) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in > wrapper > return f(*args, **kwargs) > File "", line 24, in > NameError: name 'long' is not defined > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(
[jira] [Updated] (SPARK-32604) Bug in ALSModel Python Documentation
[ https://issues.apache.org/jira/browse/SPARK-32604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32604: - Target Version/s: (was: 3.0.0) > Bug in ALSModel Python Documentation > > > Key: SPARK-32604 > URL: https://issues.apache.org/jira/browse/SPARK-32604 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.4.0, 3.0.0 >Reporter: Zach Cahoone >Priority: Minor > > In the ALSModel documentation > ([https://spark.apache.org/docs/latest/ml-collaborative-filtering.html]), > there is a bug which causes data frame creation to fail with the following > error: > {code:java} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 > (TID 15, 10.0.0.133, executor 10): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, > in main > process() > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, > in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 390, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/usr/lib/spark/python/pyspark/rdd.py", line 1354, in takeUpToNumLeft > yield next(iterator) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in > wrapper > return f(*args, **kwargs) > File "", line 24, in > NameError: name 'long' is not defined > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortSta
[jira] [Commented] (SPARK-32604) Bug in ALSModel Python Documentation
[ https://issues.apache.org/jira/browse/SPARK-32604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176990#comment-17176990 ] Hyukjin Kwon commented on SPARK-32604: -- Please avoid setting target version which is reserved for committers in general. > Bug in ALSModel Python Documentation > > > Key: SPARK-32604 > URL: https://issues.apache.org/jira/browse/SPARK-32604 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.4.0, 3.0.0 >Reporter: Zach Cahoone >Priority: Minor > > In the ALSModel documentation > ([https://spark.apache.org/docs/latest/ml-collaborative-filtering.html]), > there is a bug which causes data frame creation to fail with the following > error: > {code:java} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 > (TID 15, 10.0.0.133, executor 10): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, > in main > process() > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, > in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 390, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/usr/lib/spark/python/pyspark/rdd.py", line 1354, in takeUpToNumLeft > yield next(iterator) > File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in > wrapper > return f(*args, **kwargs) > File "", line 24, in > NameError: name 'long' is not defined > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:153) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1890) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.Ar
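The NameError above comes from the example in the collaborative-filtering guide using Python 2's long() builtin, which does not exist in Python 3. A minimal sketch of the likely fix follows (the file path and column names mirror the documented example and are assumptions here; the only substantive change is replacing long() with int()):
{code:python}
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Parse the sample ratings file as in the ALS documentation example.
lines = spark.read.text("data/mllib/als/sample_movielens_ratings.txt").rdd
parts = lines.map(lambda row: row.value.split("::"))

# Python 3 fix: use int() instead of the removed Python 2 long() builtin.
ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
                                     rating=float(p[2]), timestamp=int(p[3])))
ratings = spark.createDataFrame(ratingsRDD)
{code}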
[jira] [Commented] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeachBatch in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176966#comment-17176966 ] JinxinTang commented on SPARK-32500: Kindly ping [~hyukjin.kwon] It's my pleasure : ) > Query and Batch Id not set for Structured Streaming Jobs in case of > ForeachBatch in PySpark > --- > > Key: SPARK-32500 > URL: https://issues.apache.org/jira/browse/SPARK-32500 > Project: Spark > Issue Type: Bug > Components: PySpark, Structured Streaming > Affects Versions: 2.4.6 > Reporter: Abhishek Dixit > Priority: Major > Attachments: Screen Shot 2020-07-26 at 6.50.39 PM.png, Screen Shot 2020-07-30 at 9.04.21 PM.png, image-2020-08-01-10-21-51-246.png > > > Query Id and Batch Id information is not available for jobs started by a structured streaming query when the _foreachBatch_ API is used in PySpark. > This happens only with foreachBatch in PySpark. ForeachBatch in Scala works fine, and other structured streaming sinks in PySpark also work fine. I am attaching a screenshot of the Jobs pages. > I think the job group is not set properly when _foreachBatch_ is used via PySpark. I have a framework that depends on the _queryId_ and _batchId_ information available in the job properties, so my framework doesn't work for the PySpark foreachBatch use case. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
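For reference, a minimal PySpark sketch of the reported setup (the rate source and the trivial batch function are placeholders, not taken from the ticket): jobs triggered inside the foreachBatch function are the ones that show up without the streaming query's job group, so queryId/batchId are missing from their job properties.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder streaming source; any source should reproduce the behaviour.
stream_df = spark.readStream.format("rate").load()

def handle_batch(batch_df, batch_id):
    # Jobs started here run without the query/batch id job group
    # when foreachBatch is used from PySpark, per this ticket.
    batch_df.count()

query = stream_df.writeStream.foreachBatch(handle_batch).start()
query.awaitTermination(10)
query.stop()
{code}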
[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal
[ https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176955#comment-17176955 ] Wenchen Fan commented on SPARK-32018: - [~Gengliang.Wang] we should create a new JIRA ticket for the new fix. The new fix is not applicable to 2.4 as 2.4 does not have ANSI mode. > Fix UnsafeRow set overflowed decimal > > > Key: SPARK-32018 > URL: https://issues.apache.org/jira/browse/SPARK-32018 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.6, 3.0.0 > Reporter: Allison Wang > Assignee: Wenchen Fan > Priority: Major > Fix For: 2.4.7, 3.0.1, 3.1.0 > > > There is a bug where writing an overflowed decimal into UnsafeRow succeeds, but reading it back throws ArithmeticException. The exception is thrown when {{getDecimal}} is called on UnsafeRow and the stored decimal's precision is greater than the precision requested by the caller. Setting the value of the overflowed decimal to null when writing into UnsafeRow should fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
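As a rough SQL-level illustration of the overflow handling being discussed (a behaviour sketch only, not the internal UnsafeRow code path the ticket fixes): a decimal value that does not fit its declared precision is represented as null under Spark's default settings, which is the same treatment the proposed fix applies when writing an overflowed decimal into UnsafeRow; ANSI mode in 3.0+ is what can turn such overflows into errors instead.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 12345.678 does not fit DECIMAL(5,2); under the default (non-ANSI) settings
# the cast is expected to yield null rather than an out-of-range decimal.
spark.sql("SELECT CAST('12345.678' AS DECIMAL(5,2)) AS d").show()

# Multiplying two large decimals overflows the maximum precision (38) and
# also yields null here by default; with spark.sql.ansi.enabled=true (3.0+)
# such overflow is reported as an error, which is why the new fix is said
# not to apply to 2.4.
spark.sql("""SELECT CAST('99999999999999999999' AS DECIMAL(38,0))
                  * CAST('99999999999999999999' AS DECIMAL(38,0)) AS p""").show()
{code}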
[jira] [Assigned] (SPARK-32244) Build and run the Spark with test cases in Github Actions
[ https://issues.apache.org/jira/browse/SPARK-32244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32244: Assignee: Hyukjin Kwon > Build and run the Spark with test cases in Github Actions > - > > Key: SPARK-32244 > URL: https://issues.apache.org/jira/browse/SPARK-32244 > Project: Spark > Issue Type: Umbrella > Components: Project Infra > Affects Versions: 2.4.6, 3.0.0, 3.1.0 > Reporter: Hyukjin Kwon > Assignee: Hyukjin Kwon > Priority: Critical > > From last week onwards, the Jenkins machines have become very unstable for several reasons: > - Apparently, the machines became extremely slow; almost no tests can pass. > - One machine (worker 4) started to have a corrupt .m2 repository, which fails the build. > - The documentation build fails from time to time for an unknown reason, specifically on the Jenkins machines. > Almost all PRs are currently blocked by this instability. > This JIRA aims to run the tests in Github Actions: > - To avoid depending on the few people who can access the cluster. > - To reduce the elapsed time of the build - we could split the tests (e.g., SQL, ML, CORE) and run them in parallel so the total build time is significantly reduced. > - To control the environment more flexibly. > - To let other contributors test and propose fixes to the Github Actions configurations, so we can distribute this build management cost. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32606) Remove the fork of action-download-artifact in test_report.yml
Hyukjin Kwon created SPARK-32606: Summary: Remove the fork of action-download-artifact in test_report.yml Key: SPARK-32606 URL: https://issues.apache.org/jira/browse/SPARK-32606 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 3.1.0 Reporter: Hyukjin Kwon The fork of action-download-artifact was changed in https://github.com/HyukjinKwon/action-download-artifact/commit/750b71af351aba467757d7be6924199bb08db4ed in order to add support for downloading all artifacts. That change should be contributed back to the original plugin so we can avoid using the fork. Alternatively, we can use the official actions/download-artifact once it supports downloading artifacts between different workflows, see also https://github.com/actions/download-artifact/issues/3 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org