[jira] [Updated] (SPARK-30279) Support 32 or more grouping attributes for GROUPING_ID
[ https://issues.apache.org/jira/browse/SPARK-30279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-30279:
-------------------------------------
    Affects Version/s: 2.4.6

> Support 32 or more grouping attributes for GROUPING_ID
> ------------------------------------------------------
>
>                 Key: SPARK-30279
>                 URL: https://issues.apache.org/jira/browse/SPARK-30279
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0, 2.4.6
>            Reporter: Takeshi Yamamuro
>            Priority: Major
>
> This ticket targets support for 32 or more grouping attributes in GROUPING_ID. In the current master, an integer overflow can occur when computing grouping IDs;
> https://github.com/apache/spark/blob/e75d9afb2f282ce79c9fd8bce031287739326a4f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L613
> For example, the query below generates wrong grouping IDs in the master;
> {code}
> scala> val numCols = 32 // or, 31
> scala> val cols = (0 until numCols).map { i => s"c$i" }
> scala> sql(s"create table test_$numCols (${cols.map(c => s"$c int").mkString(",")}, v int) using parquet")
> scala> val insertVals = (0 until numCols).map { _ => 1 }.mkString(",")
> scala> sql(s"insert into test_$numCols values ($insertVals,3)")
> scala> sql(s"select grouping_id(), sum(v) from test_$numCols group by grouping sets ((${cols.mkString(",")}), (${cols.init.mkString(",")}))").show(10, false)
> scala> sql(s"drop table test_$numCols")
>
> // numCols = 32
> +-------------+------+
> |grouping_id()|sum(v)|
> +-------------+------+
> |0            |3     |
> |0            |3     |  // Wrong grouping ID
> +-------------+------+
>
> // numCols = 31
> +-------------+------+
> |grouping_id()|sum(v)|
> +-------------+------+
> |0            |3     |
> |1            |3     |
> +-------------+------+
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
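The collision in the ticket above follows from grouping IDs being bit vectors with one bit per grouping attribute. A minimal sketch of the arithmetic, in plain Python rather than Spark code (the signed 32-bit limit is an assumption drawn from the ticket's title, since Python integers themselves do not overflow):

```python
# Each grouping column contributes one bit to the grouping ID, so n grouping
# columns need IDs up to (1 << n) - 1. A signed 32-bit Int holds at most
# 2^31 - 1, so 31 columns fit but 32 do not -- distinct grouping sets can
# then collide on the same truncated ID, as in the reproduction above.

INT32_MAX = 2**31 - 1

def max_grouping_id(num_cols: int) -> int:
    # Largest possible grouping ID: all num_cols bits set.
    return (1 << num_cols) - 1

assert max_grouping_id(31) <= INT32_MAX  # representable in a signed Int
assert max_grouping_id(32) > INT32_MAX   # overflows a signed Int
```

Under that reading, widening the grouping ID to a 64-bit Long (as the ticket proposes) raises the limit from 31 to 63 grouping attributes.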
[jira] [Assigned] (SPARK-31061) Impossible to change the provider of a table in the HiveMetaStore
[ https://issues.apache.org/jira/browse/SPARK-31061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-31061:
-----------------------------------
    Assignee: Burak Yavuz

> Impossible to change the provider of a table in the HiveMetaStore
> -----------------------------------------------------------------
>
>                 Key: SPARK-31061
>                 URL: https://issues.apache.org/jira/browse/SPARK-31061
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Burak Yavuz
>            Assignee: Burak Yavuz
>            Priority: Major
>
> Currently, it's impossible to alter the datasource of a table in the HiveMetaStore by using alterTable, as the HiveExternalCatalog doesn't change the provider table property during an alterTable command. This is required to support changing table formats when using commands like REPLACE TABLE.
[jira] [Resolved] (SPARK-31061) Impossible to change the provider of a table in the HiveMetaStore
[ https://issues.apache.org/jira/browse/SPARK-31061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-31061.
---------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 27822
[https://github.com/apache/spark/pull/27822]

> Impossible to change the provider of a table in the HiveMetaStore
> -----------------------------------------------------------------
>
>                 Key: SPARK-31061
>                 URL: https://issues.apache.org/jira/browse/SPARK-31061
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Burak Yavuz
>            Assignee: Burak Yavuz
>            Priority: Major
>             Fix For: 3.0.0
>
> Currently, it's impossible to alter the datasource of a table in the HiveMetaStore by using alterTable, as the HiveExternalCatalog doesn't change the provider table property during an alterTable command. This is required to support changing table formats when using commands like REPLACE TABLE.
[jira] [Created] (SPARK-31068) IllegalArgumentException in BroadcastExchangeExec
Lantao Jin created SPARK-31068:
----------------------------------

             Summary: IllegalArgumentException in BroadcastExchangeExec
                 Key: SPARK-31068
                 URL: https://issues.apache.org/jira/browse/SPARK-31068
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Lantao Jin


{code}
Caused by: org.apache.spark.SparkException: Failed to materialize query stage: BroadcastQueryStage 0
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true], input[1, bigint, true], input[2, int, true]))
   +- *(1) Project [guid#138126, session_skey#138127L, seqnum#138132]
      +- *(1) Filter ((((isnotnull(session_start_dt#138129) && (session_start_dt#138129 = 2020-01-01)) && isnotnull(seqnum#138132)) && isnotnull(session_skey#138127L)) && isnotnull(guid#138126))
         +- *(1) FileScan parquet p_soj_cl_t.clav_events[guid#138126, session_skey#138127L, session_start_dt#138129, seqnum#138132] DataFilters: [isnotnull(session_start_dt#138129), (session_start_dt#138129 = 2020-01-01), isnotnull(seqnum#138..., Format: Parquet, Location: TahoeLogFileIndex[hdfs://hermes-rno/workspaces/P_SOJ_CL_T/clav_events], PartitionFilters: [], PushedFilters: [IsNotNull(session_start_dt), EqualTo(session_start_dt,2020-01-01), IsNotNull(seqnum), IsNotNull(..., ReadSchema: struct, SelectedBucketsCount: 1000 out of 1000, UsedIndexes: []

	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$$anonfun$generateFinalPlan$3.apply(AdaptiveSparkPlanExec.scala:230)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$$anonfun$generateFinalPlan$3.apply(AdaptiveSparkPlanExec.scala:225)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.generateFinalPlan(AdaptiveSparkPlanExec.scala:225)
	... 48 more
Caused by: java.lang.IllegalArgumentException: Initial capacity 670166426 exceeds maximum capacity of 536870912
	at org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:196)
	at org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:219)
	at org.apache.spark.sql.execution.joins.UnsafeHashedRelation$.apply(HashedRelation.scala:340)
	at org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:123)
	at org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:964)
	at org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:952)
	at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$9.apply(BroadcastExchangeExec.scala:220)
	at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$9.apply(BroadcastExchangeExec.scala:207)
	at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:128)
	at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:206)
	at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:172)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	... 3 more
{code}
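The capacity check that fails in the trace above is a power-of-two limit: 536870912 is exactly 2^29. A minimal sketch of the arithmetic in plain Python (the reading that the requested capacity is sized from the broadcast relation's estimated row count is an assumption, not something the trace confirms):

```python
# 536870912 from the error message is exactly 2^29, so any hash-map
# initial capacity above 2^29 entries is rejected up front, regardless
# of how much memory is actually available.
MAX_CAPACITY = 1 << 29   # 536870912, the maximum in the exception
requested = 670166426    # the initial capacity from the stack trace

assert MAX_CAPACITY == 536870912
assert requested > MAX_CAPACITY  # hence the IllegalArgumentException
```

If that reading is right, the broadcast side is simply asking for a hashed relation larger than a single map can ever hold, rather than running short of memory.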
[jira] [Created] (SPARK-31067) Spark 2.4.* SQL query with partition columns scans entire AVRO data
Gopal created SPARK-31067:
------------------------------

             Summary: Spark 2.4.* SQL query with partition columns scans entire AVRO data
                 Key: SPARK-31067
                 URL: https://issues.apache.org/jira/browse/SPARK-31067
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.4, 2.4.0
            Reporter: Gopal


Partition column: dt
SQL query: select distinct dt from table1
Table format: AVRO

The query scans the entire Avro data in the table just to get the distinct dt values.
[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master
[ https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053080#comment-17053080 ]

Nicholas Chammas commented on SPARK-31043:
------------------------------------------

FWIW I was seeing the same {{java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal}} issue on {{branch-3.0}} and pulling the latest changes fixed it for me too.

> Spark 3.0 built against hadoop2.7 can't start standalone master
> ---------------------------------------------------------------
>
>                 Key: SPARK-31043
>                 URL: https://issues.apache.org/jira/browse/SPARK-31043
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 3.0.0
>            Reporter: Thomas Graves
>            Priority: Critical
>             Fix For: 3.0.0
>
> Trying to start a standalone master when building Spark branch-3.0 with hadoop2.7 fails with:
>
> Exception in thread "main" java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal
>         at java.lang.ClassLoader.defineClass1(Native Method)
>         at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
>         at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>         at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>         at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> ...
> Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> ... 42 more
[jira] [Commented] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()
[ https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053079#comment-17053079 ] Nicholas Chammas commented on SPARK-31065: -- Confirmed this issue is also present on {{branch-3.0}} as of commit {{9b48f3358d3efb523715a5f258e5ed83e28692f6}}. > Empty string values cause schema_of_json() to return a schema not usable by > from_json() > --- > > Key: SPARK-31065 > URL: https://issues.apache.org/jira/browse/SPARK-31065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Here's a reproduction: > > {code:python} > from pyspark.sql.functions import from_json, schema_of_json > json = '{"a": ""}' > df = spark.createDataFrame([(json,)], schema=['json']) > df.show() > # chokes with org.apache.spark.sql.catalyst.parser.ParseException > json_schema = schema_of_json(json) > df.select(from_json('json', json_schema)) > # works fine > json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema > df.select(from_json('json', json_schema)) > {code} > The output: > {code:java} > >>> from pyspark.sql.functions import from_json, schema_of_json > >>> json = '{"a": ""}' > >>> > >>> df = spark.createDataFrame([(json,)], schema=['json']) > >>> df.show() > +-+ > | json| > +-+ > |{"a": ""}| > +-+ > >>> > >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException > >>> json_schema = schema_of_json(json) > >>> df.select(from_json('json', json_schema)) > Traceback (most recent call last): > File ".../site-packages/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File > ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.sql.functions.from_json. 
> : org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', > 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', > 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', > 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', > 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', > 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', > 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', > 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', > 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', > 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', > 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', > 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', > 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', > 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', > 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', > 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', > 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', > 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', > 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', > 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', > 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', > 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', > 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', > 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', > 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', > 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
'PARTITIONED', > 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', > 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', > 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', > 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6) > == SQL == > struct > --^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64) > at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123) > at > org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpress
[jira] [Updated] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()
[ https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31065: - Affects Version/s: 3.0.0 > Empty string values cause schema_of_json() to return a schema not usable by > from_json() > --- > > Key: SPARK-31065 > URL: https://issues.apache.org/jira/browse/SPARK-31065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Here's a reproduction: > > {code:python} > from pyspark.sql.functions import from_json, schema_of_json > json = '{"a": ""}' > df = spark.createDataFrame([(json,)], schema=['json']) > df.show() > # chokes with org.apache.spark.sql.catalyst.parser.ParseException > json_schema = schema_of_json(json) > df.select(from_json('json', json_schema)) > # works fine > json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema > df.select(from_json('json', json_schema)) > {code} > The output: > {code:java} > >>> from pyspark.sql.functions import from_json, schema_of_json > >>> json = '{"a": ""}' > >>> > >>> df = spark.createDataFrame([(json,)], schema=['json']) > >>> df.show() > +-+ > | json| > +-+ > |{"a": ""}| > +-+ > >>> > >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException > >>> json_schema = schema_of_json(json) > >>> df.select(from_json('json', json_schema)) > Traceback (most recent call last): > File ".../site-packages/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File > ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.sql.functions.from_json. 
> : org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', > 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', > 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', > 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', > 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', > 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', > 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', > 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', > 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', > 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', > 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', > 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', > 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', > 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', > 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', > 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', > 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', > 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', > 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', > 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', > 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', > 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', > 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', > 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', > 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', > 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
'PARTITIONED', > 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', > 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', > 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', > 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6) > == SQL == > struct > --^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64) > at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123) > at > org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpressions.scala:777) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.(jsonExpressions.scala:527) > at org.apache.spark.sq
[jira] [Created] (SPARK-31066) Disable useless and uncleaned hive SessionState initialization parts
Kent Yao created SPARK-31066:
--------------------------------

             Summary: Disable useless and uncleaned hive SessionState initialization parts
                 Key: SPARK-31066
                 URL: https://issues.apache.org/jira/browse/SPARK-31066
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0, 3.1.0
            Reporter: Kent Yao


As a common usage, and according to the Spark doc, users often just copy their hive-site.xml to Spark directly from Hive projects. Sometimes the config file is not that clean for Spark and can cause side effects. For example, hive.session.history.enabled will create a log file for the Hive jobs that is useless for Spark, and the file is not deleted on JVM exit.
[jira] [Commented] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()
[ https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053052#comment-17053052 ] Nicholas Chammas commented on SPARK-31065: -- cc [~hyukjin.kwon] > Empty string values cause schema_of_json() to return a schema not usable by > from_json() > --- > > Key: SPARK-31065 > URL: https://issues.apache.org/jira/browse/SPARK-31065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Nicholas Chammas >Priority: Minor > > Here's a reproduction: > > {code:python} > from pyspark.sql.functions import from_json, schema_of_json > json = '{"a": ""}' > df = spark.createDataFrame([(json,)], schema=['json']) > df.show() > # chokes with org.apache.spark.sql.catalyst.parser.ParseException > json_schema = schema_of_json(json) > df.select(from_json('json', json_schema)) > # works fine > json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema > df.select(from_json('json', json_schema)) > {code} > The output: > {code:java} > >>> from pyspark.sql.functions import from_json, schema_of_json > >>> json = '{"a": ""}' > >>> > >>> df = spark.createDataFrame([(json,)], schema=['json']) > >>> df.show() > +-+ > | json| > +-+ > |{"a": ""}| > +-+ > >>> > >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException > >>> json_schema = schema_of_json(json) > >>> df.select(from_json('json', json_schema)) > Traceback (most recent call last): > File ".../site-packages/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File > ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.sql.functions.from_json. 
> : org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', > 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', > 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', > 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', > 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', > 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', > 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', > 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', > 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', > 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', > 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', > 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', > 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', > 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', > 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', > 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', > 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', > 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', > 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', > 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', > 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', > 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', > 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', > 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', > 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', > 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
'PARTITIONED', > 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', > 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', > 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', > 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6) > == SQL == > struct > --^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64) > at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123) > at > org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpressions.scala:777) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.(jsonExpressions.
[jira] [Created] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()
Nicholas Chammas created SPARK-31065:
----------------------------------------

             Summary: Empty string values cause schema_of_json() to return a schema not usable by from_json()
                 Key: SPARK-31065
                 URL: https://issues.apache.org/jira/browse/SPARK-31065
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.5
            Reporter: Nicholas Chammas


Here's a reproduction:

{code:python}
from pyspark.sql.functions import from_json, schema_of_json

json = '{"a": ""}'
df = spark.createDataFrame([(json,)], schema=['json'])
df.show()

# chokes with org.apache.spark.sql.catalyst.parser.ParseException
json_schema = schema_of_json(json)
df.select(from_json('json', json_schema))

# works fine
json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema
df.select(from_json('json', json_schema))
{code}

The output:

{code:java}
>>> from pyspark.sql.functions import from_json, schema_of_json
>>> json = '{"a": ""}'
>>>
>>> df = spark.createDataFrame([(json,)], schema=['json'])
>>> df.show()
+---------+
|     json|
+---------+
|{"a": ""}|
+---------+
>>>
>>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
>>> json_schema = schema_of_json(json)
>>> df.select(from_json('json', json_schema))
Traceback (most recent call last):
  File ".../site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.from_json.
: org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 
'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6) == SQL == struct --^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64) at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123) at org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpressions.scala:777) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.(jsonExpressions.scala:527) at org.apache.spark.sql.functions$.from_json(functions.scala:3606) at org.apache.spark.sql.functions.from_json(functions.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Metho
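The failure above is in schema inference, not in the JSON data itself: the ParseException comes from the DDL string that schema_of_json() returns, while the input is perfectly valid JSON. A minimal sketch of that distinction, plus the explicit-schema workaround (the Spark part is shown as comments because it assumes a running SparkSession, which cannot be verified here):

```python
import json

# The input is valid JSON: an empty string is a legal value for key "a".
# The bug is only in the schema string that schema_of_json() produces.
payload = '{"a": ""}'
assert json.loads(payload) == {"a": ""}

# Hedged workaround sketch, assuming a SparkSession named `spark` and the
# DataFrame `df` from the reproduction above:
#
#   from pyspark.sql.functions import from_json
#   from pyspark.sql.types import StructType, StructField, StringType
#
#   explicit_schema = StructType([StructField("a", StringType())])
#   df.select(from_json("json", explicit_schema))
#
# i.e. supply the schema directly instead of round-tripping it through the
# DDL string that schema_of_json() returns.
```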
[jira] [Resolved] (SPARK-30776) Support FValueRegressionSelector for continuous features and continuous labels
[ https://issues.apache.org/jira/browse/SPARK-30776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng resolved SPARK-30776.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 27679
[https://github.com/apache/spark/pull/27679]

> Support FValueRegressionSelector for continuous features and continuous labels
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-30776
>                 URL: https://issues.apache.org/jira/browse/SPARK-30776
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 3.1.0
>            Reporter: Huaxin Gao
>            Assignee: Huaxin Gao
>            Priority: Major
>             Fix For: 3.1.0
>
> Support FValueRegressionSelector for continuous features and continuous labels
[jira] [Assigned] (SPARK-30776) Support FValueRegressionSelector for continuous features and continuous labels
[ https://issues.apache.org/jira/browse/SPARK-30776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng reassigned SPARK-30776:
------------------------------------
    Assignee: Huaxin Gao

> Support FValueRegressionSelector for continuous features and continuous labels
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-30776
>                 URL: https://issues.apache.org/jira/browse/SPARK-30776
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 3.1.0
>            Reporter: Huaxin Gao
>            Assignee: Huaxin Gao
>            Priority: Major
>
> Support FValueRegressionSelector for continuous features and continuous labels
[jira] [Assigned] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.
[ https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma reassigned SPARK-30985: --- Assignee: (was: Prashant Sharma) > Propagate SPARK_CONF_DIR files to driver and exec pods. > --- > > Key: SPARK-30985 > URL: https://issues.apache.org/jira/browse/SPARK-30985 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Prashant Sharma >Priority: Major > > SPARK_CONF_DIR hosts configuration files such as: > 1) spark-defaults.conf - containing all the Spark properties. > 2) log4j.properties - logger configuration. > 3) spark-env.sh - environment variables to be set up on the driver and executors. > 4) core-site.xml - Hadoop-related configuration. > 5) fairscheduler.xml - Spark's fair scheduling policy at the job level. > 6) metrics.properties - Spark metrics. > 7) Any user-specific, library- or framework-specific configuration file. > Traditionally, SPARK_CONF_DIR has been the home to all user-specific > configuration files, and the default behaviour in Yarn or standalone mode > is that these configuration files are copied to the worker nodes by the > users themselves as required. In other words, they are not auto-copied. > But in the case of Spark on Kubernetes, we use Spark images and generally > these images are approved or undergo some kind of standardisation. These > files cannot simply be copied to the SPARK_CONF_DIR of the running executor > and driver pods by the user. > So, at the moment we have special casing for providing each configuration, and > for any other user-specific configuration files the process is more complex, > e.g. one can start with their own custom image of Spark with > configuration files pre-installed, etc. > Examples of special casing are: > 1. Hadoop configuration in spark.kubernetes.hadoop.configMapName > 2. Spark-env.sh as in spark.kubernetes.driverEnv.[EnvironmentVariableName] > 3. 
Log4j.properties as in https://github.com/apache/spark/pull/26193 > ... And where such special casing does not exist, users are simply out of > luck. > So this feature will let user-specific configuration files be mounted in > the driver and executor pods' SPARK_CONF_DIR. > At the moment it is not clear if there is a need to let users specify which > config files to propagate to the driver and/or executors. But if there is a > case where that feature would be helpful, we can increase the scope of this work or > create another JIRA issue to track that work. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
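The direction proposed in SPARK-30985 can be sketched with a plain Kubernetes ConfigMap; every name, path, and file body below is illustrative only, not the eventual implementation:

```yaml
# Hypothetical sketch: package the SPARK_CONF_DIR files as a ConfigMap
# and mount it at the pods' conf directory. All names and paths here
# are made up for illustration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-conf-dir
data:
  spark-defaults.conf: |
    spark.eventLog.enabled true
  log4j.properties: |
    log4j.rootCategory=INFO, console
# In the driver/executor pod spec, this ConfigMap would then be mounted
# as a volume at the conf directory, e.g. mountPath: /opt/spark/conf
```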
[jira] [Commented] (SPARK-30930) ML, GraphX 3.0 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-30930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053017#comment-17053017 ] Sean R. Owen commented on SPARK-30930: -- My general stance is: anything that's been public for a while is pretty much stable now. Given the stronger preference for not modifying APIs going forward, I can't see changing or removing these any more readily than one would a 'stable' API. So I'd be OK removing Experimental / DeveloperApi on anything public right now, unless there are specific reasons not to. I would not un-seal or un-final classes unless there is a clear reason to believe it's intended to be an extensible API, however. > ML, GraphX 3.0 QA: API: Experimental, DeveloperApi, final, sealed audit > --- > > Key: SPARK-30930 > URL: https://issues.apache.org/jira/browse/SPARK-30930 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Critical > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31063) xml-apis is missing after xercesImpl bumped up to 2.12.0
[ https://issues.apache.org/jira/browse/SPARK-31063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-31063. -- Resolution: Not A Problem Issue fixed by a follow-up; not a problem anymore. > xml-apis is missing after xercesImpl bumped up to 2.12.0 > > > Key: SPARK-31063 > URL: https://issues.apache.org/jira/browse/SPARK-31063 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Blocker > > {code:java} > ✘ kentyao@hulk ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200305 > bin/spark-sql > 20/03/06 11:05:58 WARN Utils: Your hostname, hulk.local resolves to a > loopback address: 127.0.0.1; using 10.242.189.214 instead (on interface en0) > 20/03/06 11:05:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > Exception in thread "main" java.lang.NoClassDefFoundError: > org/w3c/dom/ElementTraversal > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:756) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) > at java.net.URLClassLoader.access$100(URLClassLoader.java:74) > at java.net.URLClassLoader$1.run(URLClassLoader.java:369) > at java.net.URLClassLoader$1.run(URLClassLoader.java:363) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:362) > at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355) > at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > at org.apache.xerces.parsers.AbstractDOMParser.startDocument(Unknown > Source) > at org.apache.xerces.xinclude.XIncludeHandler.startDocument(Unknown > Source) > at org.apache.xerces.impl.dtd.XMLDTDValidator.startDocument(Unknown > Source) > at 
org.apache.xerces.impl.XMLDocumentScannerImpl.startEntity(Unknown > Source) > at > org.apache.xerces.impl.XMLVersionDetector.startDocumentParsing(Unknown Source) > at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) > at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) > at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) > at org.apache.xerces.parsers.DOMParser.parse(Unknown Source) > at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source) > at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150) > at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2482) > at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2470) > at > org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2541) > at > org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2494) > at > org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2407) > at org.apache.hadoop.conf.Configuration.set(Configuration.java:1143) > at org.apache.hadoop.conf.Configuration.set(Configuration.java:1115) > at > org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopHiveConfigurations(SparkHadoopUtil.scala:456) > at > org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:427) > at > org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$2(SparkSubmit.scala:342) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:342) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) > at > 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loa
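The missing class in the trace above, org.w3c.dom.ElementTraversal, ships in the separate xml-apis artifact that xercesImpl 2.12.0 expects on the classpath. As a hedged illustration of the dependency involved (whether Spark's follow-up fix pinned this exact artifact or version is not shown here), declaring it in a Maven POM would look like:

```xml
<!-- Illustrative only: the xml-apis artifact provides
     org.w3c.dom.ElementTraversal, the class missing from the
     NoClassDefFoundError above. -->
<dependency>
  <groupId>xml-apis</groupId>
  <artifactId>xml-apis</artifactId>
  <version>1.4.01</version>
</dependency>
```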
[jira] [Assigned] (SPARK-31044) Support foldable input by `schema_of_json`
[ https://issues.apache.org/jira/browse/SPARK-31044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31044: --- Assignee: Maxim Gekk > Support foldable input by `schema_of_json` > -- > > Key: SPARK-31044 > URL: https://issues.apache.org/jira/browse/SPARK-31044 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > Currently, the `schema_of_json()` function allows only a string literal as the > input. The ticket aims to support any foldable string expression. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31025) Support foldable input by `schema_of_csv`
[ https://issues.apache.org/jira/browse/SPARK-31025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31025: --- Assignee: Maxim Gekk > Support foldable input by `schema_of_csv` > -- > > Key: SPARK-31025 > URL: https://issues.apache.org/jira/browse/SPARK-31025 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > Currently, the `schema_of_csv()` function allows only a string literal as the > input. The ticket aims to support any foldable string expression. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31025) Support foldable input by `schema_of_csv`
[ https://issues.apache.org/jira/browse/SPARK-31025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31025. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27804 [https://github.com/apache/spark/pull/27804] > Support foldable input by `schema_of_csv` > -- > > Key: SPARK-31025 > URL: https://issues.apache.org/jira/browse/SPARK-31025 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.1.0 > > > Currently, the `schema_of_csv()` function allows only a string literal as the > input. The ticket aims to support any foldable string expression. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31044) Support foldable input by `schema_of_json`
[ https://issues.apache.org/jira/browse/SPARK-31044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31044. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27804 [https://github.com/apache/spark/pull/27804] > Support foldable input by `schema_of_json` > -- > > Key: SPARK-31044 > URL: https://issues.apache.org/jira/browse/SPARK-31044 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.1.0 > > > Currently, the `schema_of_json()` function allows only a string literal as the > input. The ticket aims to support any foldable string expression. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
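For background on what "foldable" means in the schema_of_json/schema_of_csv tickets above, here is a minimal sketch; the classes are hypothetical stand-ins, not Spark's real Catalyst `Expression` tree, where the actual `foldable` property lives. An expression is foldable when it is built only from literals and deterministic functions of foldable children, so the optimizer can evaluate it to a constant before execution.

```python
# Minimal model of Catalyst's "foldable" notion (illustrative only).
class Expr:
    deterministic = True

    def __init__(self, *children):
        self.children = children

    @property
    def foldable(self):
        # a deterministic function of foldable inputs is itself foldable
        return self.deterministic and all(c.foldable for c in self.children)

class Literal(Expr):
    foldable = True   # a constant is trivially foldable

class AttributeRef(Expr):
    foldable = False  # depends on row data, cannot be pre-evaluated

# replace(<literal>, <literal>, <literal>) is foldable, so after this
# change schema_of_json/schema_of_csv can accept it:
assert Expr(Literal(), Literal(), Literal()).foldable
# ...but an expression reading a column is not:
assert not Expr(AttributeRef()).foldable
```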
[jira] [Resolved] (SPARK-31023) Support foldable schemas by `from_json`
[ https://issues.apache.org/jira/browse/SPARK-31023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31023. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27804 [https://github.com/apache/spark/pull/27804] > Support foldable schemas by `from_json` > --- > > Key: SPARK-31023 > URL: https://issues.apache.org/jira/browse/SPARK-31023 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.1.0 > > > Currently, Spark accepts only literals, or schema_of_json with a literal input, as > the schema parameter of from_json, and it fails on any other foldable expression, > for instance: > {code:sql} > spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id > INT, dpt_org_city STRING', 'dpt_org_', '')); > Error in query: Schema should be specified in DDL format as a string literal > or output of the schema_of_json function instead of replace('dpt_org_id INT, > dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7 > {code} > There is no reason to restrict users to literals. The ticket aims to > support any foldable schema in from_json(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31023) Support foldable schemas by `from_json`
[ https://issues.apache.org/jira/browse/SPARK-31023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31023: --- Assignee: Maxim Gekk > Support foldable schemas by `from_json` > --- > > Key: SPARK-31023 > URL: https://issues.apache.org/jira/browse/SPARK-31023 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > Currently, Spark accepts only literals, or schema_of_json with a literal input, as > the schema parameter of from_json, and it fails on any other foldable expression, > for instance: > {code:sql} > spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id > INT, dpt_org_city STRING', 'dpt_org_', '')); > Error in query: Schema should be specified in DDL format as a string literal > or output of the schema_of_json function instead of replace('dpt_org_id INT, > dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7 > {code} > There is no reason to restrict users to literals. The ticket aims to > support any foldable schema in from_json(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31020) Support foldable schemas by `from_csv`
[ https://issues.apache.org/jira/browse/SPARK-31020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31020: --- Assignee: Maxim Gekk > Support foldable schemas by `from_csv` > -- > > Key: SPARK-31020 > URL: https://issues.apache.org/jira/browse/SPARK-31020 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > Currently, Spark accepts only literals, or schema_of_csv with a literal input, as > the schema parameter of from_csv, and it fails on any other foldable expression, > for instance: > {code:sql} > spark-sql> select from_csv('1, 3.14', replace('dpt_org_id INT, dpt_org_city > STRING', 'dpt_org_', '')); > Error in query: Schema should be specified in DDL format as a string literal > or output of the schema_of_csv function instead of replace('dpt_org_id INT, > dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7 > {code} > There is no reason to restrict users to literals. The ticket aims to > support any foldable schema in from_csv(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31020) Support foldable schemas by `from_csv`
[ https://issues.apache.org/jira/browse/SPARK-31020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31020. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27804 [https://github.com/apache/spark/pull/27804] > Support foldable schemas by `from_csv` > -- > > Key: SPARK-31020 > URL: https://issues.apache.org/jira/browse/SPARK-31020 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.1.0 > > > Currently, Spark accepts only literals, or schema_of_csv with a literal input, as > the schema parameter of from_csv, and it fails on any other foldable expression, > for instance: > {code:sql} > spark-sql> select from_csv('1, 3.14', replace('dpt_org_id INT, dpt_org_city > STRING', 'dpt_org_', '')); > Error in query: Schema should be specified in DDL format as a string literal > or output of the schema_of_csv function instead of replace('dpt_org_id INT, > dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7 > {code} > There is no reason to restrict users to literals. The ticket aims to > support any foldable schema in from_csv(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
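The schema argument rejected in the from_json/from_csv examples above is nonetheless a constant: `replace` applied to string literals can be evaluated up front. A quick sketch of the folding result, with Python's `str.replace` standing in for the SQL `replace` function:

```python
# The SQL expression replace('dpt_org_id INT, dpt_org_city STRING',
# 'dpt_org_', '') reduces to a fixed DDL schema string, so nothing
# about it requires a literal at the call site.
ddl = 'dpt_org_id INT, dpt_org_city STRING'.replace('dpt_org_', '')
assert ddl == 'id INT, city STRING'
```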
[jira] [Assigned] (SPARK-31045) Add config for AQE logging level
[ https://issues.apache.org/jira/browse/SPARK-31045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31045: --- Assignee: Wei Xue > Add config for AQE logging level > > > Key: SPARK-31045 > URL: https://issues.apache.org/jira/browse/SPARK-31045 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Wei Xue >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31045) Add config for AQE logging level
[ https://issues.apache.org/jira/browse/SPARK-31045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31045. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27798 [https://github.com/apache/spark/pull/27798] > Add config for AQE logging level > > > Key: SPARK-31045 > URL: https://issues.apache.org/jira/browse/SPARK-31045 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Wei Xue >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30886) Deprecate two-parameter TRIM/LTRIM/RTRIM functions
[ https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30886. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27643 [https://github.com/apache/spark/pull/27643] > Deprecate two-parameter TRIM/LTRIM/RTRIM functions > -- > > Key: SPARK-30886 > URL: https://issues.apache.org/jira/browse/SPARK-30886 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > The Apache Spark community decided to keep the existing esoteric two-parameter > use cases with a proper warning. This JIRA aims to show the warning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30886) Deprecate two-parameter TRIM/LTRIM/RTRIM functions
[ https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30886: - Assignee: Dongjoon Hyun > Deprecate two-parameter TRIM/LTRIM/RTRIM functions > -- > > Key: SPARK-30886 > URL: https://issues.apache.org/jira/browse/SPARK-30886 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > The Apache Spark community decided to keep the existing esoteric two-parameter > use cases with a proper warning. This JIRA aims to show the warning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31064) New Parquet Predicate Filter APIs with multi-part Identifier Support
DB Tsai created SPARK-31064: --- Summary: New Parquet Predicate Filter APIs with multi-part Identifier Support Key: SPARK-31064 URL: https://issues.apache.org/jira/browse/SPARK-31064 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.5 Reporter: DB Tsai Parquet's *org.apache.parquet.filter2.predicate.FilterApi* uses *dots* as separators to split a column name into the multiple parts of a nested field. The drawback is that this causes issues when a field name itself contains a *dot*. The new APIs to be added will take an array of strings directly for the parts of a nested field, so there is no confusion from using *dot* as a separator. The intent is to move this code back to the Parquet community. See [PARQUET-1809] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
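The ambiguity described above is easy to demonstrate; here is a small sketch in plain Python standing in for the Java FilterApi, which is not shown here:

```python
# With a single dotted string, a flat column named "a.b" and the nested
# field b inside group a collapse to the same thing after splitting:
dotted = "a.b"
assert dotted.split(".") == ["a", "b"]   # nested-field reading wins

# A multi-part identifier keeps the structure explicit, so the two
# cases stay distinguishable:
flat_column = ["a.b"]       # one part whose name contains a dot
nested_field = ["a", "b"]   # two parts: field b under group a
assert flat_column != nested_field
```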
[jira] [Created] (SPARK-31063) xml-apis is missing after xercesImpl bumped up to 2.12.0
Kent Yao created SPARK-31063: Summary: xml-apis is missing after xercesImpl bumped up to 2.12.0 Key: SPARK-31063 URL: https://issues.apache.org/jira/browse/SPARK-31063 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0, 3.1.0 Reporter: Kent Yao {code:java} ✘ kentyao@hulk ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200305 bin/spark-sql 20/03/06 11:05:58 WARN Utils: Your hostname, hulk.local resolves to a loopback address: 127.0.0.1; using 10.242.189.214 instead (on interface en0) 20/03/06 11:05:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Exception in thread "main" java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:756) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) at java.net.URLClassLoader.access$100(URLClassLoader.java:74) at java.net.URLClassLoader$1.run(URLClassLoader.java:369) at java.net.URLClassLoader$1.run(URLClassLoader.java:363) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:362) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) at org.apache.xerces.parsers.AbstractDOMParser.startDocument(Unknown Source) at org.apache.xerces.xinclude.XIncludeHandler.startDocument(Unknown Source) at org.apache.xerces.impl.dtd.XMLDTDValidator.startDocument(Unknown Source) at org.apache.xerces.impl.XMLDocumentScannerImpl.startEntity(Unknown Source) at org.apache.xerces.impl.XMLVersionDetector.startDocumentParsing(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at 
org.apache.xerces.parsers.XMLParser.parse(Unknown Source) at org.apache.xerces.parsers.DOMParser.parse(Unknown Source) at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source) at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150) at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2482) at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2470) at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2541) at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2494) at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2407) at org.apache.hadoop.conf.Configuration.set(Configuration.java:1143) at org.apache.hadoop.conf.Configuration.set(Configuration.java:1115) at org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopHiveConfigurations(SparkHadoopUtil.scala:456) at org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:427) at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$2(SparkSubmit.scala:342) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:342) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at 
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ... 42 more {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master
[ https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31043. -- Fix Version/s: 3.0.0 Resolution: Fixed > Spark 3.0 built against hadoop2.7 can't start standalone master > --- > > Key: SPARK-31043 > URL: https://issues.apache.org/jira/browse/SPARK-31043 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Critical > Fix For: 3.0.0 > > > Trying to start a standalone master when building Spark branch 3.0 with > Hadoop 2.7 fails with: > > Exception in thread "main" java.lang.NoClassDefFoundError: > org/w3c/dom/ElementTraversal > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:757) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at > java.net.URLClassLoader.defineClass(URLClassLoader.java:468) > at > java.net.URLClassLoader.access$100(URLClassLoader.java:74) > at > java.net.URLClassLoader$1.run(URLClassLoader.java:369) > ... > Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal > at > java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:419) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) > ... 42 more -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master
[ https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052687#comment-17052687 ] Hyukjin Kwon commented on SPARK-31043: -- I believe it was fixed as of https://github.com/apache/spark/pull/27808 > Spark 3.0 built against hadoop2.7 can't start standalone master > --- > > Key: SPARK-31043 > URL: https://issues.apache.org/jira/browse/SPARK-31043 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Critical > > Trying to start a standalone master when building Spark branch 3.0 with > Hadoop 2.7 fails with: > > Exception in thread "main" java.lang.NoClassDefFoundError: > org/w3c/dom/ElementTraversal > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:757) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at > java.net.URLClassLoader.defineClass(URLClassLoader.java:468) > at > java.net.URLClassLoader.access$100(URLClassLoader.java:74) > at > java.net.URLClassLoader$1.run(URLClassLoader.java:369) > ... > Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal > at > java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:419) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) > ... 42 more -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30914) Add version information to the configuration of UI
[ https://issues.apache.org/jira/browse/SPARK-30914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30914: Assignee: jiaan.geng > Add version information to the configuration of UI > -- > > Key: SPARK-30914 > URL: https://issues.apache.org/jira/browse/SPARK-30914 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > core/src/main/scala/org/apache/spark/internal/config/UI.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30914) Add version information to the configuration of UI
[ https://issues.apache.org/jira/browse/SPARK-30914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30914. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27806 [https://github.com/apache/spark/pull/27806] > Add version information to the configuration of UI > -- > > Key: SPARK-30914 > URL: https://issues.apache.org/jira/browse/SPARK-30914 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > core/src/main/scala/org/apache/spark/internal/config/UI.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31036) Use stringArgs in Expression.toString to respect hidden parameters
[ https://issues.apache.org/jira/browse/SPARK-31036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31036. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27788 [https://github.com/apache/spark/pull/27788] > Use stringArgs in Expression.toString to respect hidden parameters > -- > > Key: SPARK-31036 > URL: https://issues.apache.org/jira/browse/SPARK-31036 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > Currently, the top of https://github.com/apache/spark/pull/27657, > {code} > val identify = udf((input: Seq[Int]) => input) > spark.range(10).select(identify(array("id"))).show() > {code} > shows hidden parameter `useStringTypeWhenEmpty`. > {code} > +-+ > |UDF(array(id, false))| > +-+ > | [0]| > | [1]| > ... > {code} > This is a general problem and we should respect hidden parameters. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31036) Use stringArgs in Expression.toString to respect hidden parameters
[ https://issues.apache.org/jira/browse/SPARK-31036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31036: Assignee: Hyukjin Kwon > Use stringArgs in Expression.toString to respect hidden parameters > -- > > Key: SPARK-31036 > URL: https://issues.apache.org/jira/browse/SPARK-31036 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > Currently, the top of https://github.com/apache/spark/pull/27657, > {code} > val identify = udf((input: Seq[Int]) => input) > spark.range(10).select(identify(array("id"))).show() > {code} > shows hidden parameter `useStringTypeWhenEmpty`. > {code} > +-+ > |UDF(array(id, false))| > +-+ > | [0]| > | [1]| > ... > {code} > This is a general problem and we should respect hidden parameters. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
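The fix routes Expression.toString through stringArgs so hidden parameters stay out of rendered plans. A toy, non-Spark model of that contract (class and flag names here are only illustrative):

```python
# Toy model of the SPARK-31036 contract (not Spark's classes): __str__ renders
# only what string_args() exposes, so the hidden flag never leaks into output.
class CreateArray:
    def __init__(self, children, use_string_type_when_empty=False):
        self.children = children
        # Hidden parameter: should not appear in the rendered expression.
        self.use_string_type_when_empty = use_string_type_when_empty

    def string_args(self):
        return self.children  # deliberately excludes the hidden flag

    def __str__(self):
        return "array({})".format(", ".join(self.string_args()))

print(CreateArray(["id"]))  # renders "array(id)", not "array(id, false)"
```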
[jira] [Resolved] (SPARK-30563) Regressions in Join benchmarks
[ https://issues.apache.org/jira/browse/SPARK-30563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30563. -- Fix Version/s: 3.0.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/27791 > Regressions in Join benchmarks > -- > > Key: SPARK-30563 > URL: https://issues.apache.org/jira/browse/SPARK-30563 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Regenerated benchmark results in the > https://github.com/apache/spark/pull/27078 shows many regressions in > JoinBenchmark. The benchmarked queries slowed down by up to 3 times, see > old results: > https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dL10 > new results: > https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dR10 > One of the difference in queries is using the `NoOp` datasource in new > queries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30930) ML, GraphX 3.0 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-30930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052641#comment-17052641 ] Huaxin Gao commented on SPARK-30930: cc [~srowen], [~podongfeng] Developer API Most developer API are the basic components for ML pipeline, such as Transformer, Model, Estimator, PipelineStage, Params and Attributes, I guess we don't need to unmark any of them? final class: org.apache.spark.ml.classification.OneVsRest org.apache.spark.ml.evaluation.RegressionEvaluator org.apache.spark.ml.feature.Binarizer org.apache.spark.ml.feature.Bucketizer org.apache.spark.ml.feature.ChiSqSelector org.apache.spark.ml.feature.IDF org.apache.spark.ml.feature.QuantileDiscretizer org.apache.spark.ml.feature.VectorSlicer org.apache.spark.ml.feature.Word2Vec org.apache.spark.ml.param.ParamMap org.apache.spark.ml.fpm.PrefixSpan Do we need to unmark any of these? > ML, GraphX 3.0 QA: API: Experimental, DeveloperApi, final, sealed audit > --- > > Key: SPARK-30930 > URL: https://issues.apache.org/jira/browse/SPARK-30930 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Critical > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31062) Improve Spark Decommissioning K8s test reliability
Holden Karau created SPARK-31062: Summary: Improve Spark Decommissioning K8s test reliability Key: SPARK-31062 URL: https://issues.apache.org/jira/browse/SPARK-31062 Project: Spark Issue Type: Improvement Components: Kubernetes, Tests Affects Versions: 3.1.0 Reporter: Holden Karau Assignee: Holden Karau The test currently flakes more than the other Kubernetes tests. We can remove some of the timing that is likely to be a source of flakiness. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31061) Impossible to change the provider of a table in the HiveMetaStore
Burak Yavuz created SPARK-31061: --- Summary: Impossible to change the provider of a table in the HiveMetaStore Key: SPARK-31061 URL: https://issues.apache.org/jira/browse/SPARK-31061 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz Currently, it's impossible to alter the datasource of a table in the HiveMetaStore by using alterTable, as the HiveExternalCatalog doesn't change the provider table property during an alterTable command. This is required to support changing table formats when using commands like REPLACE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31058) Consolidate the implementation of quoteIfNeeded
[ https://issues.apache.org/jira/browse/SPARK-31058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-31058. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27814 [https://github.com/apache/spark/pull/27814] > Consolidate the implementation of quoteIfNeeded > --- > > Key: SPARK-31058 > URL: https://issues.apache.org/jira/browse/SPARK-31058 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5 >Reporter: DB Tsai >Priority: Major > Fix For: 3.0.0 > > > There are two implementations of quoteIfNeeded: one is in > *org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quote* and the > other is in *OrcFiltersBase.quoteAttributeNameIfNeeded*. This PR will > consolidate them into one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
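Both helpers implement the same policy: leave simple identifiers alone, otherwise wrap the part in backticks and double any embedded backtick. A hedged Python sketch of that shared behavior (an illustration of the policy, not the Scala code being merged):

```python
import re

def quote_if_needed(part: str) -> str:
    """Backtick-quote an identifier part unless it is a simple, non-numeric
    alphanumeric name. Embedded backticks are doubled, the usual SQL escape."""
    if re.fullmatch(r"[A-Za-z0-9_]+", part) and not re.fullmatch(r"\d+", part):
        return part
    return "`" + part.replace("`", "``") + "`"
```

Consolidating to one definition means the ORC filter path and the catalog v2 path cannot drift apart on edge cases like embedded backticks.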
[jira] [Assigned] (SPARK-31058) Consolidate the implementation of quoteIfNeeded
[ https://issues.apache.org/jira/browse/SPARK-31058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-31058: --- Assignee: DB Tsai > Consolidate the implementation of quoteIfNeeded > --- > > Key: SPARK-31058 > URL: https://issues.apache.org/jira/browse/SPARK-31058 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > > There are two implementations of quoteIfNeeded: one is in > *org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quote* and the > other is in > *OrcFiltersBase.quoteAttributeNameIfNeeded*. This PR will > consolidate them into one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30930) ML, GraphX 3.0 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-30930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052606#comment-17052606 ] Huaxin Gao commented on SPARK-30930: Sealed and Experimental are as following. I don't think we need to do anything about these. sealed: org.apache.spark.ml.attribute.events.MLEvent org.apache.spark.ml.attribute.Attribute org.apache.spark.ml.attribute.AttributeType org.apache.spark.ml.classification.LogisticRegressionTrainingSummary org.apache.spark.ml.classification.BinaryLogisticRegressionSummary org.apache.spark.ml.classification.LogisticRegressionTrainingSummary org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary org.apache.spark.ml.feature.Term org.apache.spark.ml.feature.InteractableTerm org.apache.spark.ml.optim.WeightedLeastSquares.Solver org.apache.spark.ml.optim.NormalEquationSolver org.apache.spark.ml.tree.Node org.apache.spark.ml.tree.Split org.apache.spark.ml.util.BaseReadWrite org.apache.spark.ml.stat.SummaryBuilder org.apache.spark.ml.stat.SummaryBuilderImpl.Metric org.apache.spark.ml.stat.SummaryBuilderImpl.ComputeMetric Experimental classes: org.apache.spark.ml.evaluation.MultilabelClassificationEvaluator org.apache.spark.ml.evaluation.RankingEvaluator org.apache.spark.ml.MLEvent // Experimental marked in @note This is experimental and unstable. Do we need to use @Experimental? Experimental methods: org.apache.spark.ml.feature.LSH.approxNearestNeighbors > ML, GraphX 3.0 QA: API: Experimental, DeveloperApi, final, sealed audit > --- > > Key: SPARK-30930 > URL: https://issues.apache.org/jira/browse/SPARK-30930 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Critical > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. 
> We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31059) Spark's SQL "group by" local processing operator is broken.
[ https://issues.apache.org/jira/browse/SPARK-31059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michail Giannakopoulos resolved SPARK-31059. Resolution: Invalid Invalid guys, there is no duplicate "Product line - Order method type" key when presenting the results. > Spark's SQL "group by" local processing operator is broken. > --- > > Key: SPARK-31059 > URL: https://issues.apache.org/jira/browse/SPARK-31059 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3, 2.4.5 > Environment: Windows 10. >Reporter: Michail Giannakopoulos >Priority: Blocker > Attachments: SampleFile_GOSales.csv > > > When applying "GROUP BY" processing operator (without an "ORDER BY" clause), > I expect to see all the grouping columns being grouped together to the same > buckets. However, this is not the case. > Steps to reproduce: > 1. Start spark-shell as follows: > bin\spark-shell.cmd --master local[4] --conf > spark.sql.catalogImplementation=in-memory > 2. Load the attached csv file: > val gosales = spark.read.format("csv").option("header", > "true").option("inferSchema", > "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv") > 3. Create a temp view: > gosales.createOrReplaceTempView("gosales") > 4. 
Execute the following sql statement: > spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM > `gosales` GROUP BY `Product line`, `Order method type`").show() > Output: > +---+--++ > |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))| > +---+--++ > |Golf Equipment|E-mail|92.25| > |Camping Equipment|Mail|0.0| > |Camping Equipment|Fax|null| > |Golf Equipment|Telephone|123.0| > |Camping Equipment|Special|null| > |Outdoor Protection|Telephone|34218.19| > |Mountaineering Eq...|Mail|0.0| > |Camping Equipment|Web|32469.03| > |Personal Accessories|Fax|3318.7| > |Golf Equipment|Sales visit|143.5| > |Mountaineering Eq...|Telephone|null| > |Mountaineering Eq...|E-mail|null| > |Outdoor Protection|Sales visit|20522.42| > |Outdoor Protection|Fax|5857.54| > |Personal Accessories|E-mail|26679.6403| > |Mountaineering Eq...|Fax|null| > |Outdoor Protection|Web|340836.853| > |Golf Equipment|Special|0.0| > |Outdoor Protection|E-mail|28505.93| > |Golf Equipment|Web|3034.0| > +---+--++ > Expected output: > +---+--++ > |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))| > +---+--++ > |Golf Equipment|E-mail|92.25| > |Golf Equipment|Fax|null| > |Golf Equipment|Mail|0.0| > |Golf Equipment|Sales visit|143.5| > |Golf Equipment|Special|0.0| > |Golf Equipment|Telephone|123.0| > |Golf Equipment|Web|3034.0| > |Camping Equipment|E-mail|1303.3| > |Camping Equipment|Fax|null| > |Camping Equipment|Sales visit|4754.87| > |Camping Equipment|Mail|0.0| > |Camping Equipment|Special|null| > |Camping Equipment|Telephone|5169.65| > |Camping Equipment|Web|32469.03| > |Mountaineering Eq...|E-mail|null| > |Mountaineering Eq...|Fax|null| > |Mountaineering Eq...|Mail|0.0| > |Mountaineering Eq...|Special|null| > |Mountaineering Eq...|Sales visit|null| > |Mountaineering Eq...|Telephone|null| > +---+--++ > Notice how in the expected output all the grouping columns should be bucketed > together without necessarily being in order, which is not the case with the > output that 
spark produces. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
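The Invalid resolution above comes down to what GROUP BY actually promises: one output row per distinct grouping key, with no ordering guarantee. A tiny model of hash aggregation (the rows below are invented for illustration) makes the point:

```python
from collections import defaultdict

# Hash-aggregate a few made-up (Product line, Order method type, Revenue) rows.
rows = [
    ("Golf Equipment", "E-mail", 92.25),
    ("Camping Equipment", "Mail", 0.0),
    ("Golf Equipment", "Web", 3034.0),
]
sums = defaultdict(float)
for line, method, revenue in rows:
    sums[(line, method)] += revenue

# Each grouping key appears exactly once; the iteration order of the hash
# table is arbitrary, which is the "unbucketed" ordering the reporter saw.
assert len(sums) == 3
```

Getting rows bucketed by product line requires an explicit ORDER BY, which the reported query omitted.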
[jira] [Updated] (SPARK-25722) Support a backtick character in column names
[ https://issues.apache.org/jira/browse/SPARK-25722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25722: -- Affects Version/s: (was: 3.0.0) 3.1.0 > Support a backtick character in column names > > > Key: SPARK-25722 > URL: https://issues.apache.org/jira/browse/SPARK-25722 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Among built-in data sources, `avro` and `orc` doesn't allow `backtick` in > column names. We had better be consistent if possible. > * Option 1: Support a backtick character > * Option 2: Disallow a backtick character (This may be considered as a > regression at TEXT/CSV/JSON/Parquet) > So, Option 1 is better. > *TEXT*, *CSV*, *JSON*, *PARQUET* > {code:java} > Seq("text", "csv", "json", "parquet").foreach { format => > Seq("1").toDF("`").write.mode("overwrite").format(format).save("/tmp/t") > }{code} > *AVRO* > {code:java} > scala> > Seq("1").toDF("`").write.mode("overwrite").format("avro").save("/tmp/t") > org.apache.avro.SchemaParseException: Illegal initial character: `{code} > *ORC (native)* > {code:java} > scala> Seq("1").toDF("`").write.mode("overwrite").format("orc").save("/tmp/t") > java.lang.IllegalArgumentException: Unmatched quote at > 'struct<^```:string>'{code} > *ORC (hive)* > {code:java} > scala> Seq("1").toDF("`").write.mode("overwrite").format("orc").save("/tmp/t") > java.lang.IllegalArgumentException: Error: name expected at the position 7 of > 'struct<`:string>' but '`' is found.{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31060) Handle column names containing `dots` in data source `Filter`
DB Tsai created SPARK-31060: --- Summary: Handle column names containing `dots` in data source `Filter` Key: SPARK-31060 URL: https://issues.apache.org/jira/browse/SPARK-31060 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.5 Reporter: DB Tsai -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
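What makes dots in Filter column names tricky is ambiguity: `a.b` can mean field `b` of struct `a` or a single column literally named `a.b`, and backtick quoting is the usual disambiguator. A rough sketch of quote-aware splitting (illustrative only; the function name and behavior are assumptions, not Spark's parser):

```python
def parse_column_path(name: str) -> list:
    """Split a dotted column reference into parts, honoring backtick quoting
    so "`a.b`.c" becomes ["a.b", "c"]. Doubled backticks escape a backtick."""
    parts, buf, i, quoted = [], [], 0, False
    while i < len(name):
        ch = name[i]
        if ch == "`":
            if quoted and i + 1 < len(name) and name[i + 1] == "`":
                buf.append("`")  # escaped backtick inside a quoted part
                i += 1
            else:
                quoted = not quoted
        elif ch == "." and not quoted:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    parts.append("".join(buf))
    return parts
```

Without some rule like this, a pushed-down Filter on a column named `a.b` can silently be misread as a nested-field reference.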
[jira] [Comment Edited] (SPARK-30961) Arrow enabled: to_pandas with date column fails
[ https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052464#comment-17052464 ] Kevin Appel edited comment on SPARK-30961 at 3/5/20, 10:19 PM: --- (python 3.6, pyarrow 0.8.0, pandas 0.21.0) or (python 3.7, pyarrow 0.11.1, pandas 0.24.1) are combinations I found that are still working correctly for Date in both Spark 2.3 and Spark 2.4; in addition, all the examples listed in the pandas UDF Spark documentation also work with this setup was (Author: kevinappel): python 3.6, pyarrow 0.8.0, pandas 0.21.0 is a combination I found that is still working correctly for Date in both Spark 2.3 and Spark 2.4; in addition, all the examples listed in the pandas UDF Spark documentation also work with this setup > Arrow enabled: to_pandas with date column fails > --- > > Key: SPARK-30961 > URL: https://issues.apache.org/jira/browse/SPARK-30961 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5 > Environment: Apache Spark 2.4.5 >Reporter: Nicolas Renkamp >Priority: Major > Labels: ready-to-commit > > Hi, > there seems to be a bug in the Arrow-enabled to_pandas conversion from a Spark > dataframe to a pandas dataframe when the dataframe has a column of type > DateType. 
Here is a minimal example to reproduce the issue: > {code:java} > spark = SparkSession.builder.getOrCreate() > is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled") > print("Arrow optimization is enabled: " + is_arrow_enabled) > spark_df = spark.createDataFrame( > [['2019-12-06']], 'created_at: string') \ > .withColumn('created_at', F.to_date('created_at')) > # works > spark_df.toPandas() > spark.conf.set("spark.sql.execution.arrow.enabled", 'true') > is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled") > print("Arrow optimization is enabled: " + is_arrow_enabled) > # raises AttributeError: Can only use .dt accessor with datetimelike values > # series is still of type object, .dt does not exist > spark_df.toPandas(){code} > A fix would be to modify the _check_series_convert_date function in > pyspark.sql.types to: > {code:java} > def _check_series_convert_date(series, data_type): > """ > Cast the series to datetime.date if it's a date type, otherwise returns > the original series. > > :param series: pandas.Series > :param data_type: a Spark data type for the series > """ > from pyspark.sql.utils import require_minimum_pandas_version > require_minimum_pandas_version() > from pandas import to_datetime > if type(data_type) == DateType: > return to_datetime(series).dt.date > else: > return series > {code} > Let me know if I should prepare a Pull Request for the 2.4.5 branch. > I have not tested the behavior on the master branch. > > Thanks, > Nicolas -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31059) Spark's SQL "group by" local processing operator is broken.
[ https://issues.apache.org/jira/browse/SPARK-31059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michail Giannakopoulos updated SPARK-31059: --- Description: When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I expect to see all the grouping columns being grouped together to the same buckets. However, this is not the case. Steps to reproduce: 1. Start spark-shell as follows: bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory 2. Load the attached csv file: val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv") 3. Create a temp view: gosales.createOrReplaceTempView("gosales") 4. Execute the following sql statement: spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show() Output: +---+--++ |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))| +---+--++ |Golf Equipment|E-mail|92.25| |Camping Equipment|Mail|0.0| |Camping Equipment|Fax|null| |Golf Equipment|Telephone|123.0| |Camping Equipment|Special|null| |Outdoor Protection|Telephone|34218.19| |Mountaineering Eq...|Mail|0.0| |Camping Equipment|Web|32469.03| |Personal Accessories|Fax|3318.7| |Golf Equipment|Sales visit|143.5| |Mountaineering Eq...|Telephone|null| |Mountaineering Eq...|E-mail|null| |Outdoor Protection|Sales visit|20522.42| |Outdoor Protection|Fax|5857.54| |Personal Accessories|E-mail|26679.6403| |Mountaineering Eq...|Fax|null| |Outdoor Protection|Web|340836.853| |Golf Equipment|Special|0.0| |Outdoor Protection|E-mail|28505.93| |Golf Equipment|Web|3034.0| +---+--++ Expected output: +---+--++ |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))| +---+--++ |Golf Equipment|E-mail|92.25| |Golf Equipment|Fax|null| |Golf Equipment|Mail|0.0| |Golf Equipment|Sales visit|143.5| |Golf Equipment|Special|0.0| |Golf 
Equipment|Telephone|123.0| |Golf Equipment|Web|3034.0| |Camping Equipment|E-mail|1303.3| |Camping Equipment|Fax|null| |Camping Equipment|Sales visit|4754.87| |Camping Equipment|Mail|0.0| |Camping Equipment|Special|null| |Camping Equipment|Telephone|5169.65| |Camping Equipment|Web|32469.03| |Mountaineering Eq...|E-mail|null| |Mountaineering Eq...|Fax|null| |Mountaineering Eq...|Mail|0.0| |Mountaineering Eq...|Special|null| |Mountaineering Eq...|Sales visit|null| |Mountaineering Eq...|Telephone|null| +---+--++ Notice how in the expected output all the grouping columns should be bucketed together without necessarily being in order, which is not the case with the output that spark produces. was: When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I expect to see all the grouping columns being grouped together to the same buckets. However, this is not the case. Steps to reproduce: 1. Start spark-shell as follows: bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory 2. Load the attached csv file: val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv") 3. Create a temp view: gosales.createOrReplaceTempView("gosales") 4. 
Execute the following sql statement: spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show() Output: +--+---++ |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))| +--+---++ |Golf Equipment|E-mail|92.25| |Camping Equipment|Mail|0.0| |Camping Equipment|Fax|null| |Golf Equipment|Telephone|123.0| |Camping Equipment|Special|null| |Outdoor Protection|Telephone|34218.19| |Mountaineering Eq...|Mail|0.0| |Camping Equipment|Web|32469.03| |Personal Accessories|Fax|3318.7| |Golf Equipment|Sales visit|143.5| |Mountaineering Eq...|Telephone|null| |Mountaineering Eq...|E-mail|null| |Outdoor Protection|Sales visit|20522.42| |Outdoor Protection|Fax|5857.54| |Personal Accessories|E-mail|26679.6403| |Mountaineering Eq...|Fax|null| |Outdoor Protection|Web|340836.853| |Golf Equipment|Special|0.0| |Outdoor Protection|E-mail|28505.93| |Golf Equipment|Web|3034.0| +--+---++ Expected output: +--+---++---
[jira] [Updated] (SPARK-31059) Spark's SQL "group by" local processing operator is broken.
[ https://issues.apache.org/jira/browse/SPARK-31059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michail Giannakopoulos updated SPARK-31059: --- Description: When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I expect to see all the grouping columns being grouped together to the same buckets. However, this is not the case. Steps to reproduce: 1. Start spark-shell as follows: bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory 2. Load the attached csv file: val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv") 3. Create a temp view: gosales.createOrReplaceTempView("gosales") 4. Execute the following sql statement: spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show() Output: +--+---++ |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))| +--+---++ |Golf Equipment|E-mail|92.25| |Camping Equipment|Mail|0.0| |Camping Equipment|Fax|null| |Golf Equipment|Telephone|123.0| |Camping Equipment|Special|null| |Outdoor Protection|Telephone|34218.19| |Mountaineering Eq...|Mail|0.0| |Camping Equipment|Web|32469.03| |Personal Accessories|Fax|3318.7| |Golf Equipment|Sales visit|143.5| |Mountaineering Eq...|Telephone|null| |Mountaineering Eq...|E-mail|null| |Outdoor Protection|Sales visit|20522.42| |Outdoor Protection|Fax|5857.54| |Personal Accessories|E-mail|26679.6403| |Mountaineering Eq...|Fax|null| |Outdoor Protection|Web|340836.853| |Golf Equipment|Special|0.0| |Outdoor Protection|E-mail|28505.93| |Golf Equipment|Web|3034.0| +--+---++ Expected output: +--+---++ |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))| +--+---++ |Golf Equipment|E-mail|92.25| |Golf Equipment|Fax|null| |Golf Equipment|Mail|0.0| |Golf Equipment|Sales visit|143.5| |Golf Equipment|Special|0.0| |Golf 
Equipment|Telephone|123.0| |Golf Equipment|Web|3034.0| |Camping Equipment|E-mail|1303.3| |Camping Equipment|Fax|null| |Camping Equipment|Sales visit|4754.87| |Camping Equipment|Mail|0.0| |Camping Equipment|Special|null| |Camping Equipment|Telephone|5169.65| |Camping Equipment|Web|32469.03| |Mountaineering Eq...|E-mail|null| |Mountaineering Eq...|Fax|null| |Mountaineering Eq...|Mail|0.0| |Mountaineering Eq...|Special|null| |Mountaineering Eq...|Sales visit|null| |Mountaineering Eq...|Telephone|null| +--+---++ Notice how in the expected output all the grouping columns should be bucketed together without necessarily being in order, which is not the case with output that spark produces. was: When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I expect to see all the grouping columns being grouped together to the same buckets. However, this is not the case. Steps to reproduce: 1. Start spark-shell as follows: bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory 2. Load the attached csv file: val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv") 3. Create a temp view: gosales.createOrReplaceTempView("gosales") 4. 
Execute the following sql statement: spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show() Output: +-+++ |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))| +-+++ |Golf Equipment|E-mail|92.25| |Camping Equipment|Mail|0.0| |Camping Equipment|Fax|null| |Golf Equipment|Telephone|123.0| |Camping Equipment|Special|null| |Outdoor Protection|Telephone|34218.19| |Mountaineering Eq...|Mail|0.0| |Camping Equipment|Web|32469.03| |Personal Accessories|Fax|3318.7| |Golf Equipment|Sales visit|143.5| |Mountaineering Eq...|Telephone|null| |Mountaineering Eq...|E-mail|null| |Outdoor Protection|Sales visit|20522.42| |Outdoor Protection|Fax|5857.54| |Personal Accessories|E-mail|26679.6403| |Mountaineering Eq...|Fax|null| |Outdoor Protection|Web|340836.853| |Golf Equipment|Special|0.0| |Outdoor Protection|E-mail|28505.93| |Golf Equipment|Web|3034.0| +-+++ Expected output: +-+++---
[jira] [Updated] (SPARK-31059) Spark's SQL "group by" local processing operator is broken.
[ https://issues.apache.org/jira/browse/SPARK-31059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michail Giannakopoulos updated SPARK-31059: --- Attachment: SampleFile_GOSales.csv > Spark's SQL "group by" local processing operator is broken. > --- > > Key: SPARK-31059 > URL: https://issues.apache.org/jira/browse/SPARK-31059 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3, 2.4.5 > Environment: Windows 10. >Reporter: Michail Giannakopoulos >Priority: Blocker > Attachments: SampleFile_GOSales.csv > > > When applying "GROUP BY" processing operator (without an "ORDER BY" clause), > I expect to see all the grouping columns being grouped together to the same > buckets. However, this is not the case. > Steps to reproduce: > 1. Start spark-shell as follows: > bin\spark-shell.cmd --master local[4] --conf > spark.sql.catalogImplementation=in-memory > 2. Load the attached csv file: > val gosales = spark.read.format("csv").option("header", > "true").option("inferSchema", > "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv") > 3. Create a temp view: > gosales.createOrReplaceTempView("gosales") > 4. 
Execute the following sql statement: > spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM > `gosales` GROUP BY `Product line`, `Order method type`").show() > Output: > +-+++ > |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))| > +-+++ > |Golf Equipment|E-mail|92.25| > |Camping Equipment|Mail|0.0| > |Camping Equipment|Fax|null| > |Golf Equipment|Telephone|123.0| > |Camping Equipment|Special|null| > |Outdoor Protection|Telephone|34218.19| > |Mountaineering Eq...|Mail|0.0| > |Camping Equipment|Web|32469.03| > |Personal Accessories|Fax|3318.7| > |Golf Equipment|Sales visit|143.5| > |Mountaineering Eq...|Telephone|null| > |Mountaineering Eq...|E-mail|null| > |Outdoor Protection|Sales visit|20522.42| > |Outdoor Protection|Fax|5857.54| > |Personal Accessories|E-mail|26679.6403| > |Mountaineering Eq...|Fax|null| > |Outdoor Protection|Web|340836.853| > |Golf Equipment|Special|0.0| > |Outdoor Protection|E-mail|28505.93| > |Golf Equipment|Web|3034.0| > +-+++ > Expected output: > +-+++ > |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))| > +-+++ > |Golf Equipment|E-mail|92.25| > |Golf Equipment|Fax|null| > |Golf Equipment|Mail|0.0| > |Golf Equipment|Sales visit|143.5| > |Golf Equipment|Special|0.0| > |Golf Equipment|Telephone|123.0| > |Golf Equipment|Web|3034.0| > |Camping Equipment|E-mail|1303.3| > |Camping Equipment|Fax|null| > |Camping Equipment|Sales visit|4754.87| > |Camping Equipment|Mail|0.0| > |Camping Equipment|Special|null| > |Camping Equipment|Telephone|5169.65| > |Camping Equipment|Web|32469.03| > |Mountaineering Eq...|E-mail|null| > |Mountaineering Eq...|Fax|null| > |Mountaineering Eq...|Mail|0.0| > |Mountaineering Eq...|Special|null| > |Mountaineering Eq...|Sales visit|null| > |Mountaineering Eq...|Telephone|null| > +-+++ > Notice how in the expected output all the grouping columns should be bucketed > together without necessary being in order, which is not the case with output > that spark produces. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31059) Spark's SQL "group by" local processing operator is broken.
[ https://issues.apache.org/jira/browse/SPARK-31059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michail Giannakopoulos updated SPARK-31059: --- Description: When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I expect to see all the grouping columns being grouped together to the same buckets. However, this is not the case. Steps to reproduce: 1. Start spark-shell as follows: bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory 2. Load the attached csv file: val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv") 3. Create a temp view: gosales.createOrReplaceTempView("gosales") 4. Execute the following sql statement: spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show() Output: +-+++ |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))| +-+++ |Golf Equipment|E-mail|92.25| |Camping Equipment|Mail|0.0| |Camping Equipment|Fax|null| |Golf Equipment|Telephone|123.0| |Camping Equipment|Special|null| |Outdoor Protection|Telephone|34218.19| |Mountaineering Eq...|Mail|0.0| |Camping Equipment|Web|32469.03| |Personal Accessories|Fax|3318.7| |Golf Equipment|Sales visit|143.5| |Mountaineering Eq...|Telephone|null| |Mountaineering Eq...|E-mail|null| |Outdoor Protection|Sales visit|20522.42| |Outdoor Protection|Fax|5857.54| |Personal Accessories|E-mail|26679.6403| |Mountaineering Eq...|Fax|null| |Outdoor Protection|Web|340836.853| |Golf Equipment|Special|0.0| |Outdoor Protection|E-mail|28505.93| |Golf Equipment|Web|3034.0| +-+++ Expected output: +-+++ |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))| +-+++ |Golf Equipment|E-mail|92.25| |Golf Equipment|Fax|null| |Golf Equipment|Mail|0.0| |Golf Equipment|Sales visit|143.5| |Golf Equipment|Special|0.0| |Golf Equipment|Telephone|123.0| |Golf 
Equipment|Web|3034.0| |Camping Equipment|E-mail|1303.3| |Camping Equipment|Fax|null| |Camping Equipment|Sales visit|4754.87| |Camping Equipment|Mail|0.0| |Camping Equipment|Special|null| |Camping Equipment|Telephone|5169.65| |Camping Equipment|Web|32469.03| |Mountaineering Eq...|E-mail|null| |Mountaineering Eq...|Fax|null| |Mountaineering Eq...|Mail|0.0| |Mountaineering Eq...|Special|null| |Mountaineering Eq...|Sales visit|null| |Mountaineering Eq...|Telephone|null| +-+++ Notice how in the expected output all the grouping columns should be bucketed together without necessary being in order, which is not the case with output that spark produces. was: When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I expect to see all the grouping columns being grouped together to the same buckets. However, this is not the case. Steps to reproduce: 1. Start spark-shell as follows: bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory 2. Load the attached csv file: val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv") 3. Create a temp view: gosales.createOrReplaceTempView("gosales") 4. 
Execute the following sql statement: spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show() Output: ++-++ | Product line|Order method type|sum(CAST(Revenue AS DOUBLE))| ++-++ | Golf Equipment| E-mail| 92.25| | Camping Equipment| Mail| 0.0| | Camping Equipment| Fax| null| | Golf Equipment| Telephone| 123.0| | Camping Equipment| Special| null| | Outdoor Protection| Telephone| 34218.19| |Mountaineering Eq...| Mail| 0.0| | Camping Equipment| Web| 32469.03| |Personal Accessories| Fax| 3318.7| | Golf Equipment| Sales visit| 143.5| |Mountaineering Eq...| Telephone| null| |Mountaineering Eq...| E-mail| null| | Outdoor Protection| Sales visit| 20522.42| | Outdoor Protection| Fax| 5857.54| |Personal Accessories| E-mail| 26679.6403| |Mountaineering Eq...| Fax| null| | Outdoor Protection| Web| 340836.853| | Golf Equipment| Special| 0.0| | Outdoor Protection| E-mail| 28505.93| | Golf Equipment| Web| 3034.0| ++-++ Expected output: +--
[jira] [Created] (SPARK-31059) Spark's SQL "group by" local processing operator is broken.
Michail Giannakopoulos created SPARK-31059: -- Summary: Spark's SQL "group by" local processing operator is broken. Key: SPARK-31059 URL: https://issues.apache.org/jira/browse/SPARK-31059 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5, 2.4.3 Environment: Windows 10. Reporter: Michail Giannakopoulos When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I expect to see all the grouping columns being grouped together to the same buckets. However, this is not the case. Steps to reproduce: 1. Start spark-shell as follows: bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory 2. Load the attached csv file: val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv") 3. Create a temp view: gosales.createOrReplaceTempView("gosales") 4. Execute the following sql statement: spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show() Output: ++-++ | Product line|Order method type|sum(CAST(Revenue AS DOUBLE))| ++-++ | Golf Equipment| E-mail| 92.25| | Camping Equipment| Mail| 0.0| | Camping Equipment| Fax| null| | Golf Equipment| Telephone| 123.0| | Camping Equipment| Special| null| | Outdoor Protection| Telephone| 34218.19| |Mountaineering Eq...| Mail| 0.0| | Camping Equipment| Web| 32469.03| |Personal Accessories| Fax| 3318.7| | Golf Equipment| Sales visit| 143.5| |Mountaineering Eq...| Telephone| null| |Mountaineering Eq...| E-mail| null| | Outdoor Protection| Sales visit| 20522.42| | Outdoor Protection| Fax| 5857.54| |Personal Accessories| E-mail| 26679.6403| |Mountaineering Eq...| Fax| null| | Outdoor Protection| Web| 340836.853| | Golf Equipment| Special| 0.0| | Outdoor Protection| E-mail| 28505.93| | Golf Equipment| Web| 3034.0| ++-++ Expected output: ++-++ | Product line|Order method type|sum(CAST(Revenue AS 
DOUBLE))| ++-++ | Golf Equipment| E-mail| 92.25| | Golf Equipment| Fax| null| | Golf Equipment| Mail| 0.0| | Golf Equipment| Sales visit| 143.5| | Golf Equipment| Special| 0.0| | Golf Equipment| Telephone| 123.0| | Golf Equipment| Web| 3034.0| | Camping Equipment| E-mail| 1303.3| | Camping Equipment| Fax| null| | Camping Equipment| Sales visit| 4754.87| | Camping Equipment| Mail| 0.0| | Camping Equipment| Special| null| | Camping Equipment| Telephone| 5169.65| | Camping Equipment| Web| 32469.03| |Mountaineering Eq...| E-mail| null| |Mountaineering Eq...| Fax| null| |Mountaineering Eq...| Mail| 0.0| |Mountaineering Eq...| Special| null| |Mountaineering Eq...| Sales visit| null| |Mountaineering Eq...| Telephone| null| ++-++ Notice how all the grouping columns should be bucketed together without being in order. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
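The report above is worth splitting into correctness and ordering: hash-based aggregation guarantees that every row with the same key lands in the same bucket (so each group's sum is correct), but it makes no guarantee about the order in which groups are emitted. A minimal Python sketch of hash grouping (illustrative only, not Spark code; the sample rows are made up) shows the sums stay correct regardless of output order:

```python
from collections import defaultdict

# Toy rows: (product_line, order_method, revenue). Hypothetical sample data.
rows = [
    ("Golf Equipment", "E-mail", 92.25),
    ("Camping Equipment", "Web", 32469.03),
    ("Golf Equipment", "E-mail", 0.0),
    ("Golf Equipment", "Web", 3034.0),
]

# Each distinct key maps to exactly one bucket, so every row for a key is
# aggregated together even though iteration order over buckets is arbitrary.
buckets = defaultdict(float)
for product, method, revenue in rows:
    buckets[(product, method)] += revenue
```

Sorting is only guaranteed by an explicit ORDER BY; without one, any emission order of these three buckets is a valid GROUP BY result.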
[jira] [Commented] (SPARK-20901) Feature parity for ORC with Parquet
[ https://issues.apache.org/jira/browse/SPARK-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052476#comment-17052476 ] Felix Kizhakkel Jose commented on SPARK-20901: -- Thank you [~dongjoon]. > Feature parity for ORC with Parquet > --- > > Key: SPARK-20901 > URL: https://issues.apache.org/jira/browse/SPARK-20901 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to track the feature parity for ORC with Parquet. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20901) Feature parity for ORC with Parquet
[ https://issues.apache.org/jira/browse/SPARK-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052473#comment-17052473 ] Dongjoon Hyun commented on SPARK-20901: --- Hi, [~FelixKJose]. The Apache Spark community is trying to provide a seamless user experience, and this issue aims to track those kinds of differences. Feel free to link more issues here if you want. Basically, new features land at different speeds in Apache Parquet and ORC. For example, ORC bloom filters are already used in Apache Spark, but Parquet bloom filters are not applicable yet. For ZStandard support, it's the opposite. Please note that Apache Spark also has not yet consumed the latest versions, Apache ORC 1.6.2 and Apache Parquet 1.11.0. The difference table changes from time to time, so you should evaluate both formats against your own use cases. > Feature parity for ORC with Parquet > --- > > Key: SPARK-20901 > URL: https://issues.apache.org/jira/browse/SPARK-20901 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to track the feature parity for ORC with Parquet. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30961) Arrow enabled: to_pandas with date column fails
[ https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052464#comment-17052464 ] Kevin Appel commented on SPARK-30961: - Python 3.6, pyarrow 0.8.0, pandas 0.21.0 is a combination I found that still works correctly for Date in both Spark 2.3 and Spark 2.4; in addition, all the examples listed in the pandas UDF Spark documentation also work with this setup. > Arrow enabled: to_pandas with date column fails > --- > > Key: SPARK-30961 > URL: https://issues.apache.org/jira/browse/SPARK-30961 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5 > Environment: Apache Spark 2.4.5 >Reporter: Nicolas Renkamp >Priority: Major > Labels: ready-to-commit > > Hi, > there seems to be a bug in the Arrow-enabled to_pandas conversion from Spark > dataframe to pandas dataframe when the dataframe has a column of type > DateType. Here is a minimal example to reproduce the issue: > {code:java} > spark = SparkSession.builder.getOrCreate() > is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled") > print("Arrow optimization is enabled: " + is_arrow_enabled) > spark_df = spark.createDataFrame( > [['2019-12-06']], 'created_at: string') \ > .withColumn('created_at', F.to_date('created_at')) > # works > spark_df.toPandas() > spark.conf.set("spark.sql.execution.arrow.enabled", 'true') > is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled") > print("Arrow optimization is enabled: " + is_arrow_enabled) > # raises AttributeError: Can only use .dt accessor with datetimelike values > # series is still of type object, .dt does not exist > spark_df.toPandas(){code} > A fix would be to modify the _check_series_convert_date function in > pyspark.sql.types to: > {code:java} > def _check_series_convert_date(series, data_type): > """ > Cast the series to datetime.date if it's a date type, otherwise returns > the original series. > :param series: pandas.Series > :param data_type: a Spark data type for the series > """ > from pyspark.sql.utils import require_minimum_pandas_version > require_minimum_pandas_version() > from pandas import to_datetime > if type(data_type) == DateType: > return to_datetime(series).dt.date > else: > return series > {code} > Let me know if I should prepare a Pull Request for the 2.4.5 branch. > I have not tested the behavior on master branch. > > Thanks, > Nicolas -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
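The fix proposed above converts a date-typed column to datetime.date values before anything touches the .dt accessor. The same idea can be sketched with only the standard library (a hypothetical helper for illustration, not the pyspark function or its pandas-based implementation):

```python
from datetime import date, datetime


def to_date_objects(values):
    """Parse ISO date strings into datetime.date objects, mirroring the
    intent of the proposed _check_series_convert_date fix: hand back real
    date objects instead of leaving an object-dtype series of strings."""
    return [datetime.strptime(v, "%Y-%m-%d").date() for v in values]


converted = to_date_objects(["2019-12-06"])
```

In the real fix, pandas.to_datetime does this parsing vectorized and .dt.date extracts the date component.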
[jira] [Created] (SPARK-31058) Consolidate the implementation of quoteIfNeeded
DB Tsai created SPARK-31058: --- Summary: Consolidate the implementation of quoteIfNeeded Key: SPARK-31058 URL: https://issues.apache.org/jira/browse/SPARK-31058 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.5 Reporter: DB Tsai There are two implementations of quoteIfNeeded: one is in *org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quote* and the other is in *OrcFiltersBase.quoteAttributeNameIfNeeded*. This PR will consolidate them into one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31052) Fix flaky test "DAGSchedulerSuite.shuffle fetch failed on speculative task, but original task succeed"
[ https://issues.apache.org/jira/browse/SPARK-31052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang resolved SPARK-31052. -- Fix Version/s: 3.0.0 Target Version/s: 3.0.0 Assignee: wuyi Resolution: Fixed Fixed by https://github.com/apache/spark/pull/27809 > Fix flaky test "DAGSchedulerSuite.shuffle fetch failed on speculative task, > but original task succeed" > -- > > Key: SPARK-31052 > URL: https://issues.apache.org/jira/browse/SPARK-31052 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > Test "shuffle fetch failed on speculative task, but original task succeed" is > flaky. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31052) Fix flaky test "DAGSchedulerSuite.shuffle fetch failed on speculative task, but original task succeed"
[ https://issues.apache.org/jira/browse/SPARK-31052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang updated SPARK-31052: - Summary: Fix flaky test "DAGSchedulerSuite.shuffle fetch failed on speculative task, but original task succeed" (was: Fix flaky test of SPARK-30388) > Fix flaky test "DAGSchedulerSuite.shuffle fetch failed on speculative task, > but original task succeed" > -- > > Key: SPARK-31052 > URL: https://issues.apache.org/jira/browse/SPARK-31052 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > Test "shuffle fetch failed on speculative task, but original task succeed" is > flaky. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30886) Deprecate two-parameter TRIM/LTRIM/RTRIM functions
[ https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30886: -- Summary: Deprecate two-parameter TRIM/LTRIM/RTRIM functions (was: Deprecate two-parameter TRIM/LTRIM/RTRIM function) > Deprecate two-parameter TRIM/LTRIM/RTRIM functions > -- > > Key: SPARK-30886 > URL: https://issues.apache.org/jira/browse/SPARK-30886 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > Apache Spark community decided to keep the existing esoteric two-parameter > use cases with a proper warning. This JIRA aims to show warning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30886) Deprecate two-parameter TRIM/LTRIM/RTRIM function
[ https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30886: -- Summary: Deprecate two-parameter TRIM/LTRIM/RTRIM function (was: Deprecate two-parameter TRIM function) > Deprecate two-parameter TRIM/LTRIM/RTRIM function > - > > Key: SPARK-30886 > URL: https://issues.apache.org/jira/browse/SPARK-30886 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > Apache Spark community decided to keep the existing esoteric two-parameter > use cases with a proper warning. This JIRA aims to show warning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31057) approxQuantile function of spark , not taking List as first parameter
[ https://issues.apache.org/jira/browse/SPARK-31057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shyam updated SPARK-31057: -- Shepherd: Sean R. Owen > approxQuantile function of spark , not taking List as first parameter > -- > > Key: SPARK-31057 > URL: https://issues.apache.org/jira/browse/SPARK-31057 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1, 2.4.3 > Environment: spark-sql-2.4.3 and Eclipse Neon IDE. > >Reporter: Shyam >Priority: Major > > I am using spark-sql-2.4.1v in my project with Java 8. > I need to calculate quantiles on some of the (calculated) columns > (i.e. con_dist_1, con_dist_2) of the dataframe df given below. > {{List<String> calcColmns = Arrays.asList("con_dist_1","con_dist_2")}} > When I try to use the first version of approxQuantile, i.e. > approxQuantile(List<String>, List<Double>, double), as below > Dataset<Row> df = //dataset > {{List<List<Double>> quants = df.stat().approxQuantile(calcColmns, > Array(0.0,0.1,0.5), 0.0);}} > *It gives the error:* > {quote}The method approxQuantile(String, double[], double) in the type > DataFrameStatFunctions is not applicable for the arguments (List<String>, List<Double>, > double) > {quote} > So what is wrong here? I am doing this in my Eclipse IDE. Why is the List > overload not invoked even though I am passing a List? > I'd really appreciate any help on this. > More details are added here: > [https://stackoverflow.com/questions/60550152/issue-with-approxquantile-of-spark-not-recognizing-liststring] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31057) approxQuantile function of spark , not taking List as first parameter
Shyam created SPARK-31057: - Summary: approxQuantile function of spark , not taking List as first parameter Key: SPARK-31057 URL: https://issues.apache.org/jira/browse/SPARK-31057 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.3, 2.4.1 Environment: spark-sql-2.4.3 and Eclipse Neon IDE. Reporter: Shyam I am using spark-sql-2.4.1v in my project with Java 8. I need to calculate quantiles on some of the (calculated) columns (i.e. con_dist_1, con_dist_2) of the dataframe df given below. {{List<String> calcColmns = Arrays.asList("con_dist_1","con_dist_2")}} When I try to use the first version of approxQuantile, i.e. approxQuantile(List<String>, List<Double>, double), as below Dataset<Row> df = //dataset {{List<List<Double>> quants = df.stat().approxQuantile(calcColmns, Array(0.0,0.1,0.5), 0.0);}} *It gives the error:* {quote}The method approxQuantile(String, double[], double) in the type DataFrameStatFunctions is not applicable for the arguments (List<String>, List<Double>, double) {quote} So what is wrong here? I am doing this in my Eclipse IDE. Why is the List overload not invoked even though I am passing a List? I'd really appreciate any help on this. More details are added here: [https://stackoverflow.com/questions/60550152/issue-with-approxquantile-of-spark-not-recognizing-liststring] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
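Separately from the Java overload question above, an exact nearest-rank quantile is handy as a reference when checking approxQuantile results, since approxQuantile trades accuracy for speed via its relative-error argument. A small Python sketch (the helper name is hypothetical, not a Spark API):

```python
def exact_quantile(values, q):
    """Exact q-quantile by nearest rank: sort the values and index at
    floor(q * n), clamped to the last element. Usable as a ground-truth
    check against an approximate quantile with a small relative error."""
    s = sorted(values)
    idx = min(int(q * len(s)), len(s) - 1)
    return s[idx]
```

With relativeError 0.0, as in the report above, Spark's result should match such an exact computation.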
[jira] [Commented] (SPARK-20901) Feature parity for ORC with Parquet
[ https://issues.apache.org/jira/browse/SPARK-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052384#comment-17052384 ] Felix Kizhakkel Jose commented on SPARK-20901: -- Hi [~dongjoon], I was trying to choose between ORC and Parquet formats (using AWS Glue /Spark). While researching I came across this parity feature SPARK-20901. What features are not implemented for ORC to have complete feature parity as Parquet in Spark? Here I could see everything (issues linked) listed here are either Resolved or Closed, so I am confused. Could you please provide some insights? > Feature parity for ORC with Parquet > --- > > Key: SPARK-20901 > URL: https://issues.apache.org/jira/browse/SPARK-20901 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to track the feature parity for ORC with Parquet. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host
[ https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052348#comment-17052348 ] Attila Zsolt Piros commented on SPARK-27651: Well, in the case of dynamic allocation and a recomputation, the executors could already be gone. > Avoid the network when block manager fetches shuffle blocks from the same host > -- > > Key: SPARK-27651 > URL: https://issues.apache.org/jira/browse/SPARK-27651 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > > When a shuffle block (content) is fetched, the network is always used, even > when it is fetched from the external shuffle service running on the same > host. This can be avoided by getting the local directories of the same-host > executors from the external shuffle service and accessing those blocks from > the disk directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30886) Deprecate two-parameter TRIM function
[ https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30886: -- Summary: Deprecate two-parameter TRIM function (was: Deprecate LTRIM, RTRIM, and two-parameter TRIM functions) > Deprecate two-parameter TRIM function > - > > Key: SPARK-30886 > URL: https://issues.apache.org/jira/browse/SPARK-30886 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > Apache Spark community decided to keep the existing esoteric two-parameter > use cases with a proper warning. This JIRA aims to show warning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31056) Add CalendarIntervals division
Enrico Minack created SPARK-31056: - Summary: Add CalendarIntervals division Key: SPARK-31056 URL: https://issues.apache.org/jira/browse/SPARK-31056 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Enrico Minack {{CalendarInterval}} should support division. A {{CalendarInterval}} consists of three time components: {{months}}, {{days}} and {{microseconds}}. Division can only be defined between intervals that each have a single non-zero time component, and both intervals must have the same non-zero component; otherwise the division expression would be ambiguous. This makes it possible to evaluate the magnitude of a {{CalendarInterval}} in SQL expressions:
{code}
Seq((Timestamp.valueOf("2020-02-01 12:00:00"), Timestamp.valueOf("2020-02-01 13:30:25")))
  .toDF("start", "end")
  .withColumn("interval", $"end" - $"start")
  .withColumn("interval [h]", $"interval" / lit("1 hour").cast(CalendarIntervalType))
  .withColumn("rate [€/h]", lit(1.45))
  .withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
  .show(false)

+-------------------+-------------------+-----------------------------+------------+----------+----------+
|start              |end                |interval                     |interval [h]|rate [€/h]|price [€] |
+-------------------+-------------------+-----------------------------+------------+----------+----------+
|2020-02-01 12:00:00|2020-02-01 13:30:25|1 hours 30 minutes 25 seconds|1.5069      |1.45      |2.18506943|
+-------------------+-------------------+-----------------------------+------------+----------+----------+
{code}
The currently available approach is
{code}
Seq((Timestamp.valueOf("2020-02-01 12:00:00"), Timestamp.valueOf("2020-02-01 13:30:25")))
  .toDF("start", "end")
  .withColumn("interval [s]", unix_timestamp($"end") - unix_timestamp($"start"))
  .withColumn("interval [h]", $"interval [s]" / 3600)
  .withColumn("rate [€/h]", lit(1.45))
  .withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
  .show(false)
{code}
Going through {{unix_timestamp}} is a hack, and it pollutes the SQL query with unrelated semantics (the Unix timestamp is completely irrelevant for this computation). It is merely there because there is currently no way to access the length of a {{CalendarInterval}}. Dividing an interval by another interval provides a means to measure its length in an arbitrary unit (minutes, hours, quarter hours). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
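The proposed semantics — division defined only when both intervals have the same single non-zero time component — can be sketched outside Spark. The tuple encoding and function below are illustrative assumptions, not Spark's implementation:

```python
MICROS_PER_HOUR = 3_600_000_000


def divide_intervals(a, b):
    """Divide two calendar intervals, each a (months, days, microseconds)
    tuple. Defined only when each interval has exactly one non-zero
    component and both non-zero components are the same; anything else
    would make the ratio ambiguous (months vs. days have no fixed ratio)."""
    nz_a = [i for i, v in enumerate(a) if v != 0]
    nz_b = [i for i, v in enumerate(b) if v != 0]
    if len(nz_a) != 1 or nz_a != nz_b:
        raise ValueError("ambiguous interval division")
    i = nz_a[0]
    return a[i] / b[i]


# 1 hour 30 minutes 25 seconds divided by 1 hour, all in microseconds.
interval = (0, 0, MICROS_PER_HOUR + 30 * 60_000_000 + 25 * 1_000_000)
one_hour = (0, 0, MICROS_PER_HOUR)
ratio = divide_intervals(interval, one_hour)
```

Dividing the interval from the example by "1 hour" yields roughly 1.5069, matching the "interval [h]" column in the table above.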
[jira] [Created] (SPARK-31055) Update config docs for shuffle local host reads to have dep on external shuffle service
Thomas Graves created SPARK-31055: - Summary: Update config docs for shuffle local host reads to have dep on external shuffle service Key: SPARK-31055 URL: https://issues.apache.org/jira/browse/SPARK-31055 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.0.0 Reporter: Thomas Graves with SPARK-27651 we now support host local reads for shuffle, but only when external shuffle service is enabled. Update the config docs to state that. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31054) Turn on deprecation in Scala REPL/spark-shell by default
wuyi created SPARK-31054: Summary: Turn on deprecation in Scala REPL/spark-shell by default Key: SPARK-31054 URL: https://issues.apache.org/jira/browse/SPARK-31054 Project: Spark Issue Type: Improvement Components: Spark Shell Affects Versions: 3.0.0 Reporter: wuyi Turn on deprecation in the Scala REPL/spark-shell by default, so users can always see the details about deprecated APIs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host
[ https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052251#comment-17052251 ] Thomas Graves commented on SPARK-27651: --- thanks, that makes sense. I can look in more details at the code, but I assume the executors could ask the other executors for the directory list rather than going to the external shuffle service if we wanted to support it. > Avoid the network when block manager fetches shuffle blocks from the same host > -- > > Key: SPARK-27651 > URL: https://issues.apache.org/jira/browse/SPARK-27651 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > > When a shuffle block (content) is fetched the network is always used even > when it is fetched from the external shuffle service running on the same > host. This can be avoided by getting the local directories of the same host > executors from the external shuffle service and accessing those blocks from > the disk directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31053) mark connector API as Evolving
Wenchen Fan created SPARK-31053: --- Summary: mark connector API as Evolving Key: SPARK-31053 URL: https://issues.apache.org/jira/browse/SPARK-31053 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31034) ShuffleBlockFetcherIterator may can't create request for last group
[ https://issues.apache.org/jira/browse/SPARK-31034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31034. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27786 [https://github.com/apache/spark/pull/27786] > ShuffleBlockFetcherIterator may can't create request for last group > --- > > Key: SPARK-31034 > URL: https://issues.apache.org/jira/browse/SPARK-31034 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > When the size of all blocks is less than targetRemoteRequestSize and the size > of last block group is less than maxBlocksInFlightPerAddress, > ShuffleBlockFetcherIterator will not create a request for the last group. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
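The bug fixed above is a classic chunking off-by-one: a loop that only flushes a group when it reaches the size limit silently drops a partial final group. A Python sketch of the pattern and its fix (illustrative only, not Spark's actual fetch-request logic or thresholds):

```python
def group_blocks(sizes, max_per_group):
    """Group block sizes into fetch requests of at most max_per_group
    blocks each. Without the final flush below, a last group smaller than
    max_per_group would never be turned into a request, mirroring the
    symptom the ticket describes."""
    groups, current = [], []
    for size in sizes:
        current.append(size)
        if len(current) == max_per_group:
            groups.append(current)
            current = []
    if current:  # the fix: flush the partial last group too
        groups.append(current)
    return groups
```
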
[jira] [Assigned] (SPARK-31034) ShuffleBlockFetcherIterator may can't create request for last group
[ https://issues.apache.org/jira/browse/SPARK-31034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31034: --- Assignee: wuyi > ShuffleBlockFetcherIterator may can't create request for last group > --- > > Key: SPARK-31034 > URL: https://issues.apache.org/jira/browse/SPARK-31034 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > When the size of all blocks is less than targetRemoteRequestSize and the size > of last block group is less than maxBlocksInFlightPerAddress, > ShuffleBlockFetcherIterator will not create a request for the last group. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31052) Fix flaky test of SPARK-30388
wuyi created SPARK-31052: Summary: Fix flaky test of SPARK-30388 Key: SPARK-31052 URL: https://issues.apache.org/jira/browse/SPARK-31052 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: wuyi Test "shuffle fetch failed on speculative task, but original task succeed" is flaky. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31005) Support time zone ids in casting strings to timestamps
[ https://issues.apache.org/jira/browse/SPARK-31005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31005. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27753 [https://github.com/apache/spark/pull/27753] > Support time zone ids in casting strings to timestamps > -- > > Key: SPARK-31005 > URL: https://issues.apache.org/jira/browse/SPARK-31005 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Currently, Spark supports only time zone offsets in the formats: > * -[h]h:[m]m > * +[h]h:[m]m > * Z > The ticket aims to support any valid time zone ids at the end of timestamp > strings, for instance: > {code} > 2015-03-18T12:03:17.123456 Europe/Moscow > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31005) Support time zone ids in casting strings to timestamps
[ https://issues.apache.org/jira/browse/SPARK-31005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31005: --- Assignee: Maxim Gekk > Support time zone ids in casting strings to timestamps > -- > > Key: SPARK-31005 > URL: https://issues.apache.org/jira/browse/SPARK-31005 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Currently, Spark supports only time zone offsets in the formats: > * -[h]h:[m]m > * +[h]h:[m]m > * Z > The ticket aims to support any valid time zone ids at the end of timestamp > strings, for instance: > {code} > 2015-03-18T12:03:17.123456 Europe/Moscow > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
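The extension described here, a trailing IANA zone id rather than a fixed offset, can be modeled outside Spark. A small Python sketch of the target behaviour using the stdlib `zoneinfo` module (this illustrates the string format the ticket adds; it is not Spark's parser):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib, Python 3.9+

def parse_with_zone_id(s):
    """Parse 'YYYY-MM-DDTHH:MM:SS.ffffff Zone/Id' into an aware datetime.

    Mirrors the ticket's example: the timestamp part is ISO-formatted and
    the suffix after the last space is any valid IANA time zone id.
    """
    stamp, _, zone = s.rpartition(" ")
    return datetime.fromisoformat(stamp).replace(tzinfo=ZoneInfo(zone))

ts = parse_with_zone_id("2015-03-18T12:03:17.123456 Europe/Moscow")
```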
[jira] [Resolved] (SPARK-31051) Thriftserver operations other than SparkExecuteStatementOperation do not call onOperationClosed
[ https://issues.apache.org/jira/browse/SPARK-31051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juliusz Sompolski resolved SPARK-31051. --- Resolution: Not A Problem I'm just blind. They are there. > Thriftserver operations other than SparkExecuteStatementOperation do not call > onOperationClosed > --- > > Key: SPARK-31051 > URL: https://issues.apache.org/jira/browse/SPARK-31051 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Juliusz Sompolski >Priority: Major > > In Spark 3.0 onOperationClosed was implemented in HiveThriftServer2Listener > to track closing the operation in the thriftserver (after the client finishes > fetching). > However, it seems that only SparkExecuteStatementOperation calls it in its > close() function. Other operations need to do this as well.
[jira] [Created] (SPARK-31051) Thriftserver operations other than SparkExecuteStatementOperation do not call onOperationClosed
Juliusz Sompolski created SPARK-31051: - Summary: Thriftserver operations other than SparkExecuteStatementOperation do not call onOperationClosed Key: SPARK-31051 URL: https://issues.apache.org/jira/browse/SPARK-31051 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Juliusz Sompolski In Spark 3.0 onOperationClosed was implemented in HiveThriftServer2Listener to track closing the operation in the thriftserver (after the client finishes fetching). However, it seems that only SparkExecuteStatementOperation calls it in its close() function. Other operations need to do this as well.
[jira] [Commented] (SPARK-30100) Decimal Precision Inferred from JDBC via Spark
[ https://issues.apache.org/jira/browse/SPARK-30100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052050#comment-17052050 ] Joby Joje commented on SPARK-30100: --- [~hyukjin.kwon] is there any workaround for this precision data loss ? > Decimal Precision Inferred from JDBC via Spark > -- > > Key: SPARK-30100 > URL: https://issues.apache.org/jira/browse/SPARK-30100 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Joby Joje >Priority: Major > > When trying to load data from JDBC(Oracle) into Spark, there seems to be > precision loss in the decimal field, as per my understanding Spark supports > *DECIMAL(38,18)*. The field from the Oracle is DECIMAL(38,14), whereas Spark > rounds off the last four digits making it a precision of DECIMAL(38,10). This > is happening to few fields in the dataframe where the column is fetched using > a CASE statement whereas in the same query another field populates the right > schema. > Tried to pass the > {code:java} > spark.sql.decimalOperations.allowPrecisionLoss=false{code} > conf in the Spark-submit though didn't get the desired results. > {code:java} > jdbcDF = spark.read \ > .format("jdbc") \ > .option("url", "ORACLE") \ > .option("dbtable", "QUERY") \ > .option("user", "USERNAME") \ > .option("password", "PASSWORD") \ > .load(){code} > So considering that the Spark infers the schema from a sample records, how > does this work here? Does it use the results of the query i.e (SELECT * FROM > TABLE_NAME JOIN ...) or does it take a different route to guess the schema > for itself? Can someone throw some light on this and advise how to achieve > the right decimal precision on this regards without manipulating the query as > doing a CAST on the query does solve the issue, but would prefer to get some > alternatives. 
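On the workaround question above: Spark's JDBC data source supports a `customSchema` read option that overrides the inferred type for named columns, which is a commonly suggested way to keep the source's DECIMAL(38,14) scale without rewriting the query. A hedged sketch (the connection options are the placeholders from the report, `AMOUNT` is a hypothetical column name, and `spark` is assumed to be an existing SparkSession):

```python
# Sketch of the customSchema workaround; not verified against the
# reporter's Oracle environment. AMOUNT is a hypothetical column name.
jdbcDF = (spark.read
    .format("jdbc")
    .option("url", "ORACLE")        # placeholders from the report
    .option("dbtable", "QUERY")
    .option("user", "USERNAME")
    .option("password", "PASSWORD")
    # Override the inferred type; list each affected column here.
    .option("customSchema", "AMOUNT DECIMAL(38, 14)")
    .load())
```

An explicit CAST in the pushed-down query, which the reporter notes does solve the issue, remains the other option; `customSchema` just moves the type declaration out of the SQL text.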
[jira] [Resolved] (SPARK-31046) Make more efficient and clean up AQE update UI code
[ https://issues.apache.org/jira/browse/SPARK-31046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31046. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27799 [https://github.com/apache/spark/pull/27799] > Make more efficient and clean up AQE update UI code > --- > > Key: SPARK-31046 > URL: https://issues.apache.org/jira/browse/SPARK-31046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Wei Xue >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31046) Make more efficient and clean up AQE update UI code
[ https://issues.apache.org/jira/browse/SPARK-31046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31046: --- Assignee: Wei Xue > Make more efficient and clean up AQE update UI code > --- > > Key: SPARK-31046 > URL: https://issues.apache.org/jira/browse/SPARK-31046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Wei Xue >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host
[ https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-27651: --- Description: When a shuffle block (content) is fetched the network is always used even when it is fetched from the external shuffle service running on the same host. This can be avoided by getting the local directories of the same host executors from the external shuffle service and accessing those blocks from the disk directly. (was: When a shuffle block (content) is fetched the network is always used even when it is fetched from an executor (or the external shuffle service) running on the same host.) > Avoid the network when block manager fetches shuffle blocks from the same host > -- > > Key: SPARK-27651 > URL: https://issues.apache.org/jira/browse/SPARK-27651 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > > When a shuffle block (content) is fetched the network is always used even > when it is fetched from the external shuffle service running on the same > host. This can be avoided by getting the local directories of the same host > executors from the external shuffle service and accessing those blocks from > the disk directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host
[ https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051995#comment-17051995 ] Attila Zsolt Piros commented on SPARK-27651: Yes, the final implementation works only when the external shuffle service is used, since the local directories of the other host-local executors are obtained from the external shuffle service. The initial implementation, when the PR was opened, used the driver to get the host-local directories. The technical reasons for asking the external shuffle service instead were: * decreasing network pressure on the driver (the main reason); * avoiding an unbounded map from executors to local dirs (or a bounded one with complex fallback logic at the fetcher). Such a map would also be redundant, as this information is already available at the external shuffle service, just stored in a distributed way: each running external shuffle service process stores executor data only for the executors on its own host. > Avoid the network when block manager fetches shuffle blocks from the same host > -- > > Key: SPARK-27651 > URL: https://issues.apache.org/jira/browse/SPARK-27651 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > > When a shuffle block (content) is fetched the network is always used even > when it is fetched from an executor (or the external shuffle service) running > on the same host.
[jira] [Resolved] (SPARK-31024) Allow specifying session catalog name (spark_catalog) in qualified column names
[ https://issues.apache.org/jira/browse/SPARK-31024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31024. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27776 [https://github.com/apache/spark/pull/27776] > Allow specifying session catalog name (spark_catalog) in qualified column > names > --- > > Key: SPARK-31024 > URL: https://issues.apache.org/jira/browse/SPARK-31024 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Currently, the user cannot specify the session catalog name when using > qualified column names for v1 tables: > {code:java} > SELECT spark_catalog.default.t.i FROM spark_catalog.default.t > {code} > fails with "cannot resolve '`spark_catalog.default.t.i`". > This is inconsistent with v2 tables where catalog name can be used: > {code:java} > SELECT testcat.ns1.tbl.id FROM testcat.ns1.tbl.id > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31024) Allow specifying session catalog name (spark_catalog) in qualified column names
[ https://issues.apache.org/jira/browse/SPARK-31024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31024: --- Assignee: Terry Kim > Allow specifying session catalog name (spark_catalog) in qualified column > names > --- > > Key: SPARK-31024 > URL: https://issues.apache.org/jira/browse/SPARK-31024 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > Currently, the user cannot specify the session catalog name when using > qualified column names for v1 tables: > {code:java} > SELECT spark_catalog.default.t.i FROM spark_catalog.default.t > {code} > fails with "cannot resolve '`spark_catalog.default.t.i`". > This is inconsistent with v2 tables where catalog name can be used: > {code:java} > SELECT testcat.ns1.tbl.id FROM testcat.ns1.tbl.id > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15798) Secondary sort in Dataset/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051990#comment-17051990 ] Melitta Dragaschnig commented on SPARK-15798: - Hi all, I am a frequent user of tresata's spark-sorted library (thank you [~koert]!) to get the Secondary Sort functionality (for large groups, in order to avoid memory issues), so I tried to figure out whether there are plans to merge this useful functionality into the core library. After checking the progression of this Jira issue and seeing that it's marked as Incomplete, has been closed but no Fix versions are given, my conclusion was that presently it is not provided by the core library, and it's advisable to continue using spark-sorted for the time being. Is my assumption correct? Also any further information on additional ways to stay informed about the development of this topic would be greatly appreciated! > Secondary sort in Dataset/DataFrame > --- > > Key: SPARK-15798 > URL: https://issues.apache.org/jira/browse/SPARK-15798 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: koert kuipers >Priority: Major > Labels: bulk-closed > > Secondary sort for Spark RDDs was discussed in > https://issues.apache.org/jira/browse/SPARK-3655 > Since the RDD API allows for easy extensions outside the core library this > was implemented separately here: > https://github.com/tresata/spark-sorted > However it seems to me that with Dataset an implementation in a 3rd party > library of such a feature is not really an option. > Dataset already has methods that suggest a secondary sort is present, such as > in KeyValueGroupedDataset: > {noformat} > def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): > Dataset[U] > {noformat} > This operation pushes all the data to the reducer, something you only would > want to do if you need the elements in a particular order. 
> How about as an API sortBy methods in KeyValueGroupedDataset and > RelationalGroupedDataset? > {noformat} > dataFrame.groupBy("a").sortBy("b").fold(...) > {noformat} > (yes i know RelationalGroupedDataset doesnt have a fold yet... but it should > :)) > {noformat} > dataset.groupBy(_._1).sortBy(_._3).flatMapGroups(...) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
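The proposed `groupBy(...).sortBy(...)` semantics can be sketched in plain Python: one composite sort up front, then streaming each group in order (a toy single-machine model of the secondary-sort pattern, not Spark or spark-sorted code):

```python
from itertools import groupby
from operator import itemgetter

def grouped_sorted(records, key, secondary):
    """Yield (group_key, iterator of records ordered by `secondary`).

    Toy model of secondary sort: a single composite sort up front means
    each group's records arrive already ordered, so no group ever has to
    be materialized in memory just to sort it -- the property that makes
    the pattern attractive for large groups.
    """
    ordered = sorted(records, key=lambda r: (key(r), secondary(r)))
    for k, grp in groupby(ordered, key=key):
        yield k, grp

data = [("a", 3), ("b", 1), ("a", 1), ("b", 2), ("a", 2)]
result = {k: [v for _, v in grp]
          for k, grp in grouped_sorted(data, itemgetter(0), itemgetter(1))}
```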
[jira] [Resolved] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"
[ https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30668. - Resolution: Fixed Issue resolved by pull request 27537 [https://github.com/apache/spark/pull/27537] > to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern > "yyyy-MM-dd'T'HH:mm:ss.SSSz" > > > Key: SPARK-30668 > URL: https://issues.apache.org/jira/browse/SPARK-30668 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Maxim Gekk >Priority: Blocker > Fix For: 3.0.0 > > > {code:java} > SELECT to_timestamp("2020-01-27T20:06:11.847-0800", > "yyyy-MM-dd'T'HH:mm:ss.SSSz") > {code} > This can return a valid value in Spark 2.4 but return NULL in the latest > master > **2.4.5 RC2** > {code} > scala> sql("""SELECT to_timestamp("2020-01-27T20:06:11.847-0800", > "yyyy-MM-dd'T'HH:mm:ss.SSSz")""").show > ++ > |to_timestamp('2020-01-27T20:06:11.847-0800', 'yyyy-MM-dd\'T\'HH:mm:ss.SSSz')| > ++ > | 2020-01-27 20:06:11| > ++ > {code} > **2.2.3 ~ 2.4.4** (2.0.2 ~ 2.1.3 doesn't have `to_timestamp`). > {code} > spark-sql> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", > "yyyy-MM-dd'T'HH:mm:ss.SSSz"); > 2020-01-27 20:06:11 > {code}
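The reported input string can be parsed with an equivalent pattern outside Spark, which confirms the string itself is well-formed (a Python translation of the `yyyy-MM-dd'T'HH:mm:ss.SSSz` pattern from the issue title; this illustrates the format, it is not Spark's parser):

```python
from datetime import datetime

# Python equivalent of the Java pattern yyyy-MM-dd'T'HH:mm:ss.SSSz for the
# reported input; %z accepts the -0800 style offset used in the ticket.
ts = datetime.strptime("2020-01-27T20:06:11.847-0800",
                       "%Y-%m-%dT%H:%M:%S.%f%z")
```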
[jira] [Resolved] (SPARK-31050) Disable flaky KafkaDelegationTokenSuite
[ https://issues.apache.org/jira/browse/SPARK-31050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31050. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27789 [https://github.com/apache/spark/pull/27789] > Disable flaky KafkaDelegationTokenSuite > --- > > Key: SPARK-31050 > URL: https://issues.apache.org/jira/browse/SPARK-31050 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > Disable flaky KafkaDelegationTokenSuite since it's too flaky. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31050) Disable flaky KafkaDelegationTokenSuite
[ https://issues.apache.org/jira/browse/SPARK-31050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31050: - Assignee: wuyi > Disable flaky KafkaDelegationTokenSuite > --- > > Key: SPARK-31050 > URL: https://issues.apache.org/jira/browse/SPARK-31050 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > Disable flaky KafkaDelegationTokenSuite since it's too flaky. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30541) Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite
[ https://issues.apache.org/jira/browse/SPARK-30541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051877#comment-17051877 ] Dongjoon Hyun commented on SPARK-30541: --- Please see the PR. This is decided to be a blocker for 3.0.0. > Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite > --- > > Key: SPARK-30541 > URL: https://issues.apache.org/jira/browse/SPARK-30541 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Blocker > > The test suite has been failing intermittently as of now: > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116862/testReport/] > > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.(It is not a test it > is a sbt.testing.SuiteSelector) > > {noformat} > Error Details > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 3939 times over > 1.000122353532 minutes. Last failure message: KeeperErrorCode = > AuthFailed for /brokers/ids. > Stack Trace > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 3939 times over > 1.000122353532 minutes. Last failure message: KeeperErrorCode = > AuthFailed for /brokers/ids. 
> at > org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432) > at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439) > at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391) > at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479) > at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:337) > at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:336) > at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:292) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:58) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: sbt.ForkMain$ForkError: > org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = > AuthFailed for /brokers/ids > at org.apache.zookeeper.KeeperException.create(KeeperException.java:130) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) > at > kafka.zookeeper.AsyncResponse.resultException(ZooKeeperClient.scala:554) > at kafka.zk.KafkaZkClient.getChildren(KafkaZkClient.scala:719) 
> at kafka.zk.KafkaZkClient.getSortedBrokerList(KafkaZkClient.scala:455) > at > kafka.zk.KafkaZkClient.getAllBrokersInCluster(KafkaZkClient.scala:404) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.$anonfun$setup$3(KafkaTestUtils.scala:293) > at > org.scalatest.concurrent.Eventually.makeAValiantAttempt$1(Eventually.scala:395) > at > org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:409) > ... 20 more > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30541) Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite
[ https://issues.apache.org/jira/browse/SPARK-30541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30541: -- Target Version/s: 3.0.0 > Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite > --- > > Key: SPARK-30541 > URL: https://issues.apache.org/jira/browse/SPARK-30541 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Blocker > > The test suite has been failing intermittently as of now: > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116862/testReport/] > > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.(It is not a test it > is a sbt.testing.SuiteSelector) > > {noformat} > Error Details > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 3939 times over > 1.000122353532 minutes. Last failure message: KeeperErrorCode = > AuthFailed for /brokers/ids. > Stack Trace > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 3939 times over > 1.000122353532 minutes. Last failure message: KeeperErrorCode = > AuthFailed for /brokers/ids. 
> at > org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432) > at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439) > at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391) > at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479) > at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:337) > at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:336) > at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:292) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:58) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: sbt.ForkMain$ForkError: > org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = > AuthFailed for /brokers/ids > at org.apache.zookeeper.KeeperException.create(KeeperException.java:130) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) > at > kafka.zookeeper.AsyncResponse.resultException(ZooKeeperClient.scala:554) > at kafka.zk.KafkaZkClient.getChildren(KafkaZkClient.scala:719) 
> at kafka.zk.KafkaZkClient.getSortedBrokerList(KafkaZkClient.scala:455) > at > kafka.zk.KafkaZkClient.getAllBrokersInCluster(KafkaZkClient.scala:404) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.$anonfun$setup$3(KafkaTestUtils.scala:293) > at > org.scalatest.concurrent.Eventually.makeAValiantAttempt$1(Eventually.scala:395) > at > org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:409) > ... 20 more > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31050) Disable flaky KafkaDelegationTokenSuite
[ https://issues.apache.org/jira/browse/SPARK-31050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31050: - Description: Disable flaky KafkaDelegationTokenSuite since it's too flaky. > Disable flaky KafkaDelegationTokenSuite > --- > > Key: SPARK-31050 > URL: https://issues.apache.org/jira/browse/SPARK-31050 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 3.0.0 > Environment: Disable flaky KafkaDelegationTokenSuite since it's too > flaky. >Reporter: wuyi >Priority: Major > > Disable flaky KafkaDelegationTokenSuite since it's too flaky. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31050) Disable flaky KafkaDelegationTokenSuite
[ https://issues.apache.org/jira/browse/SPARK-31050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31050: - Environment: (was: Disable flaky KafkaDelegationTokenSuite since it's too flaky.) > Disable flaky KafkaDelegationTokenSuite > --- > > Key: SPARK-31050 > URL: https://issues.apache.org/jira/browse/SPARK-31050 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > Disable flaky KafkaDelegationTokenSuite since it's too flaky. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org