[jira] [Updated] (SPARK-30279) Support 32 or more grouping attributes for GROUPING_ID

2020-03-05 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-30279:
-
Affects Version/s: 2.4.6

> Support 32 or more grouping attributes for GROUPING_ID 
> ---
>
> Key: SPARK-30279
> URL: https://issues.apache.org/jira/browse/SPARK-30279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.6
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims to support 32 or more grouping attributes for 
> GROUPING_ID. In the current master, an integer overflow can occur when computing 
> grouping IDs:
> https://github.com/apache/spark/blob/e75d9afb2f282ce79c9fd8bce031287739326a4f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L613
> For example, the query below generates wrong grouping IDs in the master:
> {code}
> scala> val numCols = 32 // or, 31
> scala> val cols = (0 until numCols).map { i => s"c$i" }
> scala> sql(s"create table test_$numCols (${cols.map(c => s"$c int").mkString(",")}, v int) using parquet")
> scala> val insertVals = (0 until numCols).map { _ => 1 }.mkString(",")
> scala> sql(s"insert into test_$numCols values ($insertVals,3)")
> scala> sql(s"select grouping_id(), sum(v) from test_$numCols group by grouping sets ((${cols.mkString(",")}), (${cols.init.mkString(",")}))").show(10, false)
> scala> sql(s"drop table test_$numCols")
> // numCols = 32
> +-------------+------+
> |grouping_id()|sum(v)|
> +-------------+------+
> |0            |3     |
> |0            |3     | // Wrong Grouping ID
> +-------------+------+
> // numCols = 31
> +-------------+------+
> |grouping_id()|sum(v)|
> +-------------+------+
> |0            |3     |
> |1            |3     |
> +-------------+------+
> {code}
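
For context, a minimal Scala sketch (not the Spark source) of why a 32-bit integer 
bitmask breaks down once there are 32 grouping attributes: the bit for the 32nd 
attribute lands on the sign bit of an Int, while a Long mask still has room.

{code}
// Illustrative sketch only, not the Spark implementation.
val bitForCol31AsInt  = 1 << 31    // -2147483648: the value overflows into the sign bit
val bitForCol31AsLong = 1L << 31   // 2147483648: still a distinct positive value
println((bitForCol31AsInt, bitForCol31AsLong))
{code}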



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31061) Impossible to change the provider of a table in the HiveMetaStore

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31061:
---

Assignee: Burak Yavuz

> Impossible to change the provider of a table in the HiveMetaStore
> -
>
> Key: SPARK-31061
> URL: https://issues.apache.org/jira/browse/SPARK-31061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>
> Currently, it's impossible to alter the datasource of a table in the 
> HiveMetaStore by using alterTable, as the HiveExternalCatalog doesn't change 
> the provider table property during an alterTable command. This is required to 
> support changing table formats when using commands like REPLACE TABLE.
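
A rough sketch of the expected behaviour (hypothetical helper, not the fix in the 
actual PR): alterTable would need to carry the new provider through to the stored 
table metadata; if memory serves, "spark.sql.sources.provider" is the table 
property HiveExternalCatalog records the provider under.

{code}
// Hypothetical sketch, not the actual fix: propagate a new provider into the
// catalog table instead of silently keeping the old one.
import org.apache.spark.sql.catalyst.catalog.CatalogTable

def withNewProvider(table: CatalogTable, newProvider: String): CatalogTable =
  table.copy(
    provider = Some(newProvider),
    // assumed property key; HiveExternalCatalog reads the provider back from it
    properties = table.properties + ("spark.sql.sources.provider" -> newProvider))
{code}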



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31061) Impossible to change the provider of a table in the HiveMetaStore

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31061.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27822
[https://github.com/apache/spark/pull/27822]

> Impossible to change the provider of a table in the HiveMetaStore
> -
>
> Key: SPARK-31061
> URL: https://issues.apache.org/jira/browse/SPARK-31061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, it's impossible to alter the datasource of a table in the 
> HiveMetaStore by using alterTable, as the HiveExternalCatalog doesn't change 
> the provider table property during an alterTable command. This is required to 
> support changing table formats when using commands like REPLACE TABLE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31068) IllegalArgumentException in BroadcastExchangeExec

2020-03-05 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-31068:
--

 Summary: IllegalArgumentException in BroadcastExchangeExec
 Key: SPARK-31068
 URL: https://issues.apache.org/jira/browse/SPARK-31068
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Lantao Jin


{code}
Caused by: org.apache.spark.SparkException: Failed to materialize query stage: 
BroadcastQueryStage 0
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true], 
input[1, bigint, true], input[2, int, true]))
   +- *(1) Project [guid#138126, session_skey#138127L, seqnum#138132]
  +- *(1) Filter isnotnull(session_start_dt#138129) && 
(session_start_dt#138129 = 2020-01-01)) && isnotnull(seqnum#138132)) && 
isnotnull(session_skey#138127L)) && isnotnull(guid#138126))
 +- *(1) FileScan parquet p_soj_cl_t.clav_events[guid#138126, 
session_skey#138127L, session_start_dt#138129, seqnum#138132] DataFilters: 
[isnotnull(session_start_dt#138129), (session_start_dt#138129 = 2020-01-01), 
isnotnull(seqnum#138..., Format: Parquet, Location: 
TahoeLogFileIndex[hdfs://hermes-rno/workspaces/P_SOJ_CL_T/clav_events], 
PartitionFilters: [], PushedFilters: [IsNotNull(session_start_dt), 
EqualTo(session_start_dt,2020-01-01), IsNotNull(seqnum), IsNotNull(..., 
ReadSchema: 
struct, 
SelectedBucketsCount: 1000 out of 1000, UsedIndexes: []

at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$$anonfun$generateFinalPlan$3.apply(AdaptiveSparkPlanExec.scala:230)
at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$$anonfun$generateFinalPlan$3.apply(AdaptiveSparkPlanExec.scala:225)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.generateFinalPlan(AdaptiveSparkPlanExec.scala:225)
... 48 more
Caused by: java.lang.IllegalArgumentException: Initial capacity 670166426 
exceeds maximum capacity of 536870912
at org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:196)
at 
org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:219)
at 
org.apache.spark.sql.execution.joins.UnsafeHashedRelation$.apply(HashedRelation.scala:340)
at 
org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:123)
at 
org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:964)
at 
org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:952)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$9.apply(BroadcastExchangeExec.scala:220)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$9.apply(BroadcastExchangeExec.scala:207)
at 
org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:128)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:206)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:172)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
... 3 more
{code}
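
Reading the trace, the failure is the capacity guard in BytesToBytesMap: the 
broadcast hash relation asks for more entries than the map's hard cap. A small 
sketch of the check implied by the message (the numbers are taken from the trace 
above; the shape of the guard itself is an assumption):

{code}
// Numbers from the error message above; the check is sketched, not copied from Spark.
val maxCapacity = 1 << 29          // 536870912, the cap named in the exception
val requestedCapacity = 670166426  // initial capacity requested for the broadcast relation
require(requestedCapacity <= maxCapacity,
  s"Initial capacity $requestedCapacity exceeds maximum capacity of $maxCapacity")
{code}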



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31067) Spark 2.4.* SQL query with partition columns scans entire AVRO data

2020-03-05 Thread Gopal (Jira)
Gopal created SPARK-31067:
-

 Summary: Spark 2.4.* SQL query with partition columns scans entire 
AVRO data 
 Key: SPARK-31067
 URL: https://issues.apache.org/jira/browse/SPARK-31067
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4, 2.4.0
Reporter: Gopal


Partition Column: dt

SQL Query: select distinct dt from table1 

Table format: AVRO

Spark scans the entire Avro data in the table just to get the distinct dt values, even though dt is a partition column.
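
A minimal sketch of the expectation (table and column names are taken from the 
report; the partitioned Avro table is assumed to already exist):

{code}
// For a table partitioned by dt, the distinct values of the partition column
// should come from partition metadata rather than a full data scan.
val df = spark.sql("SELECT DISTINCT dt FROM table1")
df.explain()  // the reported 2.4.x behaviour: the plan reads all Avro files in table1
{code}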



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master

2020-03-05 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053080#comment-17053080
 ] 

Nicholas Chammas commented on SPARK-31043:
--

FWIW I was seeing the same {{java.lang.NoClassDefFoundError: 
org/w3c/dom/ElementTraversal}} issue on {{branch-3.0}} and pulling the latest 
changes fixed it for me too.

> Spark 3.0 built against hadoop2.7 can't start standalone master
> ---
>
> Key: SPARK-31043
> URL: https://issues.apache.org/jira/browse/SPARK-31043
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Critical
> Fix For: 3.0.0
>
>
> Trying to start a standalone master when building Spark branch-3.0 with 
> hadoop2.7 fails with:
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/w3c/dom/ElementTraversal
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> ...
> Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> ... 42 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()

2020-03-05 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053079#comment-17053079
 ] 

Nicholas Chammas commented on SPARK-31065:
--

Confirmed this issue is also present on {{branch-3.0}} as of commit 
{{9b48f3358d3efb523715a5f258e5ed83e28692f6}}.

> Empty string values cause schema_of_json() to return a schema not usable by 
> from_json()
> ---
>
> Key: SPARK-31065
> URL: https://issues.apache.org/jira/browse/SPARK-31065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a reproduction:
>   
> {code:python}
> from pyspark.sql.functions import from_json, schema_of_json
> json = '{"a": ""}'
> df = spark.createDataFrame([(json,)], schema=['json'])
> df.show()
> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> json_schema = schema_of_json(json)
> df.select(from_json('json', json_schema))
> # works fine
> json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema
> df.select(from_json('json', json_schema))
> {code}
> The output:
> {code:java}
> >>> from pyspark.sql.functions import from_json, schema_of_json
> >>> json = '{"a": ""}'
> >>> 
> >>> df = spark.createDataFrame([(json,)], schema=['json'])
> >>> df.show()
> +---------+
> |     json|
> +---------+
> |{"a": ""}|
> +---------+
> >>> 
> >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> >>> json_schema = schema_of_json(json)
> >>> df.select(from_json('json', json_schema))
> Traceback (most recent call last):
>   File ".../site-packages/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File 
> ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.functions.from_json.
> : org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 
> 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
> 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
> 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
> 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 
> 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 
> 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 
> 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 
> 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
> 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
> 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 
> 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 
> 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 
> 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 
> 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 
> 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 
> 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 
> 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 
> 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 
> 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 
> 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 
> 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 
> 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 
> 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 
> 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 
> 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6)
> == SQL ==
> struct
> --^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64)
>   at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpress

[jira] [Updated] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()

2020-03-05 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31065:
-
Affects Version/s: 3.0.0

> Empty string values cause schema_of_json() to return a schema not usable by 
> from_json()
> ---
>
> Key: SPARK-31065
> URL: https://issues.apache.org/jira/browse/SPARK-31065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a reproduction:
>   
> {code:python}
> from pyspark.sql.functions import from_json, schema_of_json
> json = '{"a": ""}'
> df = spark.createDataFrame([(json,)], schema=['json'])
> df.show()
> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> json_schema = schema_of_json(json)
> df.select(from_json('json', json_schema))
> # works fine
> json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema
> df.select(from_json('json', json_schema))
> {code}
> The output:
> {code:java}
> >>> from pyspark.sql.functions import from_json, schema_of_json
> >>> json = '{"a": ""}'
> >>> 
> >>> df = spark.createDataFrame([(json,)], schema=['json'])
> >>> df.show()
> +---------+
> |     json|
> +---------+
> |{"a": ""}|
> +---------+
> >>> 
> >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> >>> json_schema = schema_of_json(json)
> >>> df.select(from_json('json', json_schema))
> Traceback (most recent call last):
>   File ".../site-packages/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File 
> ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.functions.from_json.
> : org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 
> 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
> 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
> 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
> 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 
> 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 
> 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 
> 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 
> 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
> 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
> 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 
> 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 
> 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 
> 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 
> 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 
> 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 
> 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 
> 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 
> 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 
> 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 
> 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 
> 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 
> 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 
> 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 
> 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 
> 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6)
> == SQL ==
> struct
> --^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64)
>   at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpressions.scala:777)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:527)
>   at org.apache.spark.sq

[jira] [Created] (SPARK-31066) Disable useless and uncleaned hive SessionState initialization parts

2020-03-05 Thread Kent Yao (Jira)
Kent Yao created SPARK-31066:


 Summary: Disable useless and uncleaned hive SessionState 
initialization parts
 Key: SPARK-31066
 URL: https://issues.apache.org/jira/browse/SPARK-31066
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao


As a common usage pattern, and according to the Spark docs, users often copy 
their hive-site.xml into Spark directly from Hive projects. Sometimes that config 
file is not clean for Spark and can cause side effects.

For example, hive.session.history.enabled creates a session-history log for Hive 
jobs that is useless for Spark, and the file is not deleted on JVM exit.
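
For illustration, a hive-site.xml entry of the kind described (the property name 
comes from the description above; whether it appears in a given deployment is 
site-specific):

{code:xml}
<!-- Harmless in Hive, but when copied into Spark's conf dir it only creates a
     session-history log file that Spark never uses and never cleans up. -->
<property>
  <name>hive.session.history.enabled</name>
  <value>true</value>
</property>
{code}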



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()

2020-03-05 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053052#comment-17053052
 ] 

Nicholas Chammas commented on SPARK-31065:
--

cc [~hyukjin.kwon]

> Empty string values cause schema_of_json() to return a schema not usable by 
> from_json()
> ---
>
> Key: SPARK-31065
> URL: https://issues.apache.org/jira/browse/SPARK-31065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a reproduction:
>   
> {code:python}
> from pyspark.sql.functions import from_json, schema_of_json
> json = '{"a": ""}'
> df = spark.createDataFrame([(json,)], schema=['json'])
> df.show()
> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> json_schema = schema_of_json(json)
> df.select(from_json('json', json_schema))
> # works fine
> json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema
> df.select(from_json('json', json_schema))
> {code}
> The output:
> {code:java}
> >>> from pyspark.sql.functions import from_json, schema_of_json
> >>> json = '{"a": ""}'
> >>> 
> >>> df = spark.createDataFrame([(json,)], schema=['json'])
> >>> df.show()
> +---------+
> |     json|
> +---------+
> |{"a": ""}|
> +---------+
> >>> 
> >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> >>> json_schema = schema_of_json(json)
> >>> df.select(from_json('json', json_schema))
> Traceback (most recent call last):
>   File ".../site-packages/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File 
> ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.functions.from_json.
> : org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 
> 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
> 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
> 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
> 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 
> 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 
> 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 
> 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 
> 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
> 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
> 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 
> 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 
> 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 
> 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 
> 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 
> 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 
> 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 
> 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 
> 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 
> 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 
> 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 
> 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 
> 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 
> 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 
> 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 
> 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6)
> == SQL ==
> struct
> --^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64)
>   at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpressions.scala:777)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.

[jira] [Created] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()

2020-03-05 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-31065:


 Summary: Empty string values cause schema_of_json() to return a 
schema not usable by from_json()
 Key: SPARK-31065
 URL: https://issues.apache.org/jira/browse/SPARK-31065
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5
Reporter: Nicholas Chammas


Here's a reproduction:
  
{code:python}
from pyspark.sql.functions import from_json, schema_of_json
json = '{"a": ""}'

df = spark.createDataFrame([(json,)], schema=['json'])
df.show()

# chokes with org.apache.spark.sql.catalyst.parser.ParseException
json_schema = schema_of_json(json)
df.select(from_json('json', json_schema))

# works fine
json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema
df.select(from_json('json', json_schema))
{code}
The output:
{code:java}
>>> from pyspark.sql.functions import from_json, schema_of_json
>>> json = '{"a": ""}'
>>> 
>>> df = spark.createDataFrame([(json,)], schema=['json'])
>>> df.show()
+---------+
|     json|
+---------+
|{"a": ""}|
+---------+

>>> 
>>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
>>> json_schema = schema_of_json(json)
>>> df.select(from_json('json', json_schema))
Traceback (most recent call last):
  File ".../site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
  File 
".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.sql.functions.from_json.
: org.apache.spark.sql.catalyst.parser.ParseException: 
extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 
'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 
'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 
'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 
'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 
'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 'POSITION', 
'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 
'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 'SERDEPROPERTIES', 
'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 
'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 
'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 
'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 
'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 
'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 
'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 
'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 
'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 
'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 
'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 
'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', IDENTIFIER, 
BACKQUOTED_IDENTIFIER}(line 1, pos 6)

== SQL ==
struct
--^^^

at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64)
at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123)
at 
org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpressions.scala:777)
at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:527)
at org.apache.spark.sql.functions$.from_json(functions.scala:3606)
at org.apache.spark.sql.functions.from_json(functions.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Metho

[jira] [Resolved] (SPARK-30776) Support FValueRegressionSelector for continuous features and continuous labels

2020-03-05 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-30776.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27679
[https://github.com/apache/spark/pull/27679]

> Support FValueRegressionSelector for continuous features and continuous labels
> --
>
> Key: SPARK-30776
> URL: https://issues.apache.org/jira/browse/SPARK-30776
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.1.0
>
>
> Support FValueRegressionSelector for continuous features and continuous labels



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30776) Support FValueRegressionSelector for continuous features and continuous labels

2020-03-05 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-30776:


Assignee: Huaxin Gao

> Support FValueRegressionSelector for continuous features and continuous labels
> --
>
> Key: SPARK-30776
> URL: https://issues.apache.org/jira/browse/SPARK-30776
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Support FValueRegressionSelector for continuous features and continuous labels



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.

2020-03-05 Thread Prashant Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma reassigned SPARK-30985:
---

Assignee: (was: Prashant Sharma)

> Propagate SPARK_CONF_DIR files to driver and exec pods.
> ---
>
> Key: SPARK-30985
> URL: https://issues.apache.org/jira/browse/SPARK-30985
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Prashant Sharma
>Priority: Major
>
> SPARK_CONF_DIR hosts configuration files such as:
> 1) spark-defaults.conf - containing all the Spark properties.
> 2) log4j.properties - logger configuration.
> 3) spark-env.sh - environment variables to be set up on the driver and executors.
> 4) core-site.xml - Hadoop-related configuration.
> 5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
> 6) metrics.properties - Spark metrics.
> 7) Any user-specific, library- or framework-specific configuration file.
> Traditionally, SPARK_CONF_DIR has been the home of all user-specific 
> configuration files, and the default behaviour in YARN or standalone mode 
> is that these configuration files are copied to the worker nodes by the 
> users themselves, as required. In other words, they are not auto-copied.
> But in the case of Spark on Kubernetes, we use Spark images, and generally 
> these images are approved or undergo some kind of standardisation. These 
> files cannot simply be copied by the user into the SPARK_CONF_DIR of the 
> running executor and driver pods.
> So, at the moment, we have special casing for providing each configuration, 
> and for any other user-specific configuration file the process is more 
> complex, e.g. one can start with a custom Spark image with the 
> configuration files pre-installed.
> Examples of special casing are:
> 1. Hadoop configuration in spark.kubernetes.hadoop.configMapName
> 2. spark-env.sh as in spark.kubernetes.driverEnv.[EnvironmentVariableName]
> 3. log4j.properties as in https://github.com/apache/spark/pull/26193
> ... And where such special casing does not exist, users are simply out of 
> luck.
> This feature will let the user-specific configuration files be mounted into 
> the driver and executor pods' SPARK_CONF_DIR.
> At the moment it is not clear whether there is a need to let users specify 
> which config files to propagate to the driver and/or executors. If that 
> turns out to be helpful, we can increase the scope of this work or 
> create another JIRA issue to track it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30930) ML, GraphX 3.0 QA: API: Experimental, DeveloperApi, final, sealed audit

2020-03-05 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053017#comment-17053017
 ] 

Sean R. Owen commented on SPARK-30930:
--

My general  stance is: anything that's been public for a while is pretty much 
stable now. Given the stronger preference for not modifying APIs going forward, 
I can't see changing or removing these any more readily than one would a 
'stable' API.  So I'd be OK removing Experimental / DeveloperApi on anything 
public right now, unless there are specific reasons not to.

I would not un-seal or un-final classes unless there is a clear reason to 
believe it's intended to be an extensible API, however.

> ML, GraphX 3.0 QA: API: Experimental, DeveloperApi, final, sealed audit
> ---
>
> Key: SPARK-30930
> URL: https://issues.apache.org/jira/browse/SPARK-30930
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Critical
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31063) xml-apis is missing after xercesImpl bumped up to 2.12.0

2020-03-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-31063.
--
Resolution: Not A Problem

Issue fixed by a follow-up; this is not a problem anymore.

> xml-apis is missing after xercesImpl bumped up to 2.12.0
> 
>
> Key: SPARK-31063
> URL: https://issues.apache.org/jira/browse/SPARK-31063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Blocker
>
> {code:java}
>  ✘ kentyao@hulk  ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200305  
> bin/spark-sql
> 20/03/06 11:05:58 WARN Utils: Your hostname, hulk.local resolves to a 
> loopback address: 127.0.0.1; using 10.242.189.214 instead (on interface en0)
> 20/03/06 11:05:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/w3c/dom/ElementTraversal
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>   at org.apache.xerces.parsers.AbstractDOMParser.startDocument(Unknown 
> Source)
>   at org.apache.xerces.xinclude.XIncludeHandler.startDocument(Unknown 
> Source)
>   at org.apache.xerces.impl.dtd.XMLDTDValidator.startDocument(Unknown 
> Source)
>   at org.apache.xerces.impl.XMLDocumentScannerImpl.startEntity(Unknown 
> Source)
>   at 
> org.apache.xerces.impl.XMLVersionDetector.startDocumentParsing(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
>   at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2482)
>   at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2470)
>   at 
> org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2541)
>   at 
> org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2494)
>   at 
> org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2407)
>   at org.apache.hadoop.conf.Configuration.set(Configuration.java:1143)
>   at org.apache.hadoop.conf.Configuration.set(Configuration.java:1115)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopHiveConfigurations(SparkHadoopUtil.scala:456)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:427)
>   at 
> org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$2(SparkSubmit.scala:342)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:342)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loa

[jira] [Assigned] (SPARK-31044) Support foldable input by `schema_of_json`

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31044:
---

Assignee: Maxim Gekk

> Support foldable input by `schema_of_json`
> --
>
> Key: SPARK-31044
> URL: https://issues.apache.org/jira/browse/SPARK-31044
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Currently, the `schema_of_json()` function allows only a string literal as the 
> input. The ticket aims to support any foldable string expression.
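
A small sketch of what the change should allow (hedged: a concat of string 
literals stands in for an arbitrary foldable string expression):

{code}
// Before this change only a plain string literal is accepted; afterwards a
// foldable expression such as this concat should work as well.
spark.sql("""SELECT schema_of_json(concat('{"a": ', '1}'))""").show(false)
{code}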



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31025) Support foldable input by `schema_of_csv`

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31025:
---

Assignee: Maxim Gekk

> Support foldable input by `schema_of_csv` 
> --
>
> Key: SPARK-31025
> URL: https://issues.apache.org/jira/browse/SPARK-31025
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Currently, the `schema_of_csv()` function allows only a string literal as the 
> input. The ticket aims to support any foldable string expression.
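
A small sketch of the intended behaviour (hedged: a concat of string literals 
stands in for an arbitrary foldable string expression):

{code}
// Before this change only a plain string literal is accepted by schema_of_csv;
// afterwards a foldable expression such as this concat should work as well.
spark.sql("SELECT schema_of_csv(concat('1,', '3.14'))").show(false)
{code}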



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31025) Support foldable input by `schema_of_csv`

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31025.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27804
[https://github.com/apache/spark/pull/27804]

> Support foldable input by `schema_of_csv` 
> --
>
> Key: SPARK-31025
> URL: https://issues.apache.org/jira/browse/SPARK-31025
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently, the `schema_of_csv()` function allows only a string literal as the 
> input. The ticket aims to support any foldable string expression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31044) Support foldable input by `schema_of_json`

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31044.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27804
[https://github.com/apache/spark/pull/27804]

> Support foldable input by `schema_of_json`
> --
>
> Key: SPARK-31044
> URL: https://issues.apache.org/jira/browse/SPARK-31044
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently, the `schema_of_json()` function allows only a string literal as the 
> input. The ticket aims to support any foldable string expression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31023) Support foldable schemas by `from_json`

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31023.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27804
[https://github.com/apache/spark/pull/27804]

> Support foldable schemas by `from_json`
> ---
>
> Key: SPARK-31023
> URL: https://issues.apache.org/jira/browse/SPARK-31023
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently, Spark accepts only literals or schema_of_json w/ literal input as 
> the schema parameter of from_json. And it fails on any foldable expressions, 
> for instance:
> {code:sql}
> spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id 
> INT, dpt_org_city STRING', 'dpt_org_', ''));
> Error in query: Schema should be specified in DDL format as a string literal 
> or output of the schema_of_json function instead of replace('dpt_org_id INT, 
> dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7
> {code}
> There is no reason to restrict users to literals. The ticket aims to 
> support any foldable schema in from_json().
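
The query from the description, rerun as a sketch of what should succeed once 
foldable schemas are accepted (the replace(...) call folds to 'id INT, city STRING'):

{code}
// Taken from the ticket's own example; previously this failed with
// "Schema should be specified in DDL format as a string literal ...".
spark.sql(
  """SELECT from_json('{"id":1, "city":"Moscow"}',
    |  replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''))""".stripMargin).show(false)
{code}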



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31023) Support foldable schemas by `from_json`

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31023:
---

Assignee: Maxim Gekk

> Support foldable schemas by `from_json`
> ---
>
> Key: SPARK-31023
> URL: https://issues.apache.org/jira/browse/SPARK-31023
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Currently, Spark accepts only literals or schema_of_json w/ literal input as 
> the schema parameter of from_json. And it fails on any foldable expressions, 
> for instance:
> {code:sql}
> spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id 
> INT, dpt_org_city STRING', 'dpt_org_', ''));
> Error in query: Schema should be specified in DDL format as a string literal 
> or output of the schema_of_json function instead of replace('dpt_org_id INT, 
> dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7
> {code}
> There is no reason to restrict users to literals. The ticket aims to 
> support any foldable schema in from_json().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31020) Support foldable schemas by `from_csv`

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31020:
---

Assignee: Maxim Gekk

> Support foldable schemas by `from_csv`
> --
>
> Key: SPARK-31020
> URL: https://issues.apache.org/jira/browse/SPARK-31020
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Currently, Spark accepts only literals or schema_of_csv w/ literal input as 
> the schema parameter of from_csv. And it fails on any foldable expressions, 
> for instance:
> {code:sql}
> spark-sql> select from_csv('1, 3.14', replace('dpt_org_id INT, dpt_org_city 
> STRING', 'dpt_org_', ''));
> Error in query: Schema should be specified in DDL format as a string literal 
> or output of the schema_of_csv function instead of replace('dpt_org_id INT, 
> dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7
> {code}
> There is no reason to restrict users to literals. The ticket aims to 
> support any foldable schema in from_csv().
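
The query from the description, rerun as a sketch of what should succeed once 
foldable schemas are accepted (the replace(...) call folds to 'id INT, city STRING'):

{code}
// Taken from the ticket's own example; previously this failed with
// "Schema should be specified in DDL format as a string literal ...".
spark.sql(
  """SELECT from_csv('1, 3.14',
    |  replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''))""".stripMargin).show(false)
{code}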



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31020) Support foldable schemas by `from_csv`

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31020.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27804
[https://github.com/apache/spark/pull/27804]

> Support foldable schemas by `from_csv`
> --
>
> Key: SPARK-31020
> URL: https://issues.apache.org/jira/browse/SPARK-31020
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently, Spark accepts only literals or schema_of_csv w/ literal input as 
> the schema parameter of from_csv. And it fails on any foldable expressions, 
> for instance:
> {code:sql}
> spark-sql> select from_csv('1, 3.14', replace('dpt_org_id INT, dpt_org_city 
> STRING', 'dpt_org_', ''));
> Error in query: Schema should be specified in DDL format as a string literal 
> or output of the schema_of_csv function instead of replace('dpt_org_id INT, 
> dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7
> {code}
> There is no reason to restrict users to literals. The ticket aims to 
> support any foldable schema in from_csv().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31045) Add config for AQE logging level

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31045:
---

Assignee: Wei Xue

> Add config for AQE logging level
> 
>
> Key: SPARK-31045
> URL: https://issues.apache.org/jira/browse/SPARK-31045
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31045) Add config for AQE logging level

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31045.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27798
[https://github.com/apache/spark/pull/27798]

> Add config for AQE logging level
> 
>
> Key: SPARK-31045
> URL: https://issues.apache.org/jira/browse/SPARK-31045
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30886) Deprecate two-parameter TRIM/LTRIM/RTRIM functions

2020-03-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30886.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27643
[https://github.com/apache/spark/pull/27643]

> Deprecate two-parameter TRIM/LTRIM/RTRIM functions
> --
>
> Key: SPARK-30886
> URL: https://issues.apache.org/jira/browse/SPARK-30886
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> The Apache Spark community decided to keep the existing esoteric two-parameter 
> use cases, but with a proper warning. This JIRA aims to add that warning.
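
For reference, a hedged sketch of the forms that stay unaffected (assumes a plain 
spark-shell session; the exact deprecation message is defined in the PR and not 
reproduced here):

{code}
// The one-parameter and TRIM(BOTH ... FROM ...) spellings are the recommended
// forms; only the esoteric two-parameter signature triggers the new warning.
spark.sql("SELECT trim('  hi  '), trim(BOTH 'x' FROM 'xxhixx')").show(false)
// both expressions evaluate to 'hi'
{code}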



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30886) Deprecate two-parameter TRIM/LTRIM/RTRIM functions

2020-03-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30886:
-

Assignee: Dongjoon Hyun

> Deprecate two-parameter TRIM/LTRIM/RTRIM functions
> --
>
> Key: SPARK-30886
> URL: https://issues.apache.org/jira/browse/SPARK-30886
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> The Apache Spark community decided to keep the existing esoteric two-parameter 
> use cases, but with a proper warning. This JIRA aims to add that warning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31064) New Parquet Predicate Filter APIs with multi-part Identifier Support

2020-03-05 Thread DB Tsai (Jira)
DB Tsai created SPARK-31064:
---

 Summary: New Parquet Predicate Filter APIs with multi-part 
Identifier Support
 Key: SPARK-31064
 URL: https://issues.apache.org/jira/browse/SPARK-31064
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.5
Reporter: DB Tsai


Parquet's *org.apache.parquet.filter2.predicate.FilterApi* uses *dots* as 
separators to split a column name into the parts of a nested field. The 
drawback is that this causes issues when a field name itself contains a *dot*.

The new APIs will take an array of strings directly for the parts of a nested 
field, so there is no confusion from using a *dot* as a separator.

It's intended to move this code back to the Parquet community. See [PARQUET-1809]
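
A short sketch of the ambiguity in the existing dotted API (the column names are 
made up for illustration):

{code}
import org.apache.parquet.filter2.predicate.FilterApi

// With the dotted API these two intents produce the same dotted path "a.b":
val nestedField  = FilterApi.intColumn("a.b") // intended: field b nested inside struct a
val dottedColumn = FilterApi.intColumn("a.b") // intended: a top-level column literally named "a.b"
// A multi-part API taking, e.g., Array("a", "b") vs Array("a.b") removes the ambiguity.
{code}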



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31063) xml-apis is missing after xercesImpl bumped up to 2.12.0

2020-03-05 Thread Kent Yao (Jira)
Kent Yao created SPARK-31063:


 Summary: xml-apis is missing after xercesImpl bumped up to 2.12.0
 Key: SPARK-31063
 URL: https://issues.apache.org/jira/browse/SPARK-31063
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao



{code:java}
 ✘ kentyao@hulk  ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200305  
bin/spark-sql
20/03/06 11:05:58 WARN Utils: Your hostname, hulk.local resolves to a loopback 
address: 127.0.0.1; using 10.242.189.214 instead (on interface en0)
20/03/06 11:05:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/w3c/dom/ElementTraversal
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.xerces.parsers.AbstractDOMParser.startDocument(Unknown 
Source)
at org.apache.xerces.xinclude.XIncludeHandler.startDocument(Unknown 
Source)
at org.apache.xerces.impl.dtd.XMLDTDValidator.startDocument(Unknown 
Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl.startEntity(Unknown 
Source)
at 
org.apache.xerces.impl.XMLVersionDetector.startDocumentParsing(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2482)
at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2470)
at 
org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2541)
at 
org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2494)
at 
org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2407)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1143)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1115)
at 
org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopHiveConfigurations(SparkHadoopUtil.scala:456)
at 
org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:427)
at 
org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$2(SparkSubmit.scala:342)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:342)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 42 more

{code}
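As a quick sanity check (a hedged sketch, not part of the original report), the missing class can be probed from a Scala REPL or spark-shell running on the same classpath:

{code:scala}
// xml-apis provides org.w3c.dom.ElementTraversal; if the jar is absent from the
// classpath, this reproduces the same ClassNotFoundException as the stack trace above.
try {
  Class.forName("org.w3c.dom.ElementTraversal")
  println("org.w3c.dom.ElementTraversal found: xml-apis is on the classpath")
} catch {
  case _: ClassNotFoundException =>
    println("org.w3c.dom.ElementTraversal missing: xml-apis is not on the classpath")
}
{code}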





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master

2020-03-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31043.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

> Spark 3.0 built against hadoop2.7 can't start standalone master
> ---
>
> Key: SPARK-31043
> URL: https://issues.apache.org/jira/browse/SPARK-31043
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Critical
> Fix For: 3.0.0
>
>
> trying to start a standalone master when building spark branch 3.0 with 
> hadoop2.7 fails with:
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/w3c/dom/ElementTraversal
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at 
> [java.net|http://java.net/]
> .URLClassLoader.defineClass(URLClassLoader.java:468)
> at 
> [java.net|http://java.net/]
> .URLClassLoader.access$100(URLClassLoader.java:74)
> at 
> [java.net|http://java.net/]
> .URLClassLoader$1.run(URLClassLoader.java:369)
> ...
> Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
> at 
> [java.net|http://java.net/]
> .URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> ... 42 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master

2020-03-05 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052687#comment-17052687
 ] 

Hyukjin Kwon commented on SPARK-31043:
--

I believe it was fixed as of https://github.com/apache/spark/pull/27808

> Spark 3.0 built against hadoop2.7 can't start standalone master
> ---
>
> Key: SPARK-31043
> URL: https://issues.apache.org/jira/browse/SPARK-31043
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Critical
>
> trying to start a standalone master when building spark branch 3.0 with 
> hadoop2.7 fails with:
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/w3c/dom/ElementTraversal
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at 
> [java.net|http://java.net/]
> .URLClassLoader.defineClass(URLClassLoader.java:468)
> at 
> [java.net|http://java.net/]
> .URLClassLoader.access$100(URLClassLoader.java:74)
> at 
> [java.net|http://java.net/]
> .URLClassLoader$1.run(URLClassLoader.java:369)
> ...
> Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
> at 
> [java.net|http://java.net/]
> .URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> ... 42 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30914) Add version information to the configuration of UI

2020-03-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30914:


Assignee: jiaan.geng

> Add version information to the configuration of UI
> --
>
> Key: SPARK-30914
> URL: https://issues.apache.org/jira/browse/SPARK-30914
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> core/src/main/scala/org/apache/spark/internal/config/UI.scala



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30914) Add version information to the configuration of UI

2020-03-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30914.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27806
[https://github.com/apache/spark/pull/27806]

> Add version information to the configuration of UI
> --
>
> Key: SPARK-30914
> URL: https://issues.apache.org/jira/browse/SPARK-30914
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.1.0
>
>
> core/src/main/scala/org/apache/spark/internal/config/UI.scala
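For context, a hedged sketch of the kind of change involved: a `.version(...)` entry added to each ConfigBuilder in UI.scala. The config name, doc text, and version value below are illustrative; the real entries (with their `private[spark]` modifiers) live in the file referenced above.

{code:scala}
// Illustrative only (Spark-internal ConfigBuilder API): record the Spark version
// in which a UI config was introduced by chaining .version(...) onto its builder.
import org.apache.spark.internal.config.ConfigBuilder

val UI_ENABLED = ConfigBuilder("spark.ui.enabled")
  .doc("Whether to run the web UI for the SparkContext.")
  .version("1.1.1")   // illustrative value: the version the config first appeared in
  .booleanConf
  .createWithDefault(true)
{code}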



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31036) Use stringArgs in Expression.toString to respect hidden parameters

2020-03-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31036.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27788
[https://github.com/apache/spark/pull/27788]

> Use stringArgs in Expression.toString to respect hidden parameters
> --
>
> Key: SPARK-31036
> URL: https://issues.apache.org/jira/browse/SPARK-31036
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, on top of https://github.com/apache/spark/pull/27657, 
> {code}
> val identify = udf((input: Seq[Int]) => input)
> spark.range(10).select(identify(array("id"))).show()
> {code}
> shows hidden parameter `useStringTypeWhenEmpty`.
> {code}
> +-+
> |UDF(array(id, false))|
> +-+
> |  [0]|
> |  [1]|
> ...
> {code}
> This is a general problem and we should respect hidden parameters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31036) Use stringArgs in Expression.toString to respect hidden parameters

2020-03-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31036:


Assignee: Hyukjin Kwon

> Use stringArgs in Expression.toString to respect hidden parameters
> --
>
> Key: SPARK-31036
> URL: https://issues.apache.org/jira/browse/SPARK-31036
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> Currently, on top of https://github.com/apache/spark/pull/27657, 
> {code}
> val identify = udf((input: Seq[Int]) => input)
> spark.range(10).select(identify(array("id"))).show()
> {code}
> shows hidden parameter `useStringTypeWhenEmpty`.
> {code}
> +-+
> |UDF(array(id, false))|
> +-+
> |  [0]|
> |  [1]|
> ...
> {code}
> This is a general problem and we should respect hidden parameters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30563) Regressions in Join benchmarks

2020-03-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30563.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/27791

> Regressions in Join benchmarks
> --
>
> Key: SPARK-30563
> URL: https://issues.apache.org/jira/browse/SPARK-30563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Regenerated benchmark results in 
> https://github.com/apache/spark/pull/27078 show many regressions in 
> JoinBenchmark. The benchmarked queries slowed down by up to 3 times; see
> old results:
> https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dL10
> new results:
> https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dR10
> One of the differences is that the new queries use the `NoOp` datasource.
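For reference, a small hedged example of the `noop` sink mentioned above (available in Spark 3.0): it runs the whole query plan but discards the output rows, which is one reason benchmark timings shift when switching to it.

{code:scala}
// Writing to the built-in "noop" datasource executes the plan end to end but
// drops the rows, so only query execution time is measured.
spark.range(0, 1000000L)
  .selectExpr("id % 100 AS k", "id AS v")
  .groupBy("k")
  .count()
  .write
  .format("noop")
  .mode("overwrite")
  .save()
{code}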



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30930) ML, GraphX 3.0 QA: API: Experimental, DeveloperApi, final, sealed audit

2020-03-05 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052641#comment-17052641
 ] 

Huaxin Gao commented on SPARK-30930:


cc [~srowen],  [~podongfeng]
Developer API
Most Developer APIs are the basic components of the ML pipeline, such as 
Transformer, Model, Estimator, PipelineStage, Params, and Attributes; I guess we 
don't need to unmark any of them?

final class:
org.apache.spark.ml.classification.OneVsRest
org.apache.spark.ml.evaluation.RegressionEvaluator
org.apache.spark.ml.feature.Binarizer
org.apache.spark.ml.feature.Bucketizer
org.apache.spark.ml.feature.ChiSqSelector
org.apache.spark.ml.feature.IDF
org.apache.spark.ml.feature.QuantileDiscretizer
org.apache.spark.ml.feature.VectorSlicer
org.apache.spark.ml.feature.Word2Vec
org.apache.spark.ml.param.ParamMap
org.apache.spark.ml.fpm.PrefixSpan
Do we need to unmark any of these? 


> ML, GraphX 3.0 QA: API: Experimental, DeveloperApi, final, sealed audit
> ---
>
> Key: SPARK-30930
> URL: https://issues.apache.org/jira/browse/SPARK-30930
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Critical
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31062) Improve Spark Decommissioning K8s test reliability

2020-03-05 Thread Holden Karau (Jira)
Holden Karau created SPARK-31062:


 Summary: Improve Spark Decommissioning K8s test reliability
 Key: SPARK-31062
 URL: https://issues.apache.org/jira/browse/SPARK-31062
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Tests
Affects Versions: 3.1.0
Reporter: Holden Karau
Assignee: Holden Karau


The test currently flakes more often than the other Kubernetes tests. We can remove 
some of the timing that is a likely source of the flakiness.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31061) Impossible to change the provider of a table in the HiveMetaStore

2020-03-05 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-31061:
---

 Summary: Impossible to change the provider of a table in the 
HiveMetaStore
 Key: SPARK-31061
 URL: https://issues.apache.org/jira/browse/SPARK-31061
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


Currently, it's impossible to alter the datasource of a table in the 
HiveMetaStore by using alterTable, as the HiveExternalCatalog doesn't change 
the provider table property during an alterTable command. This is required to 
support changing table formats when using commands like REPLACE TABLE.
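A hedged spark-shell illustration of where the provider is recorded (the table name t_demo is only an example); the point of the ticket is that this value should be updatable when the table's format changes.

{code:scala}
import org.apache.spark.sql.catalyst.TableIdentifier

sql("CREATE TABLE t_demo (id INT) USING parquet")

// The data source is kept as 'provider' in the catalog's table metadata; altering
// the table's format requires this value to be rewritten as well.
spark.sessionState.catalog.getTableMetadata(TableIdentifier("t_demo")).provider
// => Some(parquet)
{code}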



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31058) Consolidate the implementation of quoteIfNeeded

2020-03-05 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-31058.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27814
[https://github.com/apache/spark/pull/27814]

> Consolidate the implementation of quoteIfNeeded
> ---
>
> Key: SPARK-31058
> URL: https://issues.apache.org/jira/browse/SPARK-31058
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> There are two implementations of quoteIfNeeded: one is in 
> *org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quote* and the 
> other is in *OrcFiltersBase.quoteAttributeNameIfNeeded*. This PR will 
> consolidate them into one.
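For readers unfamiliar with the helper, a minimal sketch of the behavior being consolidated; it mirrors the general idea of quoting an identifier part only when required, not the exact Spark implementation.

{code:scala}
// Wrap an identifier part in backticks only when it contains characters that would
// break parsing (a dot or a backtick), escaping any embedded backticks.
def quoteIfNeeded(part: String): String =
  if (part.contains(".") || part.contains("`")) {
    s"`${part.replace("`", "``")}`"
  } else {
    part
  }

quoteIfNeeded("col1")    // col1
quoteIfNeeded("a.b")     // `a.b`
quoteIfNeeded("wei`rd")  // `wei``rd`
{code}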



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31058) Consolidate the implementation of quoteIfNeeded

2020-03-05 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-31058:
---

Assignee: DB Tsai

> Consolidate the implementation of quoteIfNeeded
> ---
>
> Key: SPARK-31058
> URL: https://issues.apache.org/jira/browse/SPARK-31058
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> There are two implementations of quoteIfNeeded: one is in 
> *org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quote* and the 
> other is in *OrcFiltersBase.quoteAttributeNameIfNeeded*. This PR will 
> consolidate them into one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30930) ML, GraphX 3.0 QA: API: Experimental, DeveloperApi, final, sealed audit

2020-03-05 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052606#comment-17052606
 ] 

Huaxin Gao commented on SPARK-30930:


Sealed and Experimental are as follows. I don't think we need to do 
anything about these.
sealed:
org.apache.spark.ml.attribute.events.MLEvent
org.apache.spark.ml.attribute.Attribute
org.apache.spark.ml.attribute.AttributeType
org.apache.spark.ml.classification.LogisticRegressionTrainingSummary
org.apache.spark.ml.classification.BinaryLogisticRegressionSummary
org.apache.spark.ml.classification.LogisticRegressionTrainingSummary
org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary
org.apache.spark.ml.feature.Term
org.apache.spark.ml.feature.InteractableTerm
org.apache.spark.ml.optim.WeightedLeastSquares.Solver
org.apache.spark.ml.optim.NormalEquationSolver
org.apache.spark.ml.tree.Node
org.apache.spark.ml.tree.Split
org.apache.spark.ml.util.BaseReadWrite
org.apache.spark.ml.stat.SummaryBuilder
org.apache.spark.ml.stat.SummaryBuilderImpl.Metric
org.apache.spark.ml.stat.SummaryBuilderImpl.ComputeMetric

Experimental classes:
org.apache.spark.ml.evaluation.MultilabelClassificationEvaluator
org.apache.spark.ml.evaluation.RankingEvaluator
org.apache.spark.ml.MLEvent  // marked experimental in a @note ("This is 
experimental and unstable"). Do we need to use @Experimental instead?
Experimental methods:
org.apache.spark.ml.feature.LSH.approxNearestNeighbors


> ML, GraphX 3.0 QA: API: Experimental, DeveloperApi, final, sealed audit
> ---
>
> Key: SPARK-30930
> URL: https://issues.apache.org/jira/browse/SPARK-30930
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Critical
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31059) Spark's SQL "group by" local processing operator is broken.

2020-03-05 Thread Michail Giannakopoulos (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michail Giannakopoulos resolved SPARK-31059.

Resolution: Invalid

Invalid, guys; there is no duplicate "Product line - Order method type" key when 
presenting the results.
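For completeness: GROUP BY alone does not guarantee any particular row order in the result, so the original expectation was about presentation rather than correctness. If a stable order is wanted, an explicit ORDER BY provides it; the hedged sketch below reuses the view and column names from the report.

{code:scala}
spark.sql("""
  SELECT `Product line`, `Order method type`, sum(`Revenue`)
  FROM `gosales`
  GROUP BY `Product line`, `Order method type`
  ORDER BY `Product line`, `Order method type`
""").show(20, false)
{code}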

> Spark's SQL "group by" local processing operator is broken.
> ---
>
> Key: SPARK-31059
> URL: https://issues.apache.org/jira/browse/SPARK-31059
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3, 2.4.5
> Environment: Windows 10.
>Reporter: Michail Giannakopoulos
>Priority: Blocker
> Attachments: SampleFile_GOSales.csv
>
>
> When applying "GROUP BY" processing operator (without an "ORDER BY" clause), 
> I expect to see all the grouping columns being grouped together to the same 
> buckets. However, this is not the case.
> Steps to reproduce:
>  1. Start spark-shell as follows:
>  bin\spark-shell.cmd --master local[4] --conf 
> spark.sql.catalogImplementation=in-memory
>  2. Load the attached csv file:
>  val gosales = spark.read.format("csv").option("header", 
> "true").option("inferSchema", 
> "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
>  3. Create a temp view:
>  gosales.createOrReplaceTempView("gosales")
>  4. Execute the following sql statement:
>  spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM 
> `gosales` GROUP BY `Product line`, `Order method type`").show()
> Output: 
>  +---+--++
> |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
> +---+--++
> |Golf Equipment|E-mail|92.25|
> |Camping Equipment|Mail|0.0|
> |Camping Equipment|Fax|null|
> |Golf Equipment|Telephone|123.0|
> |Camping Equipment|Special|null|
> |Outdoor Protection|Telephone|34218.19|
> |Mountaineering Eq...|Mail|0.0|
> |Camping Equipment|Web|32469.03|
> |Personal Accessories|Fax|3318.7|
> |Golf Equipment|Sales visit|143.5|
> |Mountaineering Eq...|Telephone|null|
> |Mountaineering Eq...|E-mail|null|
> |Outdoor Protection|Sales visit|20522.42|
> |Outdoor Protection|Fax|5857.54|
> |Personal Accessories|E-mail|26679.6403|
> |Mountaineering Eq...|Fax|null|
> |Outdoor Protection|Web|340836.853|
> |Golf Equipment|Special|0.0|
> |Outdoor Protection|E-mail|28505.93|
> |Golf Equipment|Web|3034.0|
> +---+--++
> Expected output:
>  +---+--++
> |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
> +---+--++
> |Golf Equipment|E-mail|92.25|
> |Golf Equipment|Fax|null|
> |Golf Equipment|Mail|0.0|
> |Golf Equipment|Sales visit|143.5|
> |Golf Equipment|Special|0.0|
> |Golf Equipment|Telephone|123.0|
> |Golf Equipment|Web|3034.0|
> |Camping Equipment|E-mail|1303.3|
> |Camping Equipment|Fax|null|
> |Camping Equipment|Sales visit|4754.87|
> |Camping Equipment|Mail|0.0|
> |Camping Equipment|Special|null|
> |Camping Equipment|Telephone|5169.65|
> |Camping Equipment|Web|32469.03|
> |Mountaineering Eq...|E-mail|null|
> |Mountaineering Eq...|Fax|null|
> |Mountaineering Eq...|Mail|0.0|
> |Mountaineering Eq...|Special|null|
> |Mountaineering Eq...|Sales visit|null|
> |Mountaineering Eq...|Telephone|null|
> +---+--++
> Notice how in the expected output all the grouping columns should be bucketed 
> together without necessarily being in order, which is not the case with the 
> output that spark produces.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25722) Support a backtick character in column names

2020-03-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25722:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support a backtick character in column names
> 
>
> Key: SPARK-25722
> URL: https://issues.apache.org/jira/browse/SPARK-25722
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Among the built-in data sources, `avro` and `orc` don't allow a `backtick` in 
> column names. We had better be consistent if possible.
>  * Option 1: Support a backtick character
>  * Option 2: Disallow a backtick character (this may be considered a 
> regression for TEXT/CSV/JSON/Parquet)
>  So, Option 1 is better.
> *TEXT*, *CSV*, *JSON*, *PARQUET*
> {code:java}
> Seq("text", "csv", "json", "parquet").foreach { format =>
>   Seq("1").toDF("`").write.mode("overwrite").format(format).save("/tmp/t")
> }{code}
> *AVRO*
> {code:java}
> scala> 
> Seq("1").toDF("`").write.mode("overwrite").format("avro").save("/tmp/t")
> org.apache.avro.SchemaParseException: Illegal initial character: `{code}
> *ORC (native)*
> {code:java}
> scala> Seq("1").toDF("`").write.mode("overwrite").format("orc").save("/tmp/t")
> java.lang.IllegalArgumentException: Unmatched quote at 
> 'struct<^```:string>'{code}
> *ORC (hive)*
> {code:java}
> scala> Seq("1").toDF("`").write.mode("overwrite").format("orc").save("/tmp/t")
> java.lang.IllegalArgumentException: Error: name expected at the position 7 of 
> 'struct<`:string>' but '`' is found.{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31060) Handle column names containing `dots` in data source `Filter`

2020-03-05 Thread DB Tsai (Jira)
DB Tsai created SPARK-31060:
---

 Summary: Handle column names containing `dots` in data source 
`Filter`
 Key: SPARK-31060
 URL: https://issues.apache.org/jira/browse/SPARK-31060
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.5
Reporter: DB Tsai






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30961) Arrow enabled: to_pandas with date column fails

2020-03-05 Thread Kevin Appel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052464#comment-17052464
 ] 

Kevin Appel edited comment on SPARK-30961 at 3/5/20, 10:19 PM:
---

(python 3.6, pyarrow 0.8.0, pandas 0.21.0) or (python 3.7, pyarrow 0.11.1, 
pandas 0.24.1) are combinations I found that still work correctly for 
Date in both Spark 2.3 and Spark 2.4; in addition, all the examples listed in 
the pandas UDF Spark documentation also work with these setups


was (Author: kevinappel):
python 3.6, pyarrow 0.8.0, pandas 0.21.0 is a combination I found that is still 
working correctly for Date in both Spark 2.3 and Spark 2.4; in addition, all 
the examples listed in the pandas UDF Spark documentation also work with this 
setup

> Arrow enabled: to_pandas with date column fails
> ---
>
> Key: SPARK-30961
> URL: https://issues.apache.org/jira/browse/SPARK-30961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5
>Reporter: Nicolas Renkamp
>Priority: Major
>  Labels: ready-to-commit
>
> Hi,
> there seems to be a bug in the arrow enabled to_pandas conversion from spark 
> dataframe to pandas dataframe when the dataframe has a column of type 
> DateType. Here is a minimal example to reproduce the issue:
> {code:java}
> spark = SparkSession.builder.getOrCreate()
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> spark_df = spark.createDataFrame(
> [['2019-12-06']], 'created_at: string') \
> .withColumn('created_at', F.to_date('created_at'))
> # works
> spark_df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> # raises AttributeError: Can only use .dt accessor with datetimelike values
> # series is still of type object, .dt does not exist
> spark_df.toPandas(){code}
> A fix would be to modify the _check_series_convert_date function in 
> pyspark.sql.types to:
> {code:java}
> def _check_series_convert_date(series, data_type):
> """
> Cast the series to datetime.date if it's a date type, otherwise returns 
> the original series.
> :param series: pandas.Series
> :param data_type: a Spark data type for the series
> """
> from pyspark.sql.utils import require_minimum_pandas_version
> require_minimum_pandas_version()
> from pandas import to_datetime
> if type(data_type) == DateType:
> return to_datetime(series).dt.date
> else:
> return series
> {code}
> Let me know if I should prepare a Pull Request for the 2.4.5 branch.
> I have not tested the behavior on master branch.
>  
> Thanks,
> Nicolas



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31059) Spark's SQL "group by" local processing operator is broken.

2020-03-05 Thread Michail Giannakopoulos (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michail Giannakopoulos updated SPARK-31059:
---
Description: 
When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I 
expect to see all the grouping columns being grouped together to the same 
buckets. However, this is not the case.

Steps to reproduce:
 1. Start spark-shell as follows:
 bin\spark-shell.cmd --master local[4] --conf 
spark.sql.catalogImplementation=in-memory
 2. Load the attached csv file:
 val gosales = spark.read.format("csv").option("header", 
"true").option("inferSchema", 
"true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
 3. Create a temp view:
 gosales.createOrReplaceTempView("gosales")
 4. Execute the following sql statement:
 spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM 
`gosales` GROUP BY `Product line`, `Order method type`").show()

Output: 
 +---+--++
|Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|

+---+--++
|Golf Equipment|E-mail|92.25|
|Camping Equipment|Mail|0.0|
|Camping Equipment|Fax|null|
|Golf Equipment|Telephone|123.0|
|Camping Equipment|Special|null|
|Outdoor Protection|Telephone|34218.19|
|Mountaineering Eq...|Mail|0.0|
|Camping Equipment|Web|32469.03|
|Personal Accessories|Fax|3318.7|
|Golf Equipment|Sales visit|143.5|
|Mountaineering Eq...|Telephone|null|
|Mountaineering Eq...|E-mail|null|
|Outdoor Protection|Sales visit|20522.42|
|Outdoor Protection|Fax|5857.54|
|Personal Accessories|E-mail|26679.6403|
|Mountaineering Eq...|Fax|null|
|Outdoor Protection|Web|340836.853|
|Golf Equipment|Special|0.0|
|Outdoor Protection|E-mail|28505.93|
|Golf Equipment|Web|3034.0|

+---+--++

Expected output:
 +---+--++
|Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|

+---+--++
|Golf Equipment|E-mail|92.25|
|Golf Equipment|Fax|null|
|Golf Equipment|Mail|0.0|
|Golf Equipment|Sales visit|143.5|
|Golf Equipment|Special|0.0|
|Golf Equipment|Telephone|123.0|
|Golf Equipment|Web|3034.0|
|Camping Equipment|E-mail|1303.3|
|Camping Equipment|Fax|null|
|Camping Equipment|Sales visit|4754.87|
|Camping Equipment|Mail|0.0|
|Camping Equipment|Special|null|
|Camping Equipment|Telephone|5169.65|
|Camping Equipment|Web|32469.03|
|Mountaineering Eq...|E-mail|null|
|Mountaineering Eq...|Fax|null|
|Mountaineering Eq...|Mail|0.0|
|Mountaineering Eq...|Special|null|
|Mountaineering Eq...|Sales visit|null|
|Mountaineering Eq...|Telephone|null|

+---+--++

Notice how in the expected output all the grouping columns should be bucketed 
together without necessarily being in order, which is not the case with the 
output that spark produces.

  was:
When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I 
expect to see all the grouping columns being grouped together to the same 
buckets. However, this is not the case.

Steps to reproduce:
 1. Start spark-shell as follows:
 bin\spark-shell.cmd --master local[4] --conf 
spark.sql.catalogImplementation=in-memory
 2. Load the attached csv file:
 val gosales = spark.read.format("csv").option("header", 
"true").option("inferSchema", 
"true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
 3. Create a temp view:
 gosales.createOrReplaceTempView("gosales")
 4. Execute the following sql statement:
 spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM 
`gosales` GROUP BY `Product line`, `Order method type`").show()

Output: 
 +--+---++
|Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|

+--+---++
|Golf Equipment|E-mail|92.25|
|Camping Equipment|Mail|0.0|
|Camping Equipment|Fax|null|
|Golf Equipment|Telephone|123.0|
|Camping Equipment|Special|null|
|Outdoor Protection|Telephone|34218.19|
|Mountaineering Eq...|Mail|0.0|
|Camping Equipment|Web|32469.03|
|Personal Accessories|Fax|3318.7|
|Golf Equipment|Sales visit|143.5|
|Mountaineering Eq...|Telephone|null|
|Mountaineering Eq...|E-mail|null|
|Outdoor Protection|Sales visit|20522.42|
|Outdoor Protection|Fax|5857.54|
|Personal Accessories|E-mail|26679.6403|
|Mountaineering Eq...|Fax|null|
|Outdoor Protection|Web|340836.853|
|Golf Equipment|Special|0.0|
|Outdoor Protection|E-mail|28505.93|
|Golf Equipment|Web|3034.0|

+--+---++

Expected output:
 +--+---++---

[jira] [Updated] (SPARK-31059) Spark's SQL "group by" local processing operator is broken.

2020-03-05 Thread Michail Giannakopoulos (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michail Giannakopoulos updated SPARK-31059:
---
Description: 
When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I 
expect to see all the grouping columns being grouped together to the same 
buckets. However, this is not the case.

Steps to reproduce:
 1. Start spark-shell as follows:
 bin\spark-shell.cmd --master local[4] --conf 
spark.sql.catalogImplementation=in-memory
 2. Load the attached csv file:
 val gosales = spark.read.format("csv").option("header", 
"true").option("inferSchema", 
"true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
 3. Create a temp view:
 gosales.createOrReplaceTempView("gosales")
 4. Execute the following sql statement:
 spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM 
`gosales` GROUP BY `Product line`, `Order method type`").show()

Output: 
 +--+---++
|Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|

+--+---++
|Golf Equipment|E-mail|92.25|
|Camping Equipment|Mail|0.0|
|Camping Equipment|Fax|null|
|Golf Equipment|Telephone|123.0|
|Camping Equipment|Special|null|
|Outdoor Protection|Telephone|34218.19|
|Mountaineering Eq...|Mail|0.0|
|Camping Equipment|Web|32469.03|
|Personal Accessories|Fax|3318.7|
|Golf Equipment|Sales visit|143.5|
|Mountaineering Eq...|Telephone|null|
|Mountaineering Eq...|E-mail|null|
|Outdoor Protection|Sales visit|20522.42|
|Outdoor Protection|Fax|5857.54|
|Personal Accessories|E-mail|26679.6403|
|Mountaineering Eq...|Fax|null|
|Outdoor Protection|Web|340836.853|
|Golf Equipment|Special|0.0|
|Outdoor Protection|E-mail|28505.93|
|Golf Equipment|Web|3034.0|

+--+---++

Expected output:
 +--+---++
|Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|

+--+---++
|Golf Equipment|E-mail|92.25|
|Golf Equipment|Fax|null|
|Golf Equipment|Mail|0.0|
|Golf Equipment|Sales visit|143.5|
|Golf Equipment|Special|0.0|
|Golf Equipment|Telephone|123.0|
|Golf Equipment|Web|3034.0|
|Camping Equipment|E-mail|1303.3|
|Camping Equipment|Fax|null|
|Camping Equipment|Sales visit|4754.87|
|Camping Equipment|Mail|0.0|
|Camping Equipment|Special|null|
|Camping Equipment|Telephone|5169.65|
|Camping Equipment|Web|32469.03|
|Mountaineering Eq...|E-mail|null|
|Mountaineering Eq...|Fax|null|
|Mountaineering Eq...|Mail|0.0|
|Mountaineering Eq...|Special|null|
|Mountaineering Eq...|Sales visit|null|
|Mountaineering Eq...|Telephone|null|

+--+---++

Notice how in the expected output all the grouping columns should be bucketed 
together without necessarily being in order, which is not the case with the 
output that Spark produces.

  was:
When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I 
expect to see all the grouping columns being grouped together to the same 
buckets. However, this is not the case.

Steps to reproduce:
 1. Start spark-shell as follows:
 bin\spark-shell.cmd --master local[4] --conf 
spark.sql.catalogImplementation=in-memory
 2. Load the attached csv file:
 val gosales = spark.read.format("csv").option("header", 
"true").option("inferSchema", 
"true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
 3. Create a temp view:
 gosales.createOrReplaceTempView("gosales")
 4. Execute the following sql statement:
 spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM 
`gosales` GROUP BY `Product line`, `Order method type`").show()

Output: 
 +-+++
|Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|

+-+++
|Golf Equipment|E-mail|92.25|
|Camping Equipment|Mail|0.0|
|Camping Equipment|Fax|null|
|Golf Equipment|Telephone|123.0|
|Camping Equipment|Special|null|
|Outdoor Protection|Telephone|34218.19|
|Mountaineering Eq...|Mail|0.0|
|Camping Equipment|Web|32469.03|
|Personal Accessories|Fax|3318.7|
|Golf Equipment|Sales visit|143.5|
|Mountaineering Eq...|Telephone|null|
|Mountaineering Eq...|E-mail|null|
|Outdoor Protection|Sales visit|20522.42|
|Outdoor Protection|Fax|5857.54|
|Personal Accessories|E-mail|26679.6403|
|Mountaineering Eq...|Fax|null|
|Outdoor Protection|Web|340836.853|
|Golf Equipment|Special|0.0|
|Outdoor Protection|E-mail|28505.93|
|Golf Equipment|Web|3034.0|

+-+++

Expected output:
 +-+++---

[jira] [Updated] (SPARK-31059) Spark's SQL "group by" local processing operator is broken.

2020-03-05 Thread Michail Giannakopoulos (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michail Giannakopoulos updated SPARK-31059:
---
Attachment: SampleFile_GOSales.csv

> Spark's SQL "group by" local processing operator is broken.
> ---
>
> Key: SPARK-31059
> URL: https://issues.apache.org/jira/browse/SPARK-31059
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3, 2.4.5
> Environment: Windows 10.
>Reporter: Michail Giannakopoulos
>Priority: Blocker
> Attachments: SampleFile_GOSales.csv
>
>
> When applying "GROUP BY" processing operator (without an "ORDER BY" clause), 
> I expect to see all the grouping columns being grouped together to the same 
> buckets. However, this is not the case.
> Steps to reproduce:
>  1. Start spark-shell as follows:
>  bin\spark-shell.cmd --master local[4] --conf 
> spark.sql.catalogImplementation=in-memory
>  2. Load the attached csv file:
>  val gosales = spark.read.format("csv").option("header", 
> "true").option("inferSchema", 
> "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
>  3. Create a temp view:
>  gosales.createOrReplaceTempView("gosales")
>  4. Execute the following sql statement:
>  spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM 
> `gosales` GROUP BY `Product line`, `Order method type`").show()
> Output: 
>  +-+++
> |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
> +-+++
> |Golf Equipment|E-mail|92.25|
> |Camping Equipment|Mail|0.0|
> |Camping Equipment|Fax|null|
> |Golf Equipment|Telephone|123.0|
> |Camping Equipment|Special|null|
> |Outdoor Protection|Telephone|34218.19|
> |Mountaineering Eq...|Mail|0.0|
> |Camping Equipment|Web|32469.03|
> |Personal Accessories|Fax|3318.7|
> |Golf Equipment|Sales visit|143.5|
> |Mountaineering Eq...|Telephone|null|
> |Mountaineering Eq...|E-mail|null|
> |Outdoor Protection|Sales visit|20522.42|
> |Outdoor Protection|Fax|5857.54|
> |Personal Accessories|E-mail|26679.6403|
> |Mountaineering Eq...|Fax|null|
> |Outdoor Protection|Web|340836.853|
> |Golf Equipment|Special|0.0|
> |Outdoor Protection|E-mail|28505.93|
> |Golf Equipment|Web|3034.0|
> +-+++
> Expected output:
>  +-+++
> |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
> +-+++
> |Golf Equipment|E-mail|92.25|
> |Golf Equipment|Fax|null|
> |Golf Equipment|Mail|0.0|
> |Golf Equipment|Sales visit|143.5|
> |Golf Equipment|Special|0.0|
> |Golf Equipment|Telephone|123.0|
> |Golf Equipment|Web|3034.0|
> |Camping Equipment|E-mail|1303.3|
> |Camping Equipment|Fax|null|
> |Camping Equipment|Sales visit|4754.87|
> |Camping Equipment|Mail|0.0|
> |Camping Equipment|Special|null|
> |Camping Equipment|Telephone|5169.65|
> |Camping Equipment|Web|32469.03|
> |Mountaineering Eq...|E-mail|null|
> |Mountaineering Eq...|Fax|null|
> |Mountaineering Eq...|Mail|0.0|
> |Mountaineering Eq...|Special|null|
> |Mountaineering Eq...|Sales visit|null|
> |Mountaineering Eq...|Telephone|null|
> +-+++
> Notice how in the expected output all the grouping columns should be bucketed 
> together without necessarily being in order, which is not the case with the 
> output that Spark produces.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31059) Spark's SQL "group by" local processing operator is broken.

2020-03-05 Thread Michail Giannakopoulos (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michail Giannakopoulos updated SPARK-31059:
---
Description: 
When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I 
expect to see all the grouping columns being grouped together to the same 
buckets. However, this is not the case.

Steps to reproduce:
 1. Start spark-shell as follows:
 bin\spark-shell.cmd --master local[4] --conf 
spark.sql.catalogImplementation=in-memory
 2. Load the attached csv file:
 val gosales = spark.read.format("csv").option("header", 
"true").option("inferSchema", 
"true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
 3. Create a temp view:
 gosales.createOrReplaceTempView("gosales")
 4. Execute the following sql statement:
 spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM 
`gosales` GROUP BY `Product line`, `Order method type`").show()

Output: 
 +-+++
|Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|

+-+++
|Golf Equipment|E-mail|92.25|
|Camping Equipment|Mail|0.0|
|Camping Equipment|Fax|null|
|Golf Equipment|Telephone|123.0|
|Camping Equipment|Special|null|
|Outdoor Protection|Telephone|34218.19|
|Mountaineering Eq...|Mail|0.0|
|Camping Equipment|Web|32469.03|
|Personal Accessories|Fax|3318.7|
|Golf Equipment|Sales visit|143.5|
|Mountaineering Eq...|Telephone|null|
|Mountaineering Eq...|E-mail|null|
|Outdoor Protection|Sales visit|20522.42|
|Outdoor Protection|Fax|5857.54|
|Personal Accessories|E-mail|26679.6403|
|Mountaineering Eq...|Fax|null|
|Outdoor Protection|Web|340836.853|
|Golf Equipment|Special|0.0|
|Outdoor Protection|E-mail|28505.93|
|Golf Equipment|Web|3034.0|

+-+++

Expected output:
 +-+++
|Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|

+-+++
|Golf Equipment|E-mail|92.25|
|Golf Equipment|Fax|null|
|Golf Equipment|Mail|0.0|
|Golf Equipment|Sales visit|143.5|
|Golf Equipment|Special|0.0|
|Golf Equipment|Telephone|123.0|
|Golf Equipment|Web|3034.0|
|Camping Equipment|E-mail|1303.3|
|Camping Equipment|Fax|null|
|Camping Equipment|Sales visit|4754.87|
|Camping Equipment|Mail|0.0|
|Camping Equipment|Special|null|
|Camping Equipment|Telephone|5169.65|
|Camping Equipment|Web|32469.03|
|Mountaineering Eq...|E-mail|null|
|Mountaineering Eq...|Fax|null|
|Mountaineering Eq...|Mail|0.0|
|Mountaineering Eq...|Special|null|
|Mountaineering Eq...|Sales visit|null|
|Mountaineering Eq...|Telephone|null|

+-+++

Notice how in the expected output all the grouping columns should be bucketed 
together without necessarily being in order, which is not the case with the 
output that Spark produces.

  was:
When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I 
expect to see all the grouping columns being grouped together to the same 
buckets. However, this is not the case.

Steps to reproduce:
1. Start spark-shell as follows:
bin\spark-shell.cmd --master local[4] --conf 
spark.sql.catalogImplementation=in-memory
2. Load the attached csv file:
val gosales = spark.read.format("csv").option("header", 
"true").option("inferSchema", 
"true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
3. Create a temp view:
gosales.createOrReplaceTempView("gosales")
4. Execute the following sql statement:
spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM 
`gosales` GROUP BY `Product line`, `Order method type`").show()

Output: 
++-++
| Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
++-++
| Golf Equipment| E-mail| 92.25|
| Camping Equipment| Mail| 0.0|
| Camping Equipment| Fax| null|
| Golf Equipment| Telephone| 123.0|
| Camping Equipment| Special| null|
| Outdoor Protection| Telephone| 34218.19|
|Mountaineering Eq...| Mail| 0.0|
| Camping Equipment| Web| 32469.03|
|Personal Accessories| Fax| 3318.7|
| Golf Equipment| Sales visit| 143.5|
|Mountaineering Eq...| Telephone| null|
|Mountaineering Eq...| E-mail| null|
| Outdoor Protection| Sales visit| 20522.42|
| Outdoor Protection| Fax| 5857.54|
|Personal Accessories| E-mail| 26679.6403|
|Mountaineering Eq...| Fax| null|
| Outdoor Protection| Web| 340836.853|
| Golf Equipment| Special| 0.0|
| Outdoor Protection| E-mail| 28505.93|
| Golf Equipment| Web| 3034.0|
++-++

Expected output:
+--

[jira] [Created] (SPARK-31059) Spark's SQL "group by" local processing operator is broken.

2020-03-05 Thread Michail Giannakopoulos (Jira)
Michail Giannakopoulos created SPARK-31059:
--

 Summary: Spark's SQL "group by" local processing operator is 
broken.
 Key: SPARK-31059
 URL: https://issues.apache.org/jira/browse/SPARK-31059
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5, 2.4.3
 Environment: Windows 10.
Reporter: Michail Giannakopoulos


When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I 
expect to see all the grouping columns being grouped together to the same 
buckets. However, this is not the case.

Steps to reproduce:
1. Start spark-shell as follows:
bin\spark-shell.cmd --master local[4] --conf 
spark.sql.catalogImplementation=in-memory
2. Load the attached csv file:
val gosales = spark.read.format("csv").option("header", 
"true").option("inferSchema", 
"true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
3. Create a temp view:
gosales.createOrReplaceTempView("gosales")
4. Execute the following sql statement:
spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM 
`gosales` GROUP BY `Product line`, `Order method type`").show()

Output: 
++-++
| Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
++-++
| Golf Equipment| E-mail| 92.25|
| Camping Equipment| Mail| 0.0|
| Camping Equipment| Fax| null|
| Golf Equipment| Telephone| 123.0|
| Camping Equipment| Special| null|
| Outdoor Protection| Telephone| 34218.19|
|Mountaineering Eq...| Mail| 0.0|
| Camping Equipment| Web| 32469.03|
|Personal Accessories| Fax| 3318.7|
| Golf Equipment| Sales visit| 143.5|
|Mountaineering Eq...| Telephone| null|
|Mountaineering Eq...| E-mail| null|
| Outdoor Protection| Sales visit| 20522.42|
| Outdoor Protection| Fax| 5857.54|
|Personal Accessories| E-mail| 26679.6403|
|Mountaineering Eq...| Fax| null|
| Outdoor Protection| Web| 340836.853|
| Golf Equipment| Special| 0.0|
| Outdoor Protection| E-mail| 28505.93|
| Golf Equipment| Web| 3034.0|
++-++

Expected output:
++-++
| Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
++-++
| Golf Equipment| E-mail| 92.25|
| Golf Equipment| Fax| null|
| Golf Equipment| Mail| 0.0|
| Golf Equipment| Sales visit| 143.5|
| Golf Equipment| Special| 0.0|
| Golf Equipment| Telephone| 123.0|
| Golf Equipment| Web| 3034.0|
| Camping Equipment| E-mail| 1303.3|
| Camping Equipment| Fax| null|
| Camping Equipment| Sales visit| 4754.87|
| Camping Equipment| Mail| 0.0|
| Camping Equipment| Special| null|
| Camping Equipment| Telephone| 5169.65|
| Camping Equipment| Web| 32469.03|
|Mountaineering Eq...| E-mail| null|
|Mountaineering Eq...| Fax| null|
|Mountaineering Eq...| Mail| 0.0|
|Mountaineering Eq...| Special| null|
|Mountaineering Eq...| Sales visit| null|
|Mountaineering Eq...| Telephone| null|
++-++

Notice how all the grouping columns should be bucketed together without being 
in order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20901) Feature parity for ORC with Parquet

2020-03-05 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052476#comment-17052476
 ] 

Felix Kizhakkel Jose commented on SPARK-20901:
--

Thank you [~dongjoon]. 

> Feature parity for ORC with Parquet
> ---
>
> Key: SPARK-20901
> URL: https://issues.apache.org/jira/browse/SPARK-20901
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to track the feature parity for ORC with Parquet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20901) Feature parity for ORC with Parquet

2020-03-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052473#comment-17052473
 ] 

Dongjoon Hyun commented on SPARK-20901:
---

Hi, [~FelixKJose]. The Apache Spark community is trying to provide a seamless user 
experience, and this issue aims to track those kinds of differences. You may want 
to link more items if you want. Basically, the pace of new feature development 
differs between Apache Parquet and ORC. For example, the ORC bloom filter is 
already usable from Apache Spark, but the Parquet bloom filter is not applicable 
yet. For ZStandard support, it's the opposite. Please note that Apache Spark also 
hasn't consumed the latest versions yet (Apache ORC 1.6.2 and Apache Parquet 
1.11.0). The difference table changes from time to time. You had better 
investigate both in your use cases.
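As one concrete example (hedged: the option names come from ORC's writer configuration and the column name and path are illustrative), ORC bloom filters can already be requested from Spark at write time:

{code:scala}
spark.range(100)
  .write
  .format("orc")
  .option("orc.bloom.filter.columns", "id")  // build a bloom filter for column "id"
  .option("orc.bloom.filter.fpp", "0.05")    // target false-positive probability
  .mode("overwrite")
  .save("/tmp/orc_bloom_demo")
{code}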

> Feature parity for ORC with Parquet
> ---
>
> Key: SPARK-20901
> URL: https://issues.apache.org/jira/browse/SPARK-20901
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to track the feature parity for ORC with Parquet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30961) Arrow enabled: to_pandas with date column fails

2020-03-05 Thread Kevin Appel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052464#comment-17052464
 ] 

Kevin Appel commented on SPARK-30961:
-

python 3.6, pyarrow 0.8.0, pandas 0.21.0 is a combination I found that is still 
working correctly for Date in both Spark 2.3 and Spark 2.4; in addition, all 
the examples listed in the pandas UDF Spark documentation also work with this 
setup

> Arrow enabled: to_pandas with date column fails
> ---
>
> Key: SPARK-30961
> URL: https://issues.apache.org/jira/browse/SPARK-30961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5
>Reporter: Nicolas Renkamp
>Priority: Major
>  Labels: ready-to-commit
>
> Hi,
> there seems to be a bug in the arrow enabled to_pandas conversion from spark 
> dataframe to pandas dataframe when the dataframe has a column of type 
> DateType. Here is a minimal example to reproduce the issue:
> {code:java}
> spark = SparkSession.builder.getOrCreate()
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> spark_df = spark.createDataFrame(
> [['2019-12-06']], 'created_at: string') \
> .withColumn('created_at', F.to_date('created_at'))
> # works
> spark_df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> # raises AttributeError: Can only use .dt accessor with datetimelike values
> # series is still of type object, .dt does not exist
> spark_df.toPandas(){code}
> A fix would be to modify the _check_series_convert_date function in 
> pyspark.sql.types to:
> {code:java}
> def _check_series_convert_date(series, data_type):
> """
> Cast the series to datetime.date if it's a date type, otherwise returns 
> the original series.
> :param series: pandas.Series
> :param data_type: a Spark data type for the series
> """
> from pyspark.sql.utils import require_minimum_pandas_version
> require_minimum_pandas_version()
> from pandas import to_datetime
> if type(data_type) == DateType:
> return to_datetime(series).dt.date
> else:
> return series
> {code}
> Let me know if I should prepare a Pull Request for the 2.4.5 branch.
> I have not tested the behavior on master branch.
>  
> Thanks,
> Nicolas



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31058) Consolidate the implementation of quoteIfNeeded

2020-03-05 Thread DB Tsai (Jira)
DB Tsai created SPARK-31058:
---

 Summary: Consolidate the implementation of quoteIfNeeded
 Key: SPARK-31058
 URL: https://issues.apache.org/jira/browse/SPARK-31058
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.5
Reporter: DB Tsai


There are two implementations of quoteIfNeeded: one is in 
*org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quote* and the other 
is in *OrcFiltersBase.quoteAttributeNameIfNeeded*. This PR will consolidate them 
into one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31052) Fix flaky test "DAGSchedulerSuite.shuffle fetch failed on speculative task, but original task succeed"

2020-03-05 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang resolved SPARK-31052.
--
   Fix Version/s: 3.0.0
Target Version/s: 3.0.0
Assignee: wuyi
  Resolution: Fixed

Fixed by https://github.com/apache/spark/pull/27809

> Fix flaky test "DAGSchedulerSuite.shuffle fetch failed on speculative task, 
> but original task succeed"
> --
>
> Key: SPARK-31052
> URL: https://issues.apache.org/jira/browse/SPARK-31052
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Test "shuffle fetch failed on speculative task, but original task succeed" is 
> flaky.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31052) Fix flaky test "DAGSchedulerSuite.shuffle fetch failed on speculative task, but original task succeed"

2020-03-05 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang updated SPARK-31052:
-
Summary: Fix flaky test "DAGSchedulerSuite.shuffle fetch failed on 
speculative task, but original task succeed"  (was: Fix flaky test of 
SPARK-30388)

> Fix flaky test "DAGSchedulerSuite.shuffle fetch failed on speculative task, 
> but original task succeed"
> --
>
> Key: SPARK-31052
> URL: https://issues.apache.org/jira/browse/SPARK-31052
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: wuyi
>Priority: Major
>
> Test "shuffle fetch failed on speculative task, but original task succeed" is 
> flaky.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30886) Deprecate two-parameter TRIM/LTRIM/RTRIM functions

2020-03-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30886:
--
Summary: Deprecate two-parameter TRIM/LTRIM/RTRIM functions  (was: 
Deprecate two-parameter TRIM/LTRIM/RTRIM function)

> Deprecate two-parameter TRIM/LTRIM/RTRIM functions
> --
>
> Key: SPARK-30886
> URL: https://issues.apache.org/jira/browse/SPARK-30886
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> The Apache Spark community decided to keep the existing esoteric two-parameter 
> use cases with a proper warning. This JIRA aims to show the warning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30886) Deprecate two-parameter TRIM/LTRIM/RTRIM function

2020-03-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30886:
--
Summary: Deprecate two-parameter TRIM/LTRIM/RTRIM function  (was: Deprecate 
two-parameter TRIM function)

> Deprecate two-parameter TRIM/LTRIM/RTRIM function
> -
>
> Key: SPARK-30886
> URL: https://issues.apache.org/jira/browse/SPARK-30886
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> The Apache Spark community decided to keep the existing esoteric two-parameter 
> use cases with a proper warning. This JIRA aims to show the warning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31057) approxQuantile function of spark , not taking List as first parameter

2020-03-05 Thread Shyam (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shyam updated SPARK-31057:
--
Shepherd: Sean R. Owen

> approxQuantile function  of spark , not taking List as first parameter
> --
>
> Key: SPARK-31057
> URL: https://issues.apache.org/jira/browse/SPARK-31057
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.1, 2.4.3
> Environment: spark-sql-2.4.3 v and eclipse neon ide.
>  
>Reporter: Shyam
>Priority: Major
>
> I am using spark-sql-2.4.1 in my project with Java 8.
> I need to calculate quantiles on some of the (calculated) columns (i.e. 
> con_dist_1, con_dist_2) of the dataframe df given below.
> {{List calcColmns = Arrays.asList("con_dist_1","con_dist_2")}}
> When I try to use the first version of approxQuantile, i.e. 
> approxQuantile(List, List, double), as below
> Dataset df = //dataset
> {{List> quants = df.stat().approxQuantile(calcColmns , 
> Array(0.0,0.1,0.5),0.0);}}
> *it gives the error:*
> {quote}The method approxQuantile(String, double[], double) in the type 
> DataFrameStatFunctions is not applicable for the arguments (List, List, 
> double)
> {quote}
> So what is wrong here? I am doing this in my Eclipse IDE. Why is the List 
> overload not being invoked even though I am passing a List?
> I would really appreciate any help on this.
> More details are available here:
> [https://stackoverflow.com/questions/60550152/issue-with-approxquantile-of-spark-not-recognizing-liststring]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31057) approxQuantile function of spark , not taking List as first parameter

2020-03-05 Thread Shyam (Jira)
Shyam created SPARK-31057:
-

 Summary: approxQuantile function  of spark , not taking 
List as first parameter
 Key: SPARK-31057
 URL: https://issues.apache.org/jira/browse/SPARK-31057
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.3, 2.4.1
 Environment: spark-sql-2.4.3 v and eclipse neon ide.

 
Reporter: Shyam


I am using spark-sql-2.4.1 in my project with Java 8.

I need to calculate quantiles on some of the (calculated) columns (i.e. 
con_dist_1, con_dist_2) of the dataframe df given below.
{{List calcColmns = Arrays.asList("con_dist_1","con_dist_2")}}
When I try to use the first version of approxQuantile, i.e. 
approxQuantile(List, List, double), as below

Dataset df = //dataset
{{List> quants = df.stat().approxQuantile(calcColmns , 
Array(0.0,0.1,0.5),0.0);}}
*it gives the error:*
{quote}The method approxQuantile(String, double[], double) in the type 
DataFrameStatFunctions is not applicable for the arguments (List, List, double)
{quote}
So what is wrong here? I am doing this in my Eclipse IDE. Why is the List 
overload not being invoked even though I am passing a List?

I would really appreciate any help on this.

More details are available here:

[https://stackoverflow.com/questions/60550152/issue-with-approxquantile-of-spark-not-recognizing-liststring]
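
For reference, a minimal sketch of a call that matches the existing Scala/Java signature, which takes arrays rather than java.util.List; it assumes a DataFrame {{df}} with the two columns from the report:
{code:scala}
// Sketch only: assumes `df` is an existing DataFrame containing con_dist_1 and con_dist_2.
val cols = Array("con_dist_1", "con_dist_2")
val probabilities = Array(0.0, 0.1, 0.5)

// approxQuantile(Array[String], Array[Double], Double) returns one Array[Double]
// of quantiles per input column.
val quantiles: Array[Array[Double]] = df.stat.approxQuantile(cols, probabilities, 0.0)
{code}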



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20901) Feature parity for ORC with Parquet

2020-03-05 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052384#comment-17052384
 ] 

Felix Kizhakkel Jose commented on SPARK-20901:
--

Hi [~dongjoon],
I was trying to choose between ORC and Parquet formats (using AWS Glue /Spark). 
While researching I came across this parity feature SPARK-20901.
What features are not implemented  for ORC to have complete feature parity as 
Parquet in Spark? Here I could see everything (issues linked) listed  here are 
either Resolved or Closed, so I am confused. Could you please provide some 
insights?

> Feature parity for ORC with Parquet
> ---
>
> Key: SPARK-20901
> URL: https://issues.apache.org/jira/browse/SPARK-20901
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to track the feature parity for ORC with Parquet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host

2020-03-05 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052348#comment-17052348
 ] 

Attila Zsolt Piros commented on SPARK-27651:


Well, in case of dynamic allocation and a recalculation, the executors could 
already be gone.

> Avoid the network when block manager fetches shuffle blocks from the same host
> --
>
> Key: SPARK-27651
> URL: https://issues.apache.org/jira/browse/SPARK-27651
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.0
>
>
> When a shuffle block (content) is fetched the network is always used even 
> when it is fetched from the external shuffle service running on the same 
> host. This can be avoided by getting the local directories of the same host 
> executors from the external shuffle service and accessing those blocks from 
> the disk directly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30886) Deprecate two-parameter TRIM function

2020-03-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30886:
--
Summary: Deprecate two-parameter TRIM function  (was: Deprecate LTRIM, 
RTRIM, and two-parameter TRIM functions)

> Deprecate two-parameter TRIM function
> -
>
> Key: SPARK-30886
> URL: https://issues.apache.org/jira/browse/SPARK-30886
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> The Apache Spark community decided to keep the existing esoteric two-parameter 
> use cases with a proper warning. This JIRA aims to show the warning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31056) Add CalendarIntervals division

2020-03-05 Thread Enrico Minack (Jira)
Enrico Minack created SPARK-31056:
-

 Summary: Add CalendarIntervals division
 Key: SPARK-31056
 URL: https://issues.apache.org/jira/browse/SPARK-31056
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Enrico Minack


{{CalendarInterval}} should be allowed for division. A {{CalendarInterval}} 
consists of three time components: {{months}}, {{days}} and {{microseconds}}. 
The division can only be defined between intervals that each have a single 
non-zero time component, and where both intervals have the same non-zero time 
component; otherwise the division expression would be ambiguous.

This allows evaluating the magnitude of a {{CalendarInterval}} in SQL 
expressions:
{code}
Seq((Timestamp.valueOf("2020-02-01 12:00:00"), Timestamp.valueOf("2020-02-01 
13:30:25")))
  .toDF("start", "end")
  .withColumn("interval", $"end" - $"start")
  .withColumn("interval [h]", $"interval" / lit("1 
hour").cast(CalendarIntervalType))
  .withColumn("rate [€/h]", lit(1.45))
  .withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
  .show(false)
+-------------------+-------------------+-----------------------------+------------+----------+----------+
|start              |end                |interval                     |interval [h]|rate [€/h]|price [€] |
+-------------------+-------------------+-----------------------------+------------+----------+----------+
|2020-02-01 12:00:00|2020-02-01 13:30:25|1 hours 30 minutes 25 seconds|1.5069      |1.45      |2.18506943|
+-------------------+-------------------+-----------------------------+------------+----------+----------+
{code}

The currently available approach is

{code}
Seq((Timestamp.valueOf("2020-02-01 12:00:00"), Timestamp.valueOf("2020-02-01 
13:30:25")))
  .toDF("start", "end")
  .withColumn("interval [s]", unix_timestamp($"end") - unix_timestamp($"start"))
  .withColumn("interval [h]", $"interval [s]" / 3600)
  .withColumn("rate [€/h]", lit(1.45))
  .withColumn("price [€]", $"interval [h]" * $"rate [€/h]")
  .show(false)
{code}

Going through {{unix_timestamp}} is a hack, and it pollutes the SQL query with 
unrelated semantics (the unix timestamp is completely irrelevant for this 
computation). It is merely there because there is currently no way to access 
the length of a {{CalendarInterval}}. Dividing an interval by another interval 
provides a means to measure the length in an arbitrary unit (minutes, hours, 
quarter hours).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31055) Update config docs for shuffle local host reads to have dep on external shuffle service

2020-03-05 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-31055:
-

 Summary: Update config docs for shuffle local host reads to have 
dep on external shuffle service
 Key: SPARK-31055
 URL: https://issues.apache.org/jira/browse/SPARK-31055
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Thomas Graves


With SPARK-27651 we now support host-local reads for shuffle, but only when 
the external shuffle service is enabled. Update the config docs to state that.
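
As a hedged illustration only (the exact host-local-read configuration key is deliberately not named here), the dependency the docs should call out is the external shuffle service switch:
{code:scala}
// Sketch: host-local shuffle disk reads depend on the external shuffle service
// running on each node; spark.shuffle.service.enabled is the standard switch for it.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("host-local-shuffle-read-example")
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()
{code}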



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31054) Turn on deprecation in Scala REPL/spark-shell by default

2020-03-05 Thread wuyi (Jira)
wuyi created SPARK-31054:


 Summary: Turn on deprecation in Scala REPL/spark-shell  by default
 Key: SPARK-31054
 URL: https://issues.apache.org/jira/browse/SPARK-31054
 Project: Spark
  Issue Type: Improvement
  Components: Spark Shell
Affects Versions: 3.0.0
Reporter: wuyi


Turn on deprecation in the Scala REPL/spark-shell by default, so users can 
always see the details about deprecated APIs.
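
Until that is the default, a user can already opt in per session; a small sketch using the standard Scala REPL command (not a Spark-specific setting):
{code:scala}
// Inside an existing spark-shell session:
scala> :settings -deprecation

// Any subsequent use of a deprecated API now prints the full deprecation message, e.g.
scala> spark.range(1).registerTempTable("t")   // registerTempTable has been deprecated since 2.0
{code}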



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host

2020-03-05 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052251#comment-17052251
 ] 

Thomas Graves commented on SPARK-27651:
---

Thanks, that makes sense.

I can look at the code in more detail, but I assume the executors could ask 
the other executors for the directory list, rather than going to the external 
shuffle service, if we wanted to support it.

> Avoid the network when block manager fetches shuffle blocks from the same host
> --
>
> Key: SPARK-27651
> URL: https://issues.apache.org/jira/browse/SPARK-27651
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.0
>
>
> When a shuffle block (content) is fetched the network is always used even 
> when it is fetched from the external shuffle service running on the same 
> host. This can be avoided by getting the local directories of the same host 
> executors from the external shuffle service and accessing those blocks from 
> the disk directly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31053) mark connector API as Evolving

2020-03-05 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-31053:
---

 Summary: mark connector API as Evolving
 Key: SPARK-31053
 URL: https://issues.apache.org/jira/browse/SPARK-31053
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31034) ShuffleBlockFetcherIterator may can't create request for last group

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31034.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27786
[https://github.com/apache/spark/pull/27786]

> ShuffleBlockFetcherIterator may can't create request for last group
> ---
>
> Key: SPARK-31034
> URL: https://issues.apache.org/jira/browse/SPARK-31034
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> When the size of all blocks is less than targetRemoteRequestSize and the size 
> of the last block group is less than maxBlocksInFlightPerAddress, 
> ShuffleBlockFetcherIterator will not create a request for the last group.
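
As an illustration of the bug class only (simplified, with assumed names; this is not the actual ShuffleBlockFetcherIterator code):
{code:scala}
// Simplified sketch of grouping blocks into fetch requests; the point is the final flush.
case class Block(id: String, size: Long)

def groupIntoRequests(
    blocks: Seq[Block],
    targetRequestSize: Long,
    maxBlocksPerRequest: Int): Seq[Seq[Block]] = {
  val requests = Seq.newBuilder[Seq[Block]]
  var current = Vector.empty[Block]
  var currentBytes = 0L
  blocks.foreach { b =>
    current :+= b
    currentBytes += b.size
    if (currentBytes >= targetRequestSize || current.size >= maxBlocksPerRequest) {
      requests += current
      current = Vector.empty
      currentBytes = 0L
    }
  }
  // Without this final flush, a small trailing group is silently dropped,
  // which is the symptom described in this ticket.
  if (current.nonEmpty) {
    requests += current
  }
  requests.result()
}
{code}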



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31034) ShuffleBlockFetcherIterator may can't create request for last group

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31034:
---

Assignee: wuyi

> ShuffleBlockFetcherIterator may can't create request for last group
> ---
>
> Key: SPARK-31034
> URL: https://issues.apache.org/jira/browse/SPARK-31034
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> When the size of all blocks is less than targetRemoteRequestSize and the size 
> of the last block group is less than maxBlocksInFlightPerAddress, 
> ShuffleBlockFetcherIterator will not create a request for the last group.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31052) Fix flaky test of SPARK-30388

2020-03-05 Thread wuyi (Jira)
wuyi created SPARK-31052:


 Summary: Fix flaky test of SPARK-30388
 Key: SPARK-31052
 URL: https://issues.apache.org/jira/browse/SPARK-31052
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: wuyi


Test "shuffle fetch failed on speculative task, but original task succeed" is 
flaky.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31005) Support time zone ids in casting strings to timestamps

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31005.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27753
[https://github.com/apache/spark/pull/27753]

> Support time zone ids in casting strings to timestamps
> --
>
> Key: SPARK-31005
> URL: https://issues.apache.org/jira/browse/SPARK-31005
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, Spark supports only time zone offsets in the formats:
> * -[h]h:[m]m
> * +[h]h:[m]m
> * Z
> The ticket aims to support any valid time zone ids at the end of timestamp 
> strings, for instance:
> {code}
> 2015-03-18T12:03:17.123456 Europe/Moscow
> {code}
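
A small sketch of the behaviour this ticket targets (hedged; it assumes the feature described above is in place and an active spark-shell session providing {{spark}}):
{code:scala}
// With zone-id support, a trailing region-based zone id in the string is honoured
// when casting to timestamp, instead of making the cast fail or return NULL.
spark.sql(
  "SELECT CAST('2015-03-18T12:03:17.123456 Europe/Moscow' AS TIMESTAMP)"
).show(false)
{code}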



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31005) Support time zone ids in casting strings to timestamps

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31005:
---

Assignee: Maxim Gekk

> Support time zone ids in casting strings to timestamps
> --
>
> Key: SPARK-31005
> URL: https://issues.apache.org/jira/browse/SPARK-31005
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Currently, Spark supports only time zone offsets in the formats:
> * -[h]h:[m]m
> * +[h]h:[m]m
> * Z
> The ticket aims to support any valid time zone ids at the end of timestamp 
> strings, for instance:
> {code}
> 2015-03-18T12:03:17.123456 Europe/Moscow
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31051) Thriftserver operations other than SparkExecuteStatementOperation do not call onOperationClosed

2020-03-05 Thread Juliusz Sompolski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski resolved SPARK-31051.
---
Resolution: Not A Problem

I'm just blind. They are there.

> Thriftserver operations other than SparkExecuteStatementOperation do not call 
> onOperationClosed
> ---
>
> Key: SPARK-31051
> URL: https://issues.apache.org/jira/browse/SPARK-31051
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> In Spark 3.0, onOperationClosed was implemented in HiveThriftServer2Listener 
> to track closing the operation in the thriftserver (after the client finishes 
> fetching).
> However, it seems that only SparkExecuteStatementOperation calls it in its 
> close() function. Other operations need to do this as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31051) Thriftserver operations other than SparkExecuteStatementOperation do not call onOperationClosed

2020-03-05 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-31051:
-

 Summary: Thriftserver operations other than 
SparkExecuteStatementOperation do not call onOperationClosed
 Key: SPARK-31051
 URL: https://issues.apache.org/jira/browse/SPARK-31051
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


In Spark 3.0, onOperationClosed was implemented in HiveThriftServer2Listener to 
track closing the operation in the thriftserver (after the client finishes 
fetching).
However, it seems that only SparkExecuteStatementOperation calls it in its 
close() function. Other operations need to do this as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30100) Decimal Precision Inferred from JDBC via Spark

2020-03-05 Thread Joby Joje (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052050#comment-17052050
 ] 

Joby Joje commented on SPARK-30100:
---

[~hyukjin.kwon] Is there any workaround for this precision data loss?

> Decimal Precision Inferred from JDBC via Spark
> --
>
> Key: SPARK-30100
> URL: https://issues.apache.org/jira/browse/SPARK-30100
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Joby Joje
>Priority: Major
>
> When trying to load data from JDBC (Oracle) into Spark, there seems to be 
> precision loss in the decimal field; as per my understanding, Spark supports 
> *DECIMAL(38,18)*. The field from Oracle is DECIMAL(38,14), whereas Spark 
> rounds off the last four digits, making it a precision of DECIMAL(38,10). This 
> is happening to a few fields in the dataframe where the column is fetched 
> using a CASE statement, whereas in the same query another field populates the 
> right schema.
> I tried to pass the
> {code:java}
> spark.sql.decimalOperations.allowPrecisionLoss=false{code}
> conf in spark-submit, though I didn't get the desired results.
> {code:java}
> jdbcDF = spark.read \ 
> .format("jdbc") \ 
> .option("url", "ORACLE") \ 
> .option("dbtable", "QUERY") \ 
> .option("user", "USERNAME") \ 
> .option("password", "PASSWORD") \ 
> .load(){code}
> So, considering that Spark infers the schema from sample records, how does 
> this work here? Does it use the results of the query, i.e. (SELECT * FROM 
> TABLE_NAME JOIN ...), or does it take a different route to guess the schema 
> for itself? Can someone shed some light on this and advise how to achieve the 
> right decimal precision in this regard without manipulating the query? Doing 
> a CAST in the query does solve the issue, but I would prefer some 
> alternatives.
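
One possible direction, offered only as a hedged sketch and not a confirmed fix for this ticket: pin the decimal type explicitly with the JDBC reader's {{customSchema}} option instead of relying on inference. The column name below is hypothetical and the connection options are placeholders from the report.
{code:scala}
// Sketch: "amount" is a hypothetical column name used only for illustration;
// assumes an existing SparkSession named `spark`.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "ORACLE")
  .option("dbtable", "QUERY")
  .option("user", "USERNAME")
  .option("password", "PASSWORD")
  .option("customSchema", "amount DECIMAL(38,14)")
  .load()
{code}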



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31046) Make more efficient and clean up AQE update UI code

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31046.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27799
[https://github.com/apache/spark/pull/27799]

> Make more efficient and clean up AQE update UI code
> ---
>
> Key: SPARK-31046
> URL: https://issues.apache.org/jira/browse/SPARK-31046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31046) Make more efficient and clean up AQE update UI code

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31046:
---

Assignee: Wei Xue

> Make more efficient and clean up AQE update UI code
> ---
>
> Key: SPARK-31046
> URL: https://issues.apache.org/jira/browse/SPARK-31046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host

2020-03-05 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27651:
---
Description: When a shuffle block (content) is fetched the network is 
always used even when it is fetched from the external shuffle service running 
on the same host. This can be avoided by getting the local directories of the 
same host executors from the external shuffle service and accessing those 
blocks from the disk directly.  (was: When a shuffle block (content) is fetched 
the network is always used even when it is fetched from an executor (or the 
external shuffle service) running on the same host.)

> Avoid the network when block manager fetches shuffle blocks from the same host
> --
>
> Key: SPARK-27651
> URL: https://issues.apache.org/jira/browse/SPARK-27651
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.0
>
>
> When a shuffle block (content) is fetched the network is always used even 
> when it is fetched from the external shuffle service running on the same 
> host. This can be avoided by getting the local directories of the same host 
> executors from the external shuffle service and accessing those blocks from 
> the disk directly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host

2020-03-05 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051995#comment-17051995
 ] 

Attila Zsolt Piros commented on SPARK-27651:


Yes, the final implementation works only when the external shuffle service is 
used, as the local directories of the other host-local executors are requested 
from the external shuffle service.
The initial implementation, when the PR was opened, used the driver to get the 
host-local directories.

The technical reasons for asking the external shuffle service were:
 * decreasing network pressure on the driver (the main reason);
 * getting rid of an unbounded map (or a bounded one, which would need complex 
fallback logic at the fetcher) from executors to local dirs. Keeping such a map 
would also be redundant, as this information is already available at the 
external shuffle service, just stored in a distributed way: a running external 
shuffle service process stores executor data only for the executors on the same 
host.

> Avoid the network when block manager fetches shuffle blocks from the same host
> --
>
> Key: SPARK-27651
> URL: https://issues.apache.org/jira/browse/SPARK-27651
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.0
>
>
> When a shuffle block (content) is fetched the network is always used even 
> when it is fetched from an executor (or the external shuffle service) running 
> on the same host.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31024) Allow specifying session catalog name (spark_catalog) in qualified column names

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31024.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27776
[https://github.com/apache/spark/pull/27776]

> Allow specifying session catalog name (spark_catalog) in qualified column 
> names
> ---
>
> Key: SPARK-31024
> URL: https://issues.apache.org/jira/browse/SPARK-31024
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the user cannot specify the session catalog name when using 
> qualified column names for v1 tables:
> {code:java}
> SELECT spark_catalog.default.t.i FROM spark_catalog.default.t
> {code}
> fails with "cannot resolve '`spark_catalog.default.t.i`".
> This is inconsistent with v2 tables where catalog name can be used:
> {code:java}
> SELECT testcat.ns1.tbl.id FROM testcat.ns1.tbl.id
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31024) Allow specifying session catalog name (spark_catalog) in qualified column names

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31024:
---

Assignee: Terry Kim

> Allow specifying session catalog name (spark_catalog) in qualified column 
> names
> ---
>
> Key: SPARK-31024
> URL: https://issues.apache.org/jira/browse/SPARK-31024
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>
> Currently, the user cannot specify the session catalog name when using 
> qualified column names for v1 tables:
> {code:java}
> SELECT spark_catalog.default.t.i FROM spark_catalog.default.t
> {code}
> fails with "cannot resolve '`spark_catalog.default.t.i`".
> This is inconsistent with v2 tables where catalog name can be used:
> {code:java}
> SELECT testcat.ns1.tbl.id FROM testcat.ns1.tbl.id
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15798) Secondary sort in Dataset/DataFrame

2020-03-05 Thread Melitta Dragaschnig (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051990#comment-17051990
 ] 

Melitta Dragaschnig commented on SPARK-15798:
-

Hi all,

I am a frequent user of tresata's spark-sorted library (thank you [~koert]!) to 
get secondary sort functionality (for large groups, in order to avoid memory 
issues), so I tried to figure out whether there are plans to merge this useful 
functionality into the core library.

After checking the progression of this Jira issue and seeing that it is marked 
as Incomplete and has been closed with no Fix Version given, my conclusion is 
that this is presently not provided by the core library and that it is 
advisable to continue using spark-sorted for the time being. Is my assumption 
correct?

Any further information on ways to stay informed about the development of this 
topic would also be greatly appreciated!

> Secondary sort in Dataset/DataFrame
> ---
>
> Key: SPARK-15798
> URL: https://issues.apache.org/jira/browse/SPARK-15798
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: koert kuipers
>Priority: Major
>  Labels: bulk-closed
>
> Secondary sort for Spark RDDs was discussed in 
> https://issues.apache.org/jira/browse/SPARK-3655
> Since the RDD API allows for easy extensions outside the core library this 
> was implemented separately here:
> https://github.com/tresata/spark-sorted
> However, it seems to me that with Dataset, an implementation of such a feature 
> in a 3rd-party library is not really an option.
> Dataset already has methods that suggest a secondary sort is present, such as 
> in KeyValueGroupedDataset:
> {noformat}
> def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): 
> Dataset[U]
> {noformat}
> This operation pushes all the data to the reducer, something you would only 
> want to do if you need the elements in a particular order.
> How about as an API sortBy methods in KeyValueGroupedDataset and 
> RelationalGroupedDataset?
> {noformat}
> dataFrame.groupBy("a").sortBy("b").fold(...)
> {noformat}
> (yes, I know RelationalGroupedDataset doesn't have a fold yet... but it should 
> :))
> {noformat}
> dataset.groupBy(_._1).sortBy(_._3).flatMapGroups(...)
> {noformat}
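
As a hedged sketch only (not the proposed API): with the existing Dataset API, a secondary-sort-like pattern can be approximated by co-partitioning on the key and sorting within partitions, so each partition streams a key's rows in order. The schema below is an assumed example.
{code:scala}
import org.apache.spark.sql.SparkSession

// Assumed example schema for illustration.
case class Event(key: String, ts: Long, value: Double)

val spark = SparkSession.builder().appName("secondary-sort-sketch").getOrCreate()
import spark.implicits._

val events = Seq(
  Event("a", 3L, 1.0),
  Event("a", 1L, 2.0),
  Event("b", 2L, 3.0)
).toDS()

// All rows of a key land in the same partition and arrive ordered by (key, ts);
// downstream mapPartitions logic can then fold each group without buffering it whole.
val ordered = events
  .repartition($"key")
  .sortWithinPartitions($"key", $"ts")

ordered.show(false)
{code}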



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-03-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30668.
-
Resolution: Fixed

Issue resolved by pull request 27537
[https://github.com/apache/spark/pull/27537]

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz"
> 
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Maxim Gekk
>Priority: Blocker
> Fix For: 3.0.0
>
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This returns a valid value in Spark 2.4 but returns NULL in the latest 
> master.
> **2.4.5 RC2**
> {code}
> scala> sql("""SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz")""").show
> +----------------------------------------------------------------------------+
> |to_timestamp('2020-01-27T20:06:11.847-0800', 'yyyy-MM-dd\'T\'HH:mm:ss.SSSz')|
> +----------------------------------------------------------------------------+
> |2020-01-27 20:06:11                                                          |
> +----------------------------------------------------------------------------+
> {code}
> **2.2.3 ~ 2.4.4** (2.0.2 ~ 2.1.3 doesn't have `to_timestamp`).
> {code}
> spark-sql> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz");
> 2020-01-27 20:06:11
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31050) Disable flaky KafkaDelegationTokenSuite

2020-03-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31050.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27789
[https://github.com/apache/spark/pull/27789]

> Disable flaky KafkaDelegationTokenSuite
> ---
>
> Key: SPARK-31050
> URL: https://issues.apache.org/jira/browse/SPARK-31050
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Disable flaky KafkaDelegationTokenSuite since it's too flaky.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31050) Disable flaky KafkaDelegationTokenSuite

2020-03-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31050:
-

Assignee: wuyi

> Disable flaky KafkaDelegationTokenSuite
> ---
>
> Key: SPARK-31050
> URL: https://issues.apache.org/jira/browse/SPARK-31050
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> Disable flaky KafkaDelegationTokenSuite since it's too flaky.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30541) Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite

2020-03-05 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051877#comment-17051877
 ] 

Dongjoon Hyun commented on SPARK-30541:
---

Please see the PR. This has been decided to be a blocker for 3.0.0.

> Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite
> ---
>
> Key: SPARK-30541
> URL: https://issues.apache.org/jira/browse/SPARK-30541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>
> The test suite has been failing intermittently as of now:
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116862/testReport/]
>  
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.(It is not a test it 
> is a sbt.testing.SuiteSelector)
>   
> {noformat}
> Error Details
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 3939 times over 
> 1.000122353532 minutes. Last failure message: KeeperErrorCode = 
> AuthFailed for /brokers/ids.
> Stack Trace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 3939 times over 
> 1.000122353532 minutes. Last failure message: KeeperErrorCode = 
> AuthFailed for /brokers/ids.
>   at 
> org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432)
>   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439)
>   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
>   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:337)
>   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:336)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:292)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:58)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: sbt.ForkMain$ForkError: 
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = 
> AuthFailed for /brokers/ids
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>   at 
> kafka.zookeeper.AsyncResponse.resultException(ZooKeeperClient.scala:554)
>   at kafka.zk.KafkaZkClient.getChildren(KafkaZkClient.scala:719)
>   at kafka.zk.KafkaZkClient.getSortedBrokerList(KafkaZkClient.scala:455)
>   at 
> kafka.zk.KafkaZkClient.getAllBrokersInCluster(KafkaZkClient.scala:404)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.$anonfun$setup$3(KafkaTestUtils.scala:293)
>   at 
> org.scalatest.concurrent.Eventually.makeAValiantAttempt$1(Eventually.scala:395)
>   at 
> org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:409)
>   ... 20 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30541) Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite

2020-03-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30541:
--
Target Version/s: 3.0.0

> Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite
> ---
>
> Key: SPARK-30541
> URL: https://issues.apache.org/jira/browse/SPARK-30541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>
> The test suite has been failing intermittently as of now:
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116862/testReport/]
>  
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.(It is not a test it 
> is a sbt.testing.SuiteSelector)
>   
> {noformat}
> Error Details
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 3939 times over 
> 1.000122353532 minutes. Last failure message: KeeperErrorCode = 
> AuthFailed for /brokers/ids.
> Stack Trace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 3939 times over 
> 1.000122353532 minutes. Last failure message: KeeperErrorCode = 
> AuthFailed for /brokers/ids.
>   at 
> org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432)
>   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439)
>   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
>   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:337)
>   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:336)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:292)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:58)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: sbt.ForkMain$ForkError: 
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = 
> AuthFailed for /brokers/ids
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>   at 
> kafka.zookeeper.AsyncResponse.resultException(ZooKeeperClient.scala:554)
>   at kafka.zk.KafkaZkClient.getChildren(KafkaZkClient.scala:719)
>   at kafka.zk.KafkaZkClient.getSortedBrokerList(KafkaZkClient.scala:455)
>   at 
> kafka.zk.KafkaZkClient.getAllBrokersInCluster(KafkaZkClient.scala:404)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.$anonfun$setup$3(KafkaTestUtils.scala:293)
>   at 
> org.scalatest.concurrent.Eventually.makeAValiantAttempt$1(Eventually.scala:395)
>   at 
> org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:409)
>   ... 20 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31050) Disable flaky KafkaDelegationTokenSuite

2020-03-05 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-31050:
-
Description: Disable flaky KafkaDelegationTokenSuite since it's too flaky.

> Disable flaky KafkaDelegationTokenSuite
> ---
>
> Key: SPARK-31050
> URL: https://issues.apache.org/jira/browse/SPARK-31050
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
> Environment: Disable flaky KafkaDelegationTokenSuite since it's too 
> flaky.
>Reporter: wuyi
>Priority: Major
>
> Disable flaky KafkaDelegationTokenSuite since it's too flaky.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31050) Disable flaky KafkaDelegationTokenSuite

2020-03-05 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-31050:
-
Environment: (was: Disable flaky KafkaDelegationTokenSuite since it's 
too flaky.)

> Disable flaky KafkaDelegationTokenSuite
> ---
>
> Key: SPARK-31050
> URL: https://issues.apache.org/jira/browse/SPARK-31050
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: wuyi
>Priority: Major
>
> Disable flaky KafkaDelegationTokenSuite since it's too flaky.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org