[jira] [Assigned] (SPARK-31597) extracting day from intervals should be interval.days + days in interval.microsecond
[ https://issues.apache.org/jira/browse/SPARK-31597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31597: --- Assignee: Kent Yao > extracting day from intervals should be interval.days + days in > interval.microsecond > > > Key: SPARK-31597 > URL: https://issues.apache.org/jira/browse/SPARK-31597 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > Checked with both Presto and PostgreSQL: one implements intervals with > ANSI-style year-month/day-time types, the other is mixed and non-ANSI. Both > add the whole days carried in the interval's time part to the interval's day > count when extracting the day field from an interval value. > > ```sql > presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - > cast('2020-01-01 00:00:00' as timestamp))); > _col0 > --- > 14 > (1 row) > Query 20200428_135239_0_ahn7x, FINISHED, 1 node > Splits: 17 total, 17 done (100.00%) > 0:01 [0 rows, 0B] [0 rows/s, 0B/s] > presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - > cast('2020-01-01 00:00:01' as timestamp))); > _col0 > --- > 13 > (1 row) > Query 20200428_135246_1_ahn7x, FINISHED, 1 node > Splits: 17 total, 17 done (100.00%) > 0:00 [0 rows, 0B] [0 rows/s, 0B/s] > presto> > ``` > ```sql > postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) > - cast('2020-01-01 00:00:00' as timestamp))); > date_part > --- > 14 > (1 row) > postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) > - cast('2020-01-01 00:00:01' as timestamp))); > date_part > --- > 13 > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
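A minimal Spark-side sketch of the behavior being requested, assuming an existing Spark 3.0 `SparkSession` named `spark` (the session name and the use of `spark.sql` here are illustrative, not part of the ticket):

```scala
// Assumes an existing SparkSession named `spark` on Spark 3.0+.
val day = spark.sql(
  """SELECT EXTRACT(DAY FROM (CAST('2020-01-15 00:00:00' AS TIMESTAMP)
    |                       - CAST('2020-01-01 00:00:01' AS TIMESTAMP))) AS day
    |""".stripMargin)
day.show()
// The difference is 13 days 23:59:59, so the expected output -- matching the
// Presto and PostgreSQL results above -- is 13: interval.days plus any whole
// days carried in interval.microseconds.
```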
[jira] [Resolved] (SPARK-31597) extracting day from intervals should be interval.days + days in interval.microsecond
[ https://issues.apache.org/jira/browse/SPARK-31597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31597. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28396 [https://github.com/apache/spark/pull/28396] > extracting day from intervals should be interval.days + days in > interval.microsecond > > > Key: SPARK-31597 > URL: https://issues.apache.org/jira/browse/SPARK-31597 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > Checked with both Presto and PostgreSQL: one implements intervals with > ANSI-style year-month/day-time types, the other is mixed and non-ANSI. Both > add the whole days carried in the interval's time part to the interval's day > count when extracting the day field from an interval value. > > ```sql > presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - > cast('2020-01-01 00:00:00' as timestamp))); > _col0 > --- > 14 > (1 row) > Query 20200428_135239_0_ahn7x, FINISHED, 1 node > Splits: 17 total, 17 done (100.00%) > 0:01 [0 rows, 0B] [0 rows/s, 0B/s] > presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - > cast('2020-01-01 00:00:01' as timestamp))); > _col0 > --- > 13 > (1 row) > Query 20200428_135246_1_ahn7x, FINISHED, 1 node > Splits: 17 total, 17 done (100.00%) > 0:00 [0 rows, 0B] [0 rows/s, 0B/s] > presto> > ``` > ```sql > postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) > - cast('2020-01-01 00:00:00' as timestamp))); > date_part > --- > 14 > (1 row) > postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) > - cast('2020-01-01 00:00:01' as timestamp))); > date_part > --- > 13 > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29492) SparkThriftServer can't support jar class as table serde class when executestatement in sync mode
[ https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29492. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26141 [https://github.com/apache/spark/pull/26141] > SparkThriftServer can't support jar class as table serde class when > executestatement in sync mode > -- > > Key: SPARK-29492 > URL: https://issues.apache.org/jira/browse/SPARK-29492 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.0.0 > > > Add UT in HiveThriftBinaryServerSuit: > {code} > test("jar in sync mode") { > withCLIServiceClient { client => > val user = System.getProperty("user.name") > val sessionHandle = client.openSession(user, "") > val confOverlay = new java.util.HashMap[java.lang.String, > java.lang.String] > val jarFile = HiveTestJars.getHiveHcatalogCoreJar().getCanonicalPath > Seq(s"ADD JAR $jarFile", > "CREATE TABLE smallKV(key INT, val STRING)", > s"LOAD DATA LOCAL INPATH '${TestData.smallKv}' OVERWRITE INTO TABLE > smallKV") > .foreach(query => client.executeStatement(sessionHandle, query, > confOverlay)) > client.executeStatement(sessionHandle, > """CREATE TABLE addJar(key string) > |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' > """.stripMargin, confOverlay) > client.executeStatement(sessionHandle, > "INSERT INTO TABLE addJar SELECT 'k1' as key FROM smallKV limit 1", > confOverlay) > val operationHandle = client.executeStatement( > sessionHandle, > "SELECT key FROM addJar", > confOverlay) > // Fetch result first time > assertResult(1, "Fetching result first time from next row") { > val rows_next = client.fetchResults( > operationHandle, > FetchOrientation.FETCH_NEXT, > 1000, > FetchType.QUERY_OUTPUT) > rows_next.numRows() > } > } > } > {code} > Run it then got ClassNotFound error. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29492) SparkThriftServer can't support jar class as table serde class when executestatement in sync mode
[ https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29492: --- Assignee: angerszhu > SparkThriftServer can't support jar class as table serde class when > executestatement in sync mode > -- > > Key: SPARK-29492 > URL: https://issues.apache.org/jira/browse/SPARK-29492 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > Add UT in HiveThriftBinaryServerSuit: > {code} > test("jar in sync mode") { > withCLIServiceClient { client => > val user = System.getProperty("user.name") > val sessionHandle = client.openSession(user, "") > val confOverlay = new java.util.HashMap[java.lang.String, > java.lang.String] > val jarFile = HiveTestJars.getHiveHcatalogCoreJar().getCanonicalPath > Seq(s"ADD JAR $jarFile", > "CREATE TABLE smallKV(key INT, val STRING)", > s"LOAD DATA LOCAL INPATH '${TestData.smallKv}' OVERWRITE INTO TABLE > smallKV") > .foreach(query => client.executeStatement(sessionHandle, query, > confOverlay)) > client.executeStatement(sessionHandle, > """CREATE TABLE addJar(key string) > |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' > """.stripMargin, confOverlay) > client.executeStatement(sessionHandle, > "INSERT INTO TABLE addJar SELECT 'k1' as key FROM smallKV limit 1", > confOverlay) > val operationHandle = client.executeStatement( > sessionHandle, > "SELECT key FROM addJar", > confOverlay) > // Fetch result first time > assertResult(1, "Fetching result first time from next row") { > val rows_next = client.fetchResults( > operationHandle, > FetchOrientation.FETCH_NEXT, > 1000, > FetchType.QUERY_OUTPUT) > rows_next.numRows() > } > } > } > {code} > Run it then got ClassNotFound error. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31596) Generate SQL Configurations from hive module to configuration doc
[ https://issues.apache.org/jira/browse/SPARK-31596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31596. -- Fix Version/s: 3.0.0 Assignee: Kent Yao Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28394 > Generate SQL Configurations from hive module to configuration doc > - > > Key: SPARK-31596 > URL: https://issues.apache.org/jira/browse/SPARK-31596 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.0.0 > > > ATT -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17095124#comment-17095124 ] angerszhu commented on SPARK-31602: --- cc [~cloud_fan] > memory leak of JobConf > -- > > Key: SPARK-31602 > URL: https://issues.apache.org/jira/browse/SPARK-31602 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2020-04-29-14-34-39-496.png, > image-2020-04-29-14-35-55-986.png > > > !image-2020-04-29-14-34-39-496.png! > !image-2020-04-29-14-35-55-986.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-31602: -- Description: !image-2020-04-29-14-34-39-496.png! !image-2020-04-29-14-35-55-986.png! was: !image-2020-04-29-14-34-39-496.png! Screen Shot 2020-04-29 at 2.08.28 PM > memory leak of JobConf > -- > > Key: SPARK-31602 > URL: https://issues.apache.org/jira/browse/SPARK-31602 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2020-04-29-14-34-39-496.png, > image-2020-04-29-14-35-55-986.png > > > !image-2020-04-29-14-34-39-496.png! > !image-2020-04-29-14-35-55-986.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-31602: -- Description: !image-2020-04-29-14-34-39-496.png! Screen Shot 2020-04-29 at 2.08.28 PM was: !image-2020-04-29-14-30-46-213.png! !image-2020-04-29-14-30-55-964.png! > memory leak of JobConf > -- > > Key: SPARK-31602 > URL: https://issues.apache.org/jira/browse/SPARK-31602 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2020-04-29-14-34-39-496.png, > image-2020-04-29-14-35-55-986.png > > > !image-2020-04-29-14-34-39-496.png! > Screen Shot 2020-04-29 at 2.08.28 PM -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-31602: -- Attachment: image-2020-04-29-14-35-55-986.png > memory leak of JobConf > -- > > Key: SPARK-31602 > URL: https://issues.apache.org/jira/browse/SPARK-31602 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2020-04-29-14-34-39-496.png, > image-2020-04-29-14-35-55-986.png > > > !image-2020-04-29-14-34-39-496.png! > Screen Shot 2020-04-29 at 2.08.28 PM -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-31602: -- Attachment: image-2020-04-29-14-34-39-496.png > memory leak of JobConf > -- > > Key: SPARK-31602 > URL: https://issues.apache.org/jira/browse/SPARK-31602 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2020-04-29-14-34-39-496.png > > > !image-2020-04-29-14-30-46-213.png! > > !image-2020-04-29-14-30-55-964.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31602) memory leak of JobConf
[ https://issues.apache.org/jira/browse/SPARK-31602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17095122#comment-17095122 ] angerszhu commented on SPARK-31602: --- In HadoopRDD, if you don't set spark.hadoop.cloneConf=true, it puts a new JobConf into the cached metadata and never removes it; maybe we should add a clear method? {code:java} // Returns a JobConf that will be used on slaves to obtain input splits for Hadoop reads. protected def getJobConf(): JobConf = { val conf: Configuration = broadcastedConf.value.value if (shouldCloneJobConf) { // Hadoop Configuration objects are not thread-safe, which may lead to various problems if // one job modifies a configuration while another reads it (SPARK-2546). This problem occurs // somewhat rarely because most jobs treat the configuration as though it's immutable. One // solution, implemented here, is to clone the Configuration object. Unfortunately, this // clone can be very expensive. To avoid unexpected performance regressions for workloads and // Hadoop versions that do not suffer from these thread-safety issues, this cloning is // disabled by default. HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized { logDebug("Cloning Hadoop Configuration") val newJobConf = new JobConf(conf) if (!conf.isInstanceOf[JobConf]) { initLocalJobConfFuncOpt.foreach(f => f(newJobConf)) } newJobConf } } else { if (conf.isInstanceOf[JobConf]) { logDebug("Re-using user-broadcasted JobConf") conf.asInstanceOf[JobConf] } else { Option(HadoopRDD.getCachedMetadata(jobConfCacheKey)) .map { conf => logDebug("Re-using cached JobConf") conf.asInstanceOf[JobConf] } .getOrElse { // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in // the local process. The local cache is accessed through HadoopRDD.putCachedMetadata(). // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary // objects. Synchronize to prevent ConcurrentModificationException (SPARK-1097, // HADOOP-10456). HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized { logDebug("Creating new JobConf and caching it for later re-use") val newJobConf = new JobConf(conf) initLocalJobConfFuncOpt.foreach(f => f(newJobConf)) HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf) newJobConf } } } } } {code} There is no removal path for this cached job metadata: {code:java} /** * The three methods below are helpers for accessing the local map, a property of the SparkEnv of * the local process. */ def getCachedMetadata(key: String): Any = SparkEnv.get.hadoopJobMetadata.get(key) private def putCachedMetadata(key: String, value: Any): Unit = SparkEnv.get.hadoopJobMetadata.put(key, value) {code} For SQL on Hive data, each partition generates one JobConf, which is heavy. > memory leak of JobConf > -- > > Key: SPARK-31602 > URL: https://issues.apache.org/jira/browse/SPARK-31602 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2020-04-29-14-34-39-496.png > > > !image-2020-04-29-14-30-46-213.png! > > !image-2020-04-29-14-30-55-964.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
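A sketch of the kind of "clear method" suggested above, mirroring the `getCachedMetadata`/`putCachedMetadata` helpers quoted in the comment. The method name and the idea of calling it from an unpersist/cleanup path are assumptions for illustration, not an existing Spark API; `hadoopJobMetadata` is package-private, so this would have to live inside the `HadoopRDD` companion object:

```scala
// Hypothetical counterpart to putCachedMetadata: evict a cached JobConf once the
// corresponding HadoopRDD is no longer needed, instead of keeping it for the
// lifetime of the application.
private[spark] def removeCachedMetadata(key: String): Unit =
  SparkEnv.get.hadoopJobMetadata.remove(key)
```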
[jira] [Created] (SPARK-31602) memory leak of JobConf
angerszhu created SPARK-31602: - Summary: memory leak of JobConf Key: SPARK-31602 URL: https://issues.apache.org/jira/browse/SPARK-31602 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: angerszhu !image-2020-04-29-14-30-46-213.png! !image-2020-04-29-14-30-55-964.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31601) Fix spark.kubernetes.executor.podNamePrefix to work
Dongjoon Hyun created SPARK-31601: - Summary: Fix spark.kubernetes.executor.podNamePrefix to work Key: SPARK-31601 URL: https://issues.apache.org/jira/browse/SPARK-31601 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31480) Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
[ https://issues.apache.org/jira/browse/SPARK-31480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17095093#comment-17095093 ] Xiao Li commented on SPARK-31480: - let me assign it to [~dkbiswal] > Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node > --- > > Key: SPARK-31480 > URL: https://issues.apache.org/jira/browse/SPARK-31480 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > Below is the EXPLAIN OUTPUT when using the *DSV2* > *Output of EXPLAIN EXTENDED* > {code:java} > +- BatchScan[col.dots#39L] JsonScan DataFilters: [isnotnull(col.dots#39L), > (col.dots#39L = 500)], Location: > InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-7dad6f63-dc..., > PartitionFilters: [], ReadSchema: struct > {code} > *Output of EXPLAIN FORMATTED* > {code:java} > (1) BatchScan > Output [1]: [col.dots#39L] > Arguments: [col.dots#39L], > JsonScan(org.apache.spark.sql.test.TestSparkSession@45eab322,org.apache.spark.sql.execution.datasources.InMemoryFileIndex@72065f16,StructType(StructField(col.dots,LongType,true)),StructType(StructField(col.dots,LongType,true)),StructType(),org.apache.spark.sql.util.CaseInsensitiveStringMap@8822c5e0,Vector(),List(isnotnull(col.dots#39L), > (col.dots#39L = 500))) > {code} > When using *DSV1*, the output is much cleaner than the output of DSV2, > especially for EXPLAIN FORMATTED. > *Output of EXPLAIN EXTENDED* > {code:java} > +- FileScan json [col.dots#37L] Batched: false, DataFilters: > [isnotnull(col.dots#37L), (col.dots#37L = 500)], Format: JSON, Location: > InMemoryFileIndex[file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-59..., > PartitionFilters: [], PushedFilters: [IsNotNull(`col.dots`), > EqualTo(`col.dots`,500)], ReadSchema: struct > {code} > *Output of EXPLAIN FORMATTED* > {code:java} > (1) Scan json > Output [1]: [col.dots#37L] > Batched: false > Location: InMemoryFileIndex > [file:/private/var/folders/nr/j6hw4kr51wv0zynvr6srwgr0gp/T/spark-89021d76-5971-4a96-bf10-0730873f6ce0] > PushedFilters: [IsNotNull(`col.dots`), EqualTo(`col.dots`,500)] > ReadSchema: struct{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
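For context, a small sketch of how the two outputs above can be reproduced on Spark 3.0; the input path is hypothetical, and setting `spark.sql.sources.useV1SourceList` to an empty string (to force the DSV2 `BatchScan` path for the built-in JSON source) is an assumption about the configuration behind the ticket:

```scala
import org.apache.spark.sql.SparkSession

// An empty useV1SourceList routes built-in file sources through DSV2 (BatchScan);
// leaving it at the default keeps the DSV1 FileScan shown in the second output.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("explain-formatted-demo")
  .config("spark.sql.sources.useV1SourceList", "")
  .getOrCreate()

val path = "/tmp/json-with-dotted-column" // hypothetical scratch location
spark.range(1000).selectExpr("id AS `col.dots`").write.mode("overwrite").json(path)

val df = spark.read.json(path).filter("`col.dots` = 500")
df.explain("extended")  // verbose plan, like the EXPLAIN EXTENDED output above
df.explain("formatted") // numbered-node form, like the EXPLAIN FORMATTED output
spark.stop()
```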
[jira] [Assigned] (SPARK-31567) Update AppVeyor Rtools to 4.0.0
[ https://issues.apache.org/jira/browse/SPARK-31567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31567: Assignee: Dongjoon Hyun > Update AppVeyor Rtools to 4.0.0 > --- > > Key: SPARK-31567 > URL: https://issues.apache.org/jira/browse/SPARK-31567 > Project: Spark > Issue Type: Improvement > Components: SparkR, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > This is a preparation for upgrade to R. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31567) Update AppVeyor Rtools to 4.0.0
[ https://issues.apache.org/jira/browse/SPARK-31567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31567. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28358 [https://github.com/apache/spark/pull/28358] > Update AppVeyor Rtools to 4.0.0 > --- > > Key: SPARK-31567 > URL: https://issues.apache.org/jira/browse/SPARK-31567 > Project: Spark > Issue Type: Improvement > Components: SparkR, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > > This is a preparation for upgrade to R. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31567) Update AppVeyor Rtools to 4.0.0
[ https://issues.apache.org/jira/browse/SPARK-31567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31567: -- Summary: Update AppVeyor Rtools to 4.0.0 (was: Update AppVeyor R version to 4.0.0) > Update AppVeyor Rtools to 4.0.0 > --- > > Key: SPARK-31567 > URL: https://issues.apache.org/jira/browse/SPARK-31567 > Project: Spark > Issue Type: Improvement > Components: SparkR, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31567) Update AppVeyor Rtools to 4.0.0
[ https://issues.apache.org/jira/browse/SPARK-31567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31567: -- Description: This is a preparation for upgrade to R. > Update AppVeyor Rtools to 4.0.0 > --- > > Key: SPARK-31567 > URL: https://issues.apache.org/jira/browse/SPARK-31567 > Project: Spark > Issue Type: Improvement > Components: SparkR, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > > This is a preparation for upgrade to R. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-31591) namePrefix could be null in Utils.createDirectory
[ https://issues.apache.org/jira/browse/SPARK-31591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-31591. - > namePrefix could be null in Utils.createDirectory > - > > Key: SPARK-31591 > URL: https://issues.apache.org/jira/browse/SPARK-31591 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Minor > > In our production, we find that many shuffle files could be located in > /hadoop/2/yarn/local/usercache/b_carmel/appcache/application_1586487864336_4602/*null*-107d4e9c-d3c7-419e-9743-a21dc4eaeb3f/3a > The Util.createDirectory() uses a default parameter "spark" > {code} > def createDirectory(root: String, namePrefix: String = "spark"): File = { > {code} > But in some cases, the actual namePrefix is null. If the method is called > with null, then the default value would not be applied. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31182) PairRDD support aggregateByKeyWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-31182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-31182. -- Resolution: Not A Problem > PairRDD support aggregateByKeyWithinPartitions > -- > > Key: SPARK-31182 > URL: https://issues.apache.org/jira/browse/SPARK-31182 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > When implementing `RobustScaler`, I was looking for a way to guarantee that > the {{QuantileSummaries}} used inside {{aggregateByKey}} are compressed before > network communication. > I only found a tricky workaround (which was not applied in the end), and > there was no direct method for this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31600) Error message from DataFrame creation is misleading.
Olexiy Oryeshko created SPARK-31600: --- Summary: Error message from DataFrame creation is misleading. Key: SPARK-31600 URL: https://issues.apache.org/jira/browse/SPARK-31600 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.5 Environment: DataBricks 6.4, Spark 2.4.5, Scala 2.11 Reporter: Olexiy Oryeshko *Description:* DataFrame creation from pandas.DataFrame fails when one of the features contains only NaN values (which is ok). However, error message mentions wrong feature as the culprit, which makes it hard to find the root cause. *How to reproduce:* {code:java} import numpy as np import pandas as pd df2 = pd.DataFrame({'a': np.array([np.nan, np.nan], dtype=np.object_), 'b': [np.nan, 'aaa']}) display(spark.createDataFrame(df2[['b']])) # Works fine spark.createDataFrame(df2) # Raises TypeError. {code} In the code above, column 'a' is bad. However, the `TypeError` raised in the last command mentions feature 'b' as the culprit: TypeError: field b: Can not merge type and -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30261) Should not change owner of hive table for some commands like 'alter' operation
[ https://issues.apache.org/jira/browse/SPARK-30261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-30261. -- Target Version/s: 2.4.3, 2.3.0 (was: 2.3.0, 2.4.3) Resolution: Duplicate > Should not change owner of hive table for some commands like 'alter' > operation > > > Key: SPARK-30261 > URL: https://issues.apache.org/jira/browse/SPARK-30261 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.0, 2.4.3 >Reporter: chenliang >Priority: Critical > > For SparkSQL,When we do some alter operations on hive table, the owner of > hive table would be changed to someone who invoked the operation, it's > unresonable. And in fact, the owner should not changed for the real > prodcution environment, otherwise the authority check is out of order. > The problem can be reproduced as described in the below: > 1.First I create a table with username='xie' and then \{{desc formatted table > }},the owner is 'xiepengjie' > {code:java} > spark-sql> desc formatted bigdata_test.tt1; > col_name data_type comment c int NULL > # Detailed Table Information > Database bigdata_test Table tt1 > Owner xie > Created Time Wed Sep 11 11:30:49 CST 2019 > Last Access Thu Jan 01 08:00:00 CST 1970 > Created By Spark 2.2 or prior > Type MANAGED > Provider hive > Table Properties [PART_LIMIT=1, transient_lastDdlTime=1568172649, > LEVEL=1, TTL=60] > Location hdfs://NS1/user/hive_admin/warehouse/bigdata_test.db/tt1 > Serde Library org.apache.hadoop.hive.ql.io.orc.OrcSerde > InputFormat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat > Storage Properties [serialization.format=1] > Partition Provider Catalog Time taken: 0.371 seconds, Fetched 18 row(s) > {code} > 2.Then I use another username='johnchen' and execute {{alter table > bigdata_test.tt1 set location > 'hdfs://NS1/user/hive_admin/warehouse/bigdata_test.db/tt1'}}, check the owner > of hive table is 'johnchen', it's unresonable > {code:java} > spark-sql> desc formatted bigdata_test.tt1; > col_name data_type comment c int NULL > # Detailed Table Information > Database bigdata_test > Table tt1 > Owner johnchen > Created Time Wed Sep 11 11:30:49 CST 2019 > Last Access Thu Jan 01 08:00:00 CST 1970 > Created By Spark 2.2 or prior > Type MANAGED > Provider hive > Table Properties [transient_lastDdlTime=1568871017, PART_LIMIT=1, > LEVEL=1, TTL=60] > Location hdfs://NS1/user/hive_admin/warehouse/bigdata_test.db/tt1 > Serde Library org.apache.hadoop.hive.ql.io.orc.OrcSerde > InputFormat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat > Storage Properties [serialization.format=1] > Partition Provider Catalog > Time taken: 0.041 seconds, Fetched 18 row(s){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31591) namePrefix could be null in Utils.createDirectory
[ https://issues.apache.org/jira/browse/SPARK-31591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31591. -- Resolution: Not A Problem See: https://github.com/apache/spark/pull/28385#issuecomment-620941771 > namePrefix could be null in Utils.createDirectory > - > > Key: SPARK-31591 > URL: https://issues.apache.org/jira/browse/SPARK-31591 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Minor > > In our production, we find that many shuffle files could be located in > /hadoop/2/yarn/local/usercache/b_carmel/appcache/application_1586487864336_4602/*null*-107d4e9c-d3c7-419e-9743-a21dc4eaeb3f/3a > The Util.createDirectory() uses a default parameter "spark" > {code} > def createDirectory(root: String, namePrefix: String = "spark"): File = { > {code} > But in some cases, the actual namePrefix is null. If the method is called > with null, then the default value would not be applied. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
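The "Not A Problem" resolution hinges on standard Scala semantics: a default parameter value is used only when the argument is omitted, not when an explicit `null` is passed, so callers forwarding a possibly-null prefix must substitute the default themselves. A standalone sketch (not the actual `Utils` code; the directory suffix is made up for illustration):

```scala
object DefaultParamDemo {
  // Simplified stand-in for Utils.createDirectory: just returns the path it would create.
  def createDirectory(root: String, namePrefix: String = "spark"): String =
    s"$root/$namePrefix-107d4e9c" // illustrative suffix only

  def main(args: Array[String]): Unit = {
    println(createDirectory("/tmp"))       // /tmp/spark-107d4e9c  -- default applied
    println(createDirectory("/tmp", null)) // /tmp/null-107d4e9c   -- default NOT applied
  }
}
```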
[jira] [Commented] (SPARK-31595) Spark sql cli should allow unescaped quote mark in quoted string
[ https://issues.apache.org/jira/browse/SPARK-31595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094980#comment-17094980 ] Adrian Wang commented on SPARK-31595: - [~Ankitraj] Thanks, I have already created a pull request on this. > Spark sql cli should allow unescaped quote mark in quoted string > > > Key: SPARK-31595 > URL: https://issues.apache.org/jira/browse/SPARK-31595 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Adrian Wang >Priority: Major > > spark-sql> select "'"; > spark-sql> select '"'; > In Spark parser if we pass a text of `select "'";`, there will be > ParserCancellationException, which will be handled by PredictionMode.LL. By > dropping `;` correctly we can avoid that retry. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31584) NullPointerException when parsing event log with InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-31584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-31584. Target Version/s: 3.0.0, 3.0.1, 3.1.0 (was: 3.0.1) Assignee: Baohe Zhang Resolution: Fixed The issue is resolved in https://github.com/apache/spark/pull/28378 > NullPointerException when parsing event log with InMemoryStore > -- > > Key: SPARK-31584 > URL: https://issues.apache.org/jira/browse/SPARK-31584 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: Baohe Zhang >Assignee: Baohe Zhang >Priority: Minor > Fix For: 3.0.1 > > Attachments: errorstack.txt > > > I compiled with the current branch-3.0 source and tested it in mac os. A > java.lang.NullPointerException will be thrown when below conditions are met: > # Using InMemoryStore as kvstore when parsing the event log file (e.g., when > spark.history.store.path is unset). > # At least one stage in this event log has task number greater than > spark.ui.retainedTasks (by default is 10). In this case, kvstore needs to > delete extra task records. > # The job has more than one stage, so parentToChildrenMap in > InMemoryStore.java will have more than one key. > The java.lang.NullPointerException is thrown in InMemoryStore.java :296. In > the method deleteParentIndex(). > {code:java} > private void deleteParentIndex(Object key) { > if (hasNaturalParentIndex) { > for (NaturalKeys v : parentToChildrenMap.values()) { > if (v.remove(asKey(key))) { > // `v` can be empty after removing the natural key and we can > remove it from > // `parentToChildrenMap`. However, `parentToChildrenMap` is a > ConcurrentMap and such > // checking and deleting can be slow. > // This method is to delete one object with certain key, let's > make it simple here. > break; > } > } > } > }{code} > In “if (v.remove(asKey(key)))”, if the key is not contained in v, > "v.remove(asKey(key))" will return null, and java will throw a > NullPointerException when executing "if (null)". > An exception stack trace is attached. > This issue can be fixed by updating if statement to > {code:java} > if (v.remove(asKey(key)) != null){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
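For illustration, a standalone Scala sketch of the Java behavior behind the NPE (the map shape mirrors `NaturalKeys` in `InMemoryStore.java`, but the key names here are made up): `ConcurrentHashMap.remove` returns the previous value, or `null` when the key is absent, and unboxing that `null` into a primitive `boolean` is what throws; comparing the result against `null`, as in the proposed fix, is safe.

```scala
import java.util.concurrent.ConcurrentHashMap

object RemoveNullDemo {
  def main(args: Array[String]): Unit = {
    // Keys mapped to java.lang.Boolean, as in InMemoryStore's NaturalKeys.
    val v = new ConcurrentHashMap[String, java.lang.Boolean]()
    v.put("task-1", java.lang.Boolean.TRUE)

    val removedPresent: java.lang.Boolean = v.remove("task-1") // Boolean.TRUE
    val removedMissing: java.lang.Boolean = v.remove("task-2") // null

    // `if (v.remove(asKey(key)))` in Java unboxes the result, so a null here
    // becomes a NullPointerException; the explicit null check does not.
    println(removedPresent != null) // true
    println(removedMissing != null) // false
  }
}
```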
[jira] [Resolved] (SPARK-31556) Document LIKE clause in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-31556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31556. -- Fix Version/s: 3.0.0 Assignee: Huaxin Gao Resolution: Fixed Resolved by https://issues.apache.org/jira/browse/SPARK-31556 > Document LIKE clause in SQL Reference > - > > Key: SPARK-31556 > URL: https://issues.apache.org/jira/browse/SPARK-31556 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > Document LIKE clause in SQL Reference. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094932#comment-17094932 ] Lorenzo Pisani commented on SPARK-26365: I'm also seeing this behavior specifically with a "cluster" deploy mode. The driver pod is failing properly but the pod that executed spark-submit is exiting with a status code of 0. This makes it very difficult to monitor the job and detect failures. > spark-submit for k8s cluster doesn't propagate exit code > > > Key: SPARK-26365 > URL: https://issues.apache.org/jira/browse/SPARK-26365 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Submit >Affects Versions: 2.3.2, 2.4.0 >Reporter: Oscar Bonilla >Priority: Minor > > When launching apps using spark-submit in a kubernetes cluster, if the Spark > applications fails (returns exit code = 1 for example), spark-submit will > still exit gracefully and return exit code = 0. > This is problematic, since there's no way to know if there's been a problem > with the Spark application. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction
[ https://issues.apache.org/jira/browse/SPARK-31599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094925#comment-17094925 ] Jungtaek Lim commented on SPARK-31599: -- Oh sorry I should guide to user@ mailing list, my bad. Please have your time to go through the page http://spark.apache.org/community.html > Reading from S3 (Structured Streaming Bucket) Fails after Compaction > > > Key: SPARK-31599 > URL: https://issues.apache.org/jira/browse/SPARK-31599 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.4.5 >Reporter: Felix Kizhakkel Jose >Priority: Major > > I have a S3 bucket which has data streamed (Parquet format) to it by Spark > Structured Streaming Framework from Kafka. Periodically I try to run > compaction on this bucket (a separate Spark Job), and on successful > compaction delete the non compacted (parquet) files. After which I am getting > following error on Spark jobs which read from that bucket: > *Caused by: java.io.FileNotFoundException: No such file or directory: > s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet* > How do we run *_c_ompaction on Structured Streaming S3 bucket_s*. Also I need > to delete the un-compacted files after successful compaction to save space. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction
[ https://issues.apache.org/jira/browse/SPARK-31599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094917#comment-17094917 ] Felix Kizhakkel Jose commented on SPARK-31599: -- How do I do that? > Reading from S3 (Structured Streaming Bucket) Fails after Compaction > > > Key: SPARK-31599 > URL: https://issues.apache.org/jira/browse/SPARK-31599 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.4.5 >Reporter: Felix Kizhakkel Jose >Priority: Major > > I have a S3 bucket which has data streamed (Parquet format) to it by Spark > Structured Streaming Framework from Kafka. Periodically I try to run > compaction on this bucket (a separate Spark Job), and on successful > compaction delete the non compacted (parquet) files. After which I am getting > following error on Spark jobs which read from that bucket: > *Caused by: java.io.FileNotFoundException: No such file or directory: > s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet* > How do we run *_c_ompaction on Structured Streaming S3 bucket_s*. Also I need > to delete the un-compacted files after successful compaction to save space. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction
[ https://issues.apache.org/jira/browse/SPARK-31599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094915#comment-17094915 ] Jungtaek Lim commented on SPARK-31599: -- Please post a mail thread on dev@ mailing list. This looks to be a question instead of actual bug report. > Reading from S3 (Structured Streaming Bucket) Fails after Compaction > > > Key: SPARK-31599 > URL: https://issues.apache.org/jira/browse/SPARK-31599 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.4.5 >Reporter: Felix Kizhakkel Jose >Priority: Major > > I have a S3 bucket which has data streamed (Parquet format) to it by Spark > Structured Streaming Framework from Kafka. Periodically I try to run > compaction on this bucket (a separate Spark Job), and on successful > compaction delete the non compacted (parquet) files. After which I am getting > following error on Spark jobs which read from that bucket: > *Caused by: java.io.FileNotFoundException: No such file or directory: > s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet* > How do we run *_c_ompaction on Structured Streaming S3 bucket_s*. Also I need > to delete the un-compacted files after successful compaction to save space. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8333) Spark failed to delete temp directory created by HiveContext
[ https://issues.apache.org/jira/browse/SPARK-8333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094893#comment-17094893 ] Sunil Kumar Chakrapani commented on SPARK-8333: --- Any plans to fix this issue for Spark 2.4.5, issue still exists on Windows 10 20/04/26 12:39:12 ERROR ShutdownHookManager: Exception while deleting Spark temp dir: C:\Users\\AppData\Local\Temp\2\spark-1583d46e-c31f-444a-91f1-572c0726b6b1 java.io.IOException: Failed to delete: C:\Users\\AppData\Local\Temp\2\spark-1583d46e-c31f-444a-91f1-572c0726b6b1\userFiles-b001454b-80e1-4414-896b-6aee986174e5\test_jar_2.11-0.1.jar at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingJavaIO(JavaUtils.java:144) at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:118) at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingJavaIO(JavaUtils.java:128) at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:118) at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingJavaIO(JavaUtils.java:128) at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:118) at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:91) at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1062) at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:65) at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:62) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:62) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) > Spark failed to delete temp directory created by HiveContext > > > Key: SPARK-8333 > URL: https://issues.apache.org/jira/browse/SPARK-8333 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0 > Environment: Windows7 64bit >Reporter: sheng >Priority: Minor > Labels: Hive, bulk-closed, metastore, sparksql > Attachments: test.tar > > > Spark 1.4.0 failed to stop SparkContext. > {code:title=LocalHiveTest.scala|borderStyle=solid} > val sc = new SparkContext("local", "local-hive-test", new SparkConf()) > val hc = Utils.createHiveContext(sc) > ... 
// execute some HiveQL statements > sc.stop() > {code} > sc.stop() failed to execute, it threw the following exception: > {quote} > 15/06/13 03:19:06 INFO Utils: Shutdown hook called > 15/06/13 03:19:06 INFO Utils: Deleting directory > C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea > 15/06/13 03:19:06 ERROR Utils: Exception while deleting Spark temp dir: > C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea > java.io.IOException: Failed to delete: > C:\Users\moshangcheng\AppData\Local\Temp\spark-d6d3c30e-512e-4693-a436-485e2af4baea > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:963) > at > org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:204) > at > org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:201) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at org.apache.spark.util.Utils$$anonfun$1.apply$mcV$sp(Utils.scala:201) > at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2292) > at > org.apache.spark.util.SparkShutdownHookManager$$anon
[jira] [Updated] (SPARK-31549) Pyspark SparkContext.cancelJobGroup do not work correctly
[ https://issues.apache.org/jira/browse/SPARK-31549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31549: -- Target Version/s: 3.0.0 > Pyspark SparkContext.cancelJobGroup do not work correctly > - > > Key: SPARK-31549 > URL: https://issues.apache.org/jira/browse/SPARK-31549 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0 >Reporter: Weichen Xu >Priority: Critical > > Pyspark SparkContext.cancelJobGroup does not work correctly. This issue has > existed for a long time. It happens because a PySpark thread is not pinned to > a JVM thread when invoking Java-side methods, so every PySpark API that > relies on Java thread-local variables does not work correctly (including > `sc.setLocalProperty`, `sc.cancelJobGroup`, `sc.setJobDescription` and so on). > This is a serious issue. There is an experimental PySpark 'PIN_THREAD' mode > added in Spark 3.0 which addresses it, but the 'PIN_THREAD' mode has two > problems: > * It is disabled by default; an additional environment variable must be set > to enable it. > * There is a memory leak issue which hasn't been addressed. > A number of projects such as hyperopt-spark and spark-joblib rely on the > `sc.cancelJobGroup` API (they use it to stop running jobs in their code), so it > is critical to address this issue, and we hope it works under the default > PySpark mode. An optional approach is implementing methods like > `rdd.setGroupAndCollect`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
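To make the expected semantics concrete, a Scala-side sketch of what `cancelJobGroup` is supposed to do (the job size, sleeps, and group id are arbitrary; the point is that the job group is a thread-local property, which is exactly what unpinned PySpark threads lose):

```scala
import org.apache.spark.sql.SparkSession

object CancelJobGroupDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("cancel-demo").getOrCreate()
    val sc = spark.sparkContext

    val worker = new Thread(() => {
      // The group id is attached to the *calling thread*, so jobs submitted from
      // this thread are tagged "long-jobs" and can be cancelled as a group.
      sc.setJobGroup("long-jobs", "cancellable work", interruptOnCancel = true)
      try sc.parallelize(1 to 1000000).map { i => Thread.sleep(1); i }.count()
      catch { case e: Exception => println(s"job cancelled: ${e.getMessage}") }
    })
    worker.start()
    Thread.sleep(2000)
    sc.cancelJobGroup("long-jobs") // cancels only jobs tagged with "long-jobs"
    worker.join()
    spark.stop()
  }
}
```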
[jira] [Commented] (SPARK-31595) Spark sql cli should allow unescaped quote mark in quoted string
[ https://issues.apache.org/jira/browse/SPARK-31595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094830#comment-17094830 ] Ankit Raj Boudh commented on SPARK-31595: - [~adrian-wang], can i start working on this issue ? > Spark sql cli should allow unescaped quote mark in quoted string > > > Key: SPARK-31595 > URL: https://issues.apache.org/jira/browse/SPARK-31595 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Adrian Wang >Priority: Major > > spark-sql> select "'"; > spark-sql> select '"'; > In Spark parser if we pass a text of `select "'";`, there will be > ParserCancellationException, which will be handled by PredictionMode.LL. By > dropping `;` correctly we can avoid that retry. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31591) namePrefix could be null in Utils.createDirectory
[ https://issues.apache.org/jira/browse/SPARK-31591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094829#comment-17094829 ] Ankit Raj Boudh commented on SPARK-31591: - [~cltlfcjin], It's ok Thank you for raising PR :) > namePrefix could be null in Utils.createDirectory > - > > Key: SPARK-31591 > URL: https://issues.apache.org/jira/browse/SPARK-31591 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Minor > > In our production, we find that many shuffle files could be located in > /hadoop/2/yarn/local/usercache/b_carmel/appcache/application_1586487864336_4602/*null*-107d4e9c-d3c7-419e-9743-a21dc4eaeb3f/3a > The Util.createDirectory() uses a default parameter "spark" > {code} > def createDirectory(root: String, namePrefix: String = "spark"): File = { > {code} > But in some cases, the actual namePrefix is null. If the method is called > with null, then the default value would not be applied. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction
[ https://issues.apache.org/jira/browse/SPARK-31599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Kizhakkel Jose updated SPARK-31599: - Description: I have a S3 bucket which has data streamed (Parquet format) to it by Spark Structured Streaming Framework from Kafka. Periodically I try to run compaction on this bucket (a separate Spark Job), and on successful compaction delete the non compacted (parquet) files. After which I am getting following error on Spark jobs which read from that bucket: *Caused by: java.io.FileNotFoundException: No such file or directory: s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet* How do we run *_c_ompaction on Structured Streaming S3 bucket_s*. Also I need to delete the un-compacted files after successful compaction to save space. was: I have a S3 bucket which has data streamed (Parquet format) to it by Spark Structured Streaming Framework from Kafka. Periodically I try to run compaction on this bucket (a separate Spark Job), and on successful compaction delete the non compacted (parquet) files. After which I am getting following error on Spark jobs which read from that bucket: *Caused by: java.io.FileNotFoundException: No such file or directory: s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet* How do we run *_c__ompaction on Structured Streaming S3 bucket_s*. Also I need to delete the un-compacted files after successful compaction to save space. > Reading from S3 (Structured Streaming Bucket) Fails after Compaction > > > Key: SPARK-31599 > URL: https://issues.apache.org/jira/browse/SPARK-31599 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.4.5 >Reporter: Felix Kizhakkel Jose >Priority: Major > > I have a S3 bucket which has data streamed (Parquet format) to it by Spark > Structured Streaming Framework from Kafka. Periodically I try to run > compaction on this bucket (a separate Spark Job), and on successful > compaction delete the non compacted (parquet) files. After which I am getting > following error on Spark jobs which read from that bucket: > *Caused by: java.io.FileNotFoundException: No such file or directory: > s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet* > How do we run *_c_ompaction on Structured Streaming S3 bucket_s*. Also I need > to delete the un-compacted files after successful compaction to save space. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction
Felix Kizhakkel Jose created SPARK-31599: Summary: Reading from S3 (Structured Streaming Bucket) Fails after Compaction Key: SPARK-31599 URL: https://issues.apache.org/jira/browse/SPARK-31599 Project: Spark Issue Type: New Feature Components: SQL, Structured Streaming Affects Versions: 2.4.5 Reporter: Felix Kizhakkel Jose I have a S3 bucket which has data streamed (Parquet format) to it by Spark Structured Streaming Framework from Kafka. Periodically I try to run compaction on this bucket (a separate Spark Job), and on successful compaction delete the non compacted (parquet) files. After which I am getting following error on Spark jobs which read from that bucket: *Caused by: java.io.FileNotFoundException: No such file or directory: s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet* How do we run *_c__ompaction on Structured Streaming S3 bucket_s*. Also I need to delete the un-compacted files after successful compaction to save space. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30741) The data returned from SAS using JDBC reader contains column label
[ https://issues.apache.org/jira/browse/SPARK-30741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Liu updated SPARK-30741: - Attachment: ExamplesFromSASSupport.png > The data returned from SAS using JDBC reader contains column label > -- > > Key: SPARK-30741 > URL: https://issues.apache.org/jira/browse/SPARK-30741 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.1.1, 2.3.4, 2.4.5 >Reporter: Gary Liu >Priority: Major > Attachments: ExamplesFromSASSupport.png, ReplyFromSASSupport.png, > SparkBug.png > > > When read SAS data using JDBC with SAS SHARE driver, the returned data > contains column labels, rather data. > According to testing result from SAS Support, the results are correct using > Java. So they believe it is due to spark reading. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30741) The data returned from SAS using JDBC reader contains column label
[ https://issues.apache.org/jira/browse/SPARK-30741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Liu updated SPARK-30741: - Attachment: ReplyFromSASSupport.png > The data returned from SAS using JDBC reader contains column label > -- > > Key: SPARK-30741 > URL: https://issues.apache.org/jira/browse/SPARK-30741 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.1.1, 2.3.4, 2.4.5 >Reporter: Gary Liu >Priority: Major > Attachments: ReplyFromSASSupport.png, SparkBug.png > > > When read SAS data using JDBC with SAS SHARE driver, the returned data > contains column labels, rather data. > According to testing result from SAS Support, the results are correct using > Java. So they believe it is due to spark reading. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-30741) The data returned from SAS using JDBC reader contains column label
[ https://issues.apache.org/jira/browse/SPARK-30741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Liu reopened SPARK-30741: -- *Problem:* The Spark JDBC reader reads SAS data incorrectly and returns the column names as data values. *Possible Reason:* After discussing with the SAS Support team, they think the Spark JDBC reader is not compliant with the [JDBC spec|https://docs.oracle.com/javase/7/docs/api/java/sql/DatabaseMetaData.html#getIdentifierQuoteString()], where getIdentifierQuoteString() should be called to get the quote string for SQL identifiers used by the source system. This function in the SAS JDBC driver returns a blank string. The SAS Support team thinks Spark does not call this function, but uses the default double quote '"' to generate the query, so the query 'select var_a from table_a' is passed as 'select "var_a" from table_a', and the "var_a" string is populated as the data values. > The data returned from SAS using JDBC reader contains column label > -- > > Key: SPARK-30741 > URL: https://issues.apache.org/jira/browse/SPARK-30741 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.1.1, 2.3.4, 2.4.5 >Reporter: Gary Liu >Priority: Major > Attachments: SparkBug.png > > > When reading SAS data using JDBC with the SAS SHARE driver, the returned data > contains column labels rather than data. > According to testing results from SAS Support, the results are correct when using > Java, so they believe the issue is in Spark's reading. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
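As a rough illustration of the JDBC contract referenced above, the sketch below asks the driver for its identifier quote string via java.sql.DatabaseMetaData before building a query; the connection URL, table and column names are placeholders, not taken from the ticket:

{code:scala}
// Hedged sketch: URL, table and column names are made up for illustration.
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:sharenet://sas-host:8551/mylib") // placeholder URL
try {
  // Per the JDBC spec, this returns the driver's quoting string for SQL identifiers.
  // The SAS SHARE driver reportedly returns a blank string, i.e. "do not quote".
  val quote = conn.getMetaData.getIdentifierQuoteString.trim
  val column = "var_a"
  val quoted = if (quote.isEmpty) column else s"$quote$column$quote"
  // With a blank quote string this stays `select var_a from table_a`, instead of
  // `select "var_a" from table_a`, which SAS would treat as a string literal.
  println(s"select $quoted from table_a")
} finally {
  conn.close()
}
{code}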
[jira] [Resolved] (SPARK-31339) Changed PipelineModel(...) to self.cls(...) in pyspark.ml.pipeline.PipelineModelReader.load()
[ https://issues.apache.org/jira/browse/SPARK-31339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31339. -- Resolution: Not A Problem > Changed PipelineModel(...) to self.cls(...) in > pyspark.ml.pipeline.PipelineModelReader.load() > - > > Key: SPARK-31339 > URL: https://issues.apache.org/jira/browse/SPARK-31339 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.5 >Reporter: Suraj >Priority: Minor > Labels: pull-request-available > Original Estimate: 0h > Remaining Estimate: 0h > > PR: [https://github.com/apache/spark/pull/28110] > * What changes were proposed in this pull request? > pypsark.ml.pipeline.py line 245: Change PipelineModel(...) to self.cls(...) > * Why are the changes needed? > This change fixes the loading of class (which inherits from PipelineModel > class) from file. > E.g. Current issue: > {code:java} > CustomPipelineModel(PipelineModel): > def _transform(self, df): > ... > CustomPipelineModel.save('path/to/file') # works > CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() > instead of CustomPipelineModel() > CustomPipelineModel.transform() # wrong: results in calling > PipelineModel.transform() instead of CustomPipelineModel.transform(){code} > * Does this introduce any user-facing change? > No. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31165) Multiple wrong references in Dockerfile for k8s
[ https://issues.apache.org/jira/browse/SPARK-31165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31165. -- Resolution: Not A Problem > Multiple wrong references in Dockerfile for k8s > > > Key: SPARK-31165 > URL: https://issues.apache.org/jira/browse/SPARK-31165 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5, 3.0.0 >Reporter: Nikolay Dimolarov >Priority: Minor > > I am currently trying to follow the k8s instructions for Spark: > [https://spark.apache.org/docs/latest/running-on-kubernetes.html] and when I > clone apache/spark on GitHub on the master branch I saw multiple wrong folder > references after trying to build my Docker image: > > *Issue 1: The comments in the Dockerfile reference the wrong folder for the > Dockerfile:* > {code:java} > # If this docker file is being used in the context of building your images > from a Spark # distribution, the docker build command should be invoked from > the top level directory # of the Spark distribution. E.g.: # docker build -t > spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .{code} > Well that docker build command simply won't run. I only got the following to > run: > {code:java} > docker build -t spark:latest -f > resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile . > {code} > which is the actual path to the Dockerfile. > > *Issue 2: jars folder does not exist* > After I read the tutorial I of course build spark first as per the > instructions with: > {code:java} > ./build/mvn -Pkubernetes -DskipTests clean package{code} > Nonetheless, in the Dockerfile I get this error when building: > {code:java} > Step 5/18 : COPY jars /opt/spark/jars > COPY failed: stat /var/lib/docker/tmp/docker-builder402673637/jars: no such > file or directory{code} > for which I may have found a similar issue here: > [https://stackoverflow.com/questions/52451538/spark-for-kubernetes-test-on-mac] > I am new to Spark but I assume that this jars folder - if the build step > would actually make it and I ran the maven build of the master branch > successfully with the command I mentioned above - would exist in the root > folder of the project. Turns out it's here: > spark/assembly/target/scala-2.12/jars > > *Issue 3: missing entrypoint.sh and decom.sh due to wrong reference* > While Issue 2 remains unresolved as I can't wrap my head around the missing > jars folder (bin and sbin got copied successfully after I made a dummy jars > folder) I then got stuck on these 2 steps: > {code:java} > COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/ COPY > kubernetes/dockerfiles/spark/decom.sh /opt/{code} > > with: > > {code:java} > Step 8/18 : COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/ > COPY failed: stat > /var/lib/docker/tmp/docker-builder638219776/kubernetes/dockerfiles/spark/entrypoint.sh: > no such file or directory{code} > > which makes sense since the path should actually be: > > resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh > resource-managers/kubernetes/docker/src/main/dockerfiles/spark/decom.sh > > *Issue 4: /tests/ has been renamed in /integration-tests/* > **And the location is wrong. > {code:java} > COPY kubernetes/tests /opt/spark/tests > {code} > has to be changed to: > {code:java} > COPY resource-managers/kubernetes/integration-tests /opt/spark/tests{code} > *Remark* > > I only created one issue since this seems like somebody cleaned up the repo > and forgot to change these. 
Am I missing something here? If I am, I apologise > in advance since I am new to the Spark project. I also saw that some of these > references were handled through vars in previous branches: > [https://github.com/apache/spark/blob/branch-2.4/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile] > (e.g. 2.4) but that also does not run out of the box. > > I am also really not sure about the affected versions since that was not > transparent enough for me on GH - feel free to edit that field :) > > Thanks in advance! > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31149) PySpark job not killing Spark Daemon processes after the executor is killed due to OOM
[ https://issues.apache.org/jira/browse/SPARK-31149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31149. -- Resolution: Won't Fix > PySpark job not killing Spark Daemon processes after the executor is killed > due to OOM > -- > > Key: SPARK-31149 > URL: https://issues.apache.org/jira/browse/SPARK-31149 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5 >Reporter: Arsenii Venherak >Priority: Major > > {code:java} > 2020-03-10 10:15:00,257 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Memory usage of ProcessTree 327523 for container-id container_e25_1583 > 485217113_0347_01_42: 1.9 GB of 2 GB physical memory used; 39.5 GB of 4.2 > GB virtual memory used > 2020-03-10 10:15:05,135 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Memory usage of ProcessTree 327523 for container-id container_e25_1583 > 485217113_0347_01_42: 3.6 GB of 2 GB physical memory used; 41.1 GB of 4.2 > GB virtual memory used > 2020-03-10 10:15:05,136 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Process tree for container: container_e25_1583485217113_0347_01_42 > has processes older than 1 iteration running over the configured limit. > Limit=2147483648, current usage = 3915513856 > 2020-03-10 10:15:05,136 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Container [pid=327523,containerID=container_e25_1583485217113_0347_01_ > 42] is running beyond physical memory limits. Current usage: 3.6 GB of 2 > GB physical memory used; 41.1 GB of 4.2 GB virtual memory used. Killing > container. > Dump of the process-tree for container_e25_1583485217113_0347_01_42 : > |- 327535 327523 327523 327523 (java) 1611 111 4044427264 172306 > /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64/jre/bin/java > -server -Xmx1024m -Djava.io.tmpdir=/data/s > cratch/yarn/usercache/u689299/appcache/application_1583485217113_0347/container_e25_1583485217113_0347_01_42/tmp > -Dspark.ssl.trustStore=/opt/mapr/conf/ssl_truststore -Dspark.authenticat > e.enableSaslEncryption=true -Dspark.driver.port=40653 > -Dspark.network.timeout=7200 -Dspark.ssl.keyStore=/opt/mapr/conf/ssl_keystore > -Dspark.network.sasl.serverAlwaysEncrypt=true -Dspark.ssl > .enabled=true -Dspark.ssl.protocol=TLSv1.2 -Dspark.ssl.fs.enabled=true > -Dspark.ssl.ui.enabled=false -Dspark.authenticate=true > -Dspark.yarn.app.container.log.dir=/opt/mapr/hadoop/hadoop-2.7. > 0/logs/userlogs/application_1583485217113_0347/container_e25_1583485217113_0347_01_42 > -XX:OnOutOfMemoryError=kill %p > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://coarsegrainedschedu...@bd02slse0201.wellsfargo.com:40653 > --executor-id 40 --hostname bd02slsc0519.wellsfargo.com --cores 1 --app-id > application_1583485217113_0347 --user-class-path > file:/data/scratch/yarn/usercache/u689299/appcache/application_1583485217113_0347/container_e25_1583485217113_0347_01_42/__app__.jar > {code} > > > After that, there are lots of pyspark.daemon process left. > eg: > /apps/anaconda3-5.3.0/bin/python -m pyspark.daemon -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30133) Support DELETE Jar and DELETE File functionality in spark
[ https://issues.apache.org/jira/browse/SPARK-30133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30133. -- Resolution: Won't Fix > Support DELETE Jar and DELETE File functionality in spark > - > > Key: SPARK-30133 > URL: https://issues.apache.org/jira/browse/SPARK-30133 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Sandeep Katta >Priority: Major > Labels: Umbrella > > SPARK should support delete jar feature > This feature aims at solving below use case. > Currently in spark add jar API supports to add the jar to executor and Driver > ClassPath at runtime, if there is any change in this jar definition there is > no way user can update the jar to executor and Driver classPath. User needs > to restart the application to solve this problem which is costly operation. > After this JIRA fix user can use delete jar API to remove the jar from Driver > and Executor ClassPath without the need of restarting the any spark > application. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
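For context on the runtime-jar workflow this (ultimately rejected) proposal targets, here is a small sketch of the existing ADD JAR path; the jar path is a placeholder, and the DELETE JAR counterpart mentioned in the comments does not exist in Spark:

{code:scala}
// Hedged sketch: the jar path is a placeholder; only ADD JAR exists today.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("add-jar-demo").getOrCreate()

// Existing behaviour: ships the jar and appends it to the driver/executor classpath.
spark.sql("ADD JAR /tmp/my-udfs-v1.jar")

// There is no `DELETE JAR /tmp/my-udfs-v1.jar` command; picking up a changed jar
// currently requires restarting the application, which is the pain point this
// (Won't Fix) proposal describes.
{code}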
[jira] [Resolved] (SPARK-30135) Add documentation for DELETE JAR and DELETE File command
[ https://issues.apache.org/jira/browse/SPARK-30135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30135. -- Resolution: Won't Fix > Add documentation for DELETE JAR and DELETE File command > > > Key: SPARK-30135 > URL: https://issues.apache.org/jira/browse/SPARK-30135 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30134) DELETE JAR should remove from addedJars list and from classpath
[ https://issues.apache.org/jira/browse/SPARK-30134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30134. -- Resolution: Won't Fix > DELETE JAR should remove from addedJars list and from classpath > > > Key: SPARK-30134 > URL: https://issues.apache.org/jira/browse/SPARK-30134 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30137) Support DELETE file
[ https://issues.apache.org/jira/browse/SPARK-30137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30137. -- Resolution: Won't Fix > Support DELETE file > > > Key: SPARK-30137 > URL: https://issues.apache.org/jira/browse/SPARK-30137 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30136) DELETE JAR should also remove the jar from executor classPath
[ https://issues.apache.org/jira/browse/SPARK-30136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30136. -- Resolution: Won't Fix > DELETE JAR should also remove the jar from executor classPath > - > > Key: SPARK-30136 > URL: https://issues.apache.org/jira/browse/SPARK-30136 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31598) LegacySimpleTimestampFormatter incorrectly interprets pre-Gregorian timestamps
Bruce Robbins created SPARK-31598: - Summary: LegacySimpleTimestampFormatter incorrectly interprets pre-Gregorian timestamps Key: SPARK-31598 URL: https://issues.apache.org/jira/browse/SPARK-31598 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Bruce Robbins As per discussion with [~maxgekk]: {{LegacySimpleTimestampFormatter#parse}} misinterprets pre-Gregorian timestamps: {noformat} scala> sql("set spark.sql.legacy.timeParserPolicy=LEGACY") res0: org.apache.spark.sql.DataFrame = [key: string, value: string] scala> val df1 = Seq("0002-01-01 00:00:00", "1000-01-01 00:00:00", "1800-01-01 00:00:00").toDF("expected") df1: org.apache.spark.sql.DataFrame = [expected: string] scala> val df2 = df1.select('expected, to_timestamp('expected, "-MM-dd HH:mm:ss").as("actual")) df2: org.apache.spark.sql.DataFrame = [expected: string, actual: timestamp] scala> df2.show(truncate=false) +---+---+ |expected |actual | +---+---+ |0002-01-01 00:00:00|0001-12-30 00:00:00| |1000-01-01 00:00:00|1000-01-06 00:00:00| |1800-01-01 00:00:00|1800-01-01 00:00:00| +---+---+ scala> {noformat} Legacy timestamp parsing with JSON and CSV files is correct, so apparently {{LegacyFastTimestampFormatter}} does not have this issue (need to double check). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31592) bufferPoolsBySize in HeapMemoryAllocator should be thread safe
[ https://issues.apache.org/jira/browse/SPARK-31592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094669#comment-17094669 ] Yunbo Fan commented on SPARK-31592: --- I checked my executor log again and I find the executor got NPE first {code} java.lang.NullPointerException at org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:58) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:302) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:96) at org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap:800) ... {code} And later got the NoSuchElementExceptionException. {code} java.util.NoSuchElementExceptionException at java.util.LinkedList.removeFirst(LinkedList.java:270) at java.util.LinkedList.remove(LinkedList.java:685) at org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:302) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:96) at org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap:800) ... {code} But I can't find out why NPE error here. Maybe a null WeakReference added? > bufferPoolsBySize in HeapMemoryAllocator should be thread safe > -- > > Key: SPARK-31592 > URL: https://issues.apache.org/jira/browse/SPARK-31592 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Yunbo Fan >Priority: Major > > Currently, bufferPoolsBySize in HeapMemoryAllocator uses a Map type whose > value type is LinkedList. > LinkedList is not thread safe and may hit the error below > {code:java} > java.util.NoSuchElementExceptionException > at java.util.LinkedList.removeFirst(LinkedList.java:270) > at java.util.LinkedList.remove(LinkedList.java:685) > at > org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57){code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
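A standalone illustration (not Spark's allocator code) of the race the stack traces above point at: concurrent removal from an unsynchronized java.util.LinkedList can throw NoSuchElementException or corrupt the list's links:

{code:scala}
// Hedged sketch only: demonstrates the LinkedList race, not HeapMemoryAllocator itself.
import java.util.LinkedList

val pool = new LinkedList[AnyRef]()
(1 to 10000).foreach(_ => pool.add(new Object))

val threads = (1 to 4).map { _ =>
  new Thread(() => {
    while (!pool.isEmpty) {
      // The isEmpty check and the removal are not atomic, so a competing thread can
      // empty the list in between, yielding java.util.NoSuchElementException, or two
      // threads can unlink the same node and leave the list in a corrupted state.
      pool.remove()
    }
  })
}
threads.foreach(_.start())
threads.foreach(_.join())
{code}

Guarding the per-size pools with a lock (or a concurrent structure) is the kind of change the ticket title suggests.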
[jira] [Updated] (SPARK-31592) bufferPoolsBySize in HeapMemoryAllocator should be thread safe
[ https://issues.apache.org/jira/browse/SPARK-31592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yunbo Fan updated SPARK-31592: -- Affects Version/s: (was: 2.4.5) 2.4.3 > bufferPoolsBySize in HeapMemoryAllocator should be thread safe > -- > > Key: SPARK-31592 > URL: https://issues.apache.org/jira/browse/SPARK-31592 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Yunbo Fan >Priority: Major > > Currently, bufferPoolsBySize in HeapMemoryAllocator uses a Map type whose > value type is LinkedList. > LinkedList is not thread safe and may hit the error below > {code:java} > java.util.NoSuchElementExceptionException > at java.util.LinkedList.removeFirst(LinkedList.java:270) > at java.util.LinkedList.remove(LinkedList.java:685) > at > org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57){code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29458) Document scalar functions usage in APIs in SQL getting started.
[ https://issues.apache.org/jira/browse/SPARK-29458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29458. -- Fix Version/s: 3.0.0 Assignee: Huaxin Gao Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28290 > Document scalar functions usage in APIs in SQL getting started. > --- > > Key: SPARK-29458 > URL: https://issues.apache.org/jira/browse/SPARK-29458 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Dilip Biswal >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31583) grouping_id calculation should be improved
[ https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094651#comment-17094651 ] Costas Piliotis edited comment on SPARK-31583 at 4/28/20, 4:16 PM: --- [~maropu] I'm trying to avoid referencing the SPARK-21858 which already addresses the flipped bits. Specifically this is about how spark decides where to allocate the grouping_id based on the ordinal position in the grouping sets rather than the ordinal position in the select clause. Does that make sense? So if I have SELECT a,b,c,d FROM... GROUPING SETS ( (a,b,d), (a,b,c) ) the grouping_id bits would be determined as cdba the or instead of dcba.I believe if we look at most RDBMS that has grouping sets identified, my only suggestion is that it would be more predictable if the bit order in the grouping_id were determined by the ordinal position in the select. The flipped bits, is a separate ticket and I do believe the implementation should be predictably the same as other implementation in established RDBMS SQL implementations where 1=included, 0=excluded, but that matter is closed to discussion. was (Author: cpiliotis): [~maropu] I'm trying to avoid referencing the SPARK-21858 which already addresses the flipped bits. Specifically this is about how spark decides where to allocate the grouping_id based on the ordinal position in the grouping sets rather than the ordinal position in the select clause. Does that make sense? So if I have SELECT a,b,c,d FROM... GROUPING SETS ( (a,b,d), (a,b,c) ) the grouping_id would be abdc instead of abcd.I believe if we look at most RDBMS that has grouping sets identified, my only suggestion is that it would be more predictable if the bit order in the grouping_id were determined by the ordinal position in the select. > grouping_id calculation should be improved > -- > > Key: SPARK-31583 > URL: https://issues.apache.org/jira/browse/SPARK-31583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Costas Piliotis >Priority: Minor > > Unrelated to SPARK-21858 which identifies that grouping_id is determined by > exclusion from a grouping_set rather than inclusion, when performing complex > grouping_sets that are not in the order of the base select statement, > flipping the bit in the grouping_id seems to be happen when the grouping set > is identified rather than when the columns are selected in the sql. I will > of course use the exclusion strategy identified in SPARK-21858 as the > baseline for this. > > {code:scala} > import spark.implicits._ > val df= Seq( > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d") > ).toDF("a","b","c","d").createOrReplaceTempView("abc") > {code} > expected to have these references in the grouping_id: > d=1 > c=2 > b=4 > a=8 > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > This returns: > {noformat} > ++++++---+---+ > |a |b |c |d |count(1)|gid|gid_bin| > ++++++---+---+ > |a |null|c |null|4 |6 |110| > |null|null|null|null|4 |15 | | > |a |null|null|d |4 |5 |101| > |a |b |null|d |4 |1 |1 | > ++++++---+---+ > {noformat} > > In other words, I would have expected the excluded values one way but I > received them excluded in the order they were first seen in the specified > grouping sets. > a,b,d included = excldes c = 2; expected gid=2. 
received gid=1 > a,d included = excludes b=4, c=2 expected gid=6, received gid=5 > The grouping_id that actually is expected is (a,b,d,c) > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, > bin(grouping_id(a,b,d,c)) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > columns forming groupingid seem to be creatred as the grouping sets are > identified rather than ordinal position in parent query. > I'd like to at least point out that grouping_id is documented in many other > rdbms and I believe the spark project should use a policy of flipping the > bits so 1=inclusion; 0=exclusion in the grouping set. > However many rdms that do have the feature of a grouping_id do implement it > by the ordinal position recognized as fields in the select clause, rather > than allocating them as they are observed in the grouping sets. -- This message was sent by Atlassian Jira (v8.3.4#803005) ---
[jira] [Commented] (SPARK-31583) grouping_id calculation should be improved
[ https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094651#comment-17094651 ] Costas Piliotis commented on SPARK-31583: - [~maropu] I'm trying to avoid referencing the SPARK-21858 which already addresses the flipped bits. Specifically this is about how spark decides where to allocate the grouping_id based on the ordinal position in the grouping sets rather than the ordinal position in the select clause. Does that make sense? So if I have SELECT a,b,c,d FROM... GROUPING SETS ( (a,b,d), (a,b,c) ) the grouping_id would be abdc instead of abcd.I believe if we look at most RDBMS that has grouping sets identified, my only suggestion is that it would be more predictable if the bit order in the grouping_id were determined by the ordinal position in the select. > grouping_id calculation should be improved > -- > > Key: SPARK-31583 > URL: https://issues.apache.org/jira/browse/SPARK-31583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Costas Piliotis >Priority: Minor > > Unrelated to SPARK-21858 which identifies that grouping_id is determined by > exclusion from a grouping_set rather than inclusion, when performing complex > grouping_sets that are not in the order of the base select statement, > flipping the bit in the grouping_id seems to be happen when the grouping set > is identified rather than when the columns are selected in the sql. I will > of course use the exclusion strategy identified in SPARK-21858 as the > baseline for this. > > {code:scala} > import spark.implicits._ > val df= Seq( > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d") > ).toDF("a","b","c","d").createOrReplaceTempView("abc") > {code} > expected to have these references in the grouping_id: > d=1 > c=2 > b=4 > a=8 > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > This returns: > {noformat} > ++++++---+---+ > |a |b |c |d |count(1)|gid|gid_bin| > ++++++---+---+ > |a |null|c |null|4 |6 |110| > |null|null|null|null|4 |15 | | > |a |null|null|d |4 |5 |101| > |a |b |null|d |4 |1 |1 | > ++++++---+---+ > {noformat} > > In other words, I would have expected the excluded values one way but I > received them excluded in the order they were first seen in the specified > grouping sets. > a,b,d included = excldes c = 2; expected gid=2. received gid=1 > a,d included = excludes b=4, c=2 expected gid=6, received gid=5 > The grouping_id that actually is expected is (a,b,d,c) > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, > bin(grouping_id(a,b,d,c)) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > columns forming groupingid seem to be creatred as the grouping sets are > identified rather than ordinal position in parent query. > I'd like to at least point out that grouping_id is documented in many other > rdbms and I believe the spark project should use a policy of flipping the > bits so 1=inclusion; 0=exclusion in the grouping set. > However many rdms that do have the feature of a grouping_id do implement it > by the ordinal position recognized as fields in the select clause, rather > than allocating them as they are observed in the grouping sets. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
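To make the bit arithmetic in the report above concrete, here is a standalone sketch (not Spark internals) that assigns bits by select-list position (a=8, b=4, c=2, d=1) and sets a bit when a column is excluded from the grouping set, reproducing the gid values the reporter expects:

{code:scala}
// Hedged sketch of the expected numbering only; Spark's actual assignment differs,
// which is the subject of the ticket.
val selectOrder = Seq("a", "b", "c", "d")
val bitFor = selectOrder.zipWithIndex.map { case (col, i) =>
  col -> (1 << (selectOrder.size - 1 - i))
}.toMap // Map(a -> 8, b -> 4, c -> 2, d -> 1)

def expectedGid(groupingSet: Set[String]): Int =
  selectOrder.filterNot(groupingSet.contains).map(bitFor).sum

println(expectedGid(Set("a", "b", "d"))) // 2  (only c is excluded)
println(expectedGid(Set("a", "d")))      // 6  (b and c are excluded)
println(expectedGid(Set.empty[String]))  // 15 (grand total: everything excluded)
{code}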
[jira] [Updated] (SPARK-31519) Cast in having aggregate expressions returns the wrong result
[ https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31519: -- Labels: correctness (was: ) > Cast in having aggregate expressions returns the wrong result > - > > Key: SPARK-31519 > URL: https://issues.apache.org/jira/browse/SPARK-31519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > Cast in having aggregate expressions returns the wrong result. > See the below tests: > {code:java} > scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)") > res0: org.apache.spark.sql.DataFrame = [] > scala> val query = """ > | select sum(a) as b, '2020-01-01' as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+--+ > | b| fake| > +---+--+ > | 2|2020-01-01| > +---+--+ > scala> val query = """ > | select sum(a) as b, cast('2020-01-01' as date) as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---++ > | b|fake| > +---++ > +---++ > {code} > The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING > query, and Spark has a special analyzer rule ResolveAggregateFunctions to > resolve the aggregate functions and grouping columns in the Filter operator. > > It works for simple cases in a very tricky way as it relies on rule execution > order: > 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes > inside aggregate functions, but the function itself is still unresolved as > it's an UnresolvedFunction. This stops resolving the Filter operator as the > child Aggrege operator is still unresolved. > 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggrege > operator resolved. > 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child > is a resolved Aggregate. This rule can correctly resolve the grouping columns. > > In the example query, I put a CAST, which needs to be resolved by rule > ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step > 3 as the Aggregate operator is unresolved at that time. Then the analyzer > starts next round and the Filter operator is resolved by ResolveReferences, > which wrongly resolves the grouping columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31534) Text for tooltip should be escaped
[ https://issues.apache.org/jira/browse/SPARK-31534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31534: -- Fix Version/s: 3.0.0 > Text for tooltip should be escaped > -- > > Key: SPARK-31534 > URL: https://issues.apache.org/jira/browse/SPARK-31534 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.0.0, 3.1.0 > > > Timeline View for application and job, and DAG Viz for job show tooltip but > its text are not escaped for HTML so they cannot be shown properly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29048) Query optimizer slow when using Column.isInCollection() with a large size collection
[ https://issues.apache.org/jira/browse/SPARK-29048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094627#comment-17094627 ] Dongjoon Hyun commented on SPARK-29048: --- This is reverted via https://github.com/apache/spark/commit/b7cabc80e6df523f0377b651fdbdc2a669c11550 > Query optimizer slow when using Column.isInCollection() with a large size > collection > > > Key: SPARK-29048 > URL: https://issues.apache.org/jira/browse/SPARK-29048 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Weichen Xu >Priority: Major > > Query optimizer slow when using Column.isInCollection() with a large size > collection. > The query optimizer takes a long time to do its thing and on the UI all I see > is "Running commands". This can take from 10s of minutes to 11 hours > depending on how many values there are. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29048) Query optimizer slow when using Column.isInCollection() with a large size collection
[ https://issues.apache.org/jira/browse/SPARK-29048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29048: -- Fix Version/s: (was: 3.0.0) > Query optimizer slow when using Column.isInCollection() with a large size > collection > > > Key: SPARK-29048 > URL: https://issues.apache.org/jira/browse/SPARK-29048 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Weichen Xu >Priority: Major > > Query optimizer slow when using Column.isInCollection() with a large size > collection. > The query optimizer takes a long time to do its thing and on the UI all I see > is "Running commands". This can take from 10s of minutes to 11 hours > depending on how many values there are. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-29048) Query optimizer slow when using Column.isInCollection() with a large size collection
[ https://issues.apache.org/jira/browse/SPARK-29048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-29048: --- Assignee: (was: Weichen Xu) > Query optimizer slow when using Column.isInCollection() with a large size > collection > > > Key: SPARK-29048 > URL: https://issues.apache.org/jira/browse/SPARK-29048 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Query optimizer slow when using Column.isInCollection() with a large size > collection. > The query optimizer takes a long time to do its thing and on the UI all I see > is "Running commands". This can take from 10s of minutes to 11 hours > depending on how many values there are. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31404) file source backward compatibility after calendar switch
[ https://issues.apache.org/jira/browse/SPARK-31404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31404: Summary: file source backward compatibility after calendar switch (was: file source backward compatibility issues after switching to Proleptic Gregorian calendar) > file source backward compatibility after calendar switch > > > Key: SPARK-31404 > URL: https://issues.apache.org/jira/browse/SPARK-31404 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Blocker > > In Spark 3.0, we switch to the Proleptic Gregorian calendar by using the Java > 8 datetime APIs. This makes Spark follow the ISO and SQL standard, but > introduces some backward compatibility problems: > 1. may read wrong data from the data files written by Spark 2.4 > 2. may have perf regression -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31404) file source backward compatibility issues after switching to Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-31404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31404: Summary: file source backward compatibility issues after switching to Proleptic Gregorian calendar (was: backward compatibility issues after switching to Proleptic Gregorian calendar) > file source backward compatibility issues after switching to Proleptic > Gregorian calendar > - > > Key: SPARK-31404 > URL: https://issues.apache.org/jira/browse/SPARK-31404 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Blocker > > In Spark 3.0, we switch to the Proleptic Gregorian calendar by using the Java > 8 datetime APIs. This makes Spark follow the ISO and SQL standard, but > introduces some backward compatibility problems: > 1. may read wrong data from the data files written by Spark 2.4 > 2. may have perf regression -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31597) extracting day from intervals should be interval.days + days in interval.microsecond
[ https://issues.apache.org/jira/browse/SPARK-31597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094560#comment-17094560 ] Kent Yao commented on SPARK-31597: -- work log manually [https://github.com/apache/spark/pull/28396] > extracting day from intervals should be interval.days + days in > interval.microsecond > > > Key: SPARK-31597 > URL: https://issues.apache.org/jira/browse/SPARK-31597 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > checked with both Presto and PostgresSQL, one is implemented intervals with > ANSI style year-month/day-time, the other is mixed and Non-ANSI. They both > add the exceeded days in interval time part to the total days of the > operation which extracts day from interval values > > ```sql > presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - > cast('2020-01-01 00:00:00' as timestamp))); > _col0 > --- > 14 > (1 row) > Query 20200428_135239_0_ahn7x, FINISHED, 1 node > Splits: 17 total, 17 done (100.00%) > 0:01 [0 rows, 0B] [0 rows/s, 0B/s] > presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - > cast('2020-01-01 00:00:01' as timestamp))); > _col0 > --- > 13 > (1 row) > Query 20200428_135246_1_ahn7x, FINISHED, 1 node > Splits: 17 total, 17 done (100.00%) > 0:00 [0 rows, 0B] [0 rows/s, 0B/s] > presto> > ``` > ```scala > postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) > - cast('2020-01-01 00:00:00' as timestamp))); > date_part > --- > 14 > (1 row) > postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) > - cast('2020-01-01 00:00:01' as timestamp))); > date_part > --- > 13 > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31597) extracting day from intervals should be interval.days + days in interval.microsecond
[ https://issues.apache.org/jira/browse/SPARK-31597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094560#comment-17094560 ] Kent Yao edited comment on SPARK-31597 at 4/28/20, 2:30 PM: work logged manually [https://github.com/apache/spark/pull/28396] was (Author: qin yao): work log manually [https://github.com/apache/spark/pull/28396] > extracting day from intervals should be interval.days + days in > interval.microsecond > > > Key: SPARK-31597 > URL: https://issues.apache.org/jira/browse/SPARK-31597 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > checked with both Presto and PostgresSQL, one is implemented intervals with > ANSI style year-month/day-time, the other is mixed and Non-ANSI. They both > add the exceeded days in interval time part to the total days of the > operation which extracts day from interval values > > ```sql > presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - > cast('2020-01-01 00:00:00' as timestamp))); > _col0 > --- > 14 > (1 row) > Query 20200428_135239_0_ahn7x, FINISHED, 1 node > Splits: 17 total, 17 done (100.00%) > 0:01 [0 rows, 0B] [0 rows/s, 0B/s] > presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - > cast('2020-01-01 00:00:01' as timestamp))); > _col0 > --- > 13 > (1 row) > Query 20200428_135246_1_ahn7x, FINISHED, 1 node > Splits: 17 total, 17 done (100.00%) > 0:00 [0 rows, 0B] [0 rows/s, 0B/s] > presto> > ``` > ```scala > postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) > - cast('2020-01-01 00:00:00' as timestamp))); > date_part > --- > 14 > (1 row) > postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) > - cast('2020-01-01 00:00:01' as timestamp))); > date_part > --- > 13 > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31553) Wrong result of isInCollection for large collections
[ https://issues.apache.org/jira/browse/SPARK-31553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31553: --- Assignee: Maxim Gekk > Wrong result of isInCollection for large collections > > > Key: SPARK-31553 > URL: https://issues.apache.org/jira/browse/SPARK-31553 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: correctness > > If the size of a collection passed to isInCollection is bigger than > spark.sql.optimizer.inSetConversionThreshold, the method can return wrong > results for some inputs. For example: > {code:scala} > val set = (0 to 20).map(_.toString).toSet > val data = Seq("1").toDF("x") > println(set.contains("1")) > data.select($"x".isInCollection(set).as("isInCollection")).show() > {code} > {code} > true > +--+ > |isInCollection| > +--+ > | false| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31553) Wrong result of isInCollection for large collections
[ https://issues.apache.org/jira/browse/SPARK-31553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31553. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28388 [https://github.com/apache/spark/pull/28388] > Wrong result of isInCollection for large collections > > > Key: SPARK-31553 > URL: https://issues.apache.org/jira/browse/SPARK-31553 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > If the size of a collection passed to isInCollection is bigger than > spark.sql.optimizer.inSetConversionThreshold, the method can return wrong > results for some inputs. For example: > {code:scala} > val set = (0 to 20).map(_.toString).toSet > val data = Seq("1").toDF("x") > println(set.contains("1")) > data.select($"x".isInCollection(set).as("isInCollection")).show() > {code} > {code} > true > +--+ > |isInCollection| > +--+ > | false| > +--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31586) Replace expression TimeSub(l, r) with TimeAdd(l -r)
[ https://issues.apache.org/jira/browse/SPARK-31586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31586. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28381 [https://github.com/apache/spark/pull/28381] > Replace expression TimeSub(l, r) with TimeAdd(l -r) > --- > > Key: SPARK-31586 > URL: https://issues.apache.org/jira/browse/SPARK-31586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.1.0 > > > The implementation of TimeSub for the timestamp-minus-interval operation largely > duplicates TimeAdd. We can replace it with TimeAdd(l, > -r) since they are equivalent. > Suggestion from > https://github.com/apache/spark/pull/28310#discussion_r414259239 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31586) Replace expression TimeSub(l, r) with TimeAdd(l -r)
[ https://issues.apache.org/jira/browse/SPARK-31586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31586: --- Assignee: Kent Yao > Replace expression TimeSub(l, r) with TimeAdd(l -r) > --- > > Key: SPARK-31586 > URL: https://issues.apache.org/jira/browse/SPARK-31586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > > The implementation of TimeSub for the timestamp-minus-interval operation largely > duplicates TimeAdd. We can replace it with TimeAdd(l, > -r) since they are equivalent. > Suggestion from > https://github.com/apache/spark/pull/28310#discussion_r414259239 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
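The equivalence behind this refactor can be seen from the public SQL surface; a spark-shell sketch follows (the ticket itself is about the internal TimeSub/TimeAdd expressions, which this snippet does not touch, and it assumes the usual `spark` session):

{code:scala}
// Hedged sketch: subtracting an interval equals adding its negation.
spark.sql("SELECT timestamp'2020-01-15 00:00:00' - interval 1 day AS sub").show()
spark.sql("SELECT timestamp'2020-01-15 00:00:00' + interval -1 day AS add_negated").show()
// Both queries return 2020-01-14 00:00:00, which is why TimeSub(l, r) can be
// rewritten as TimeAdd(l, -r).
{code}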
[jira] [Created] (SPARK-31597) extracting day from intervals should be interval.days + days in interval.microsecond
Kent Yao created SPARK-31597: Summary: extracting day from intervals should be interval.days + days in interval.microsecond Key: SPARK-31597 URL: https://issues.apache.org/jira/browse/SPARK-31597 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Kent Yao checked with both Presto and PostgresSQL, one is implemented intervals with ANSI style year-month/day-time, the other is mixed and Non-ANSI. They both add the exceeded days in interval time part to the total days of the operation which extracts day from interval values ```sql presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp))); _col0 --- 14 (1 row) Query 20200428_135239_0_ahn7x, FINISHED, 1 node Splits: 17 total, 17 done (100.00%) 0:01 [0 rows, 0B] [0 rows/s, 0B/s] presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp))); _col0 --- 13 (1 row) Query 20200428_135246_1_ahn7x, FINISHED, 1 node Splits: 17 total, 17 done (100.00%) 0:00 [0 rows, 0B] [0 rows/s, 0B/s] presto> ``` ```scala postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp))); date_part --- 14 (1 row) postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp))); date_part --- 13 ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
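The arithmetic the title describes, as a standalone sketch of the expected EXTRACT(DAY ...) behaviour (not Spark's internal interval code); the helper name and example values are illustrative only:

{code:scala}
// Hedged sketch: DAY should be the interval's days plus the whole days carried in
// the microseconds part, matching the Presto/PostgreSQL output quoted above.
val microsPerDay = 24L * 60 * 60 * 1000 * 1000

def extractDay(days: Int, microseconds: Long): Long =
  days + microseconds / microsPerDay

// '2020-01-15 00:00:00' - '2020-01-01 00:00:00' = 14 days exactly
println(extractDay(14, 0L))                                    // 14
// '2020-01-15 00:00:00' - '2020-01-01 00:00:01' = 13 days 23:59:59
println(extractDay(13, (23L * 3600 + 59 * 60 + 59) * 1000000)) // 13
{code}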
[jira] [Updated] (SPARK-31596) Generate SQL Configurations from hive module to configuration doc
[ https://issues.apache.org/jira/browse/SPARK-31596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-31596: - Description: ATT > Generate SQL Configurations from hive module to configuration doc > - > > Key: SPARK-31596 > URL: https://issues.apache.org/jira/browse/SPARK-31596 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Minor > > ATT -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31596) Generate SQL Configurations from hive module to configuration doc
Kent Yao created SPARK-31596: Summary: Generate SQL Configurations from hive module to configuration doc Key: SPARK-31596 URL: https://issues.apache.org/jira/browse/SPARK-31596 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31595) Spark sql cli should allow unescaped quote mark in quoted string
Adrian Wang created SPARK-31595: --- Summary: Spark sql cli should allow unescaped quote mark in quoted string Key: SPARK-31595 URL: https://issues.apache.org/jira/browse/SPARK-31595 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Adrian Wang spark-sql> select "'"; spark-sql> select '"'; In the Spark parser, if we pass the text `select "'";`, there will be a ParserCancellationException, which is then handled by re-parsing with PredictionMode.LL. By dropping the trailing `;` correctly we can avoid that retry. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31594) Do not display rand/randn seed numbers in schema
[ https://issues.apache.org/jira/browse/SPARK-31594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094362#comment-17094362 ] Takeshi Yamamuro commented on SPARK-31594: -- I'm working on this https://github.com/apache/spark/pull/28392 > Do not display rand/randn seed numbers in schema > > > Key: SPARK-31594 > URL: https://issues.apache.org/jira/browse/SPARK-31594 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31594) Do not display rand/randn seed numbers in schema
[ https://issues.apache.org/jira/browse/SPARK-31594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-31594: - Summary: Do not display rand/randn seed numbers in schema (was: Do not display rand/randn seed in schema) > Do not display rand/randn seed numbers in schema > > > Key: SPARK-31594 > URL: https://issues.apache.org/jira/browse/SPARK-31594 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31594) Do not display rand/randn seed in schema
Takeshi Yamamuro created SPARK-31594: Summary: Do not display rand/randn seed in schema Key: SPARK-31594 URL: https://issues.apache.org/jira/browse/SPARK-31594 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Takeshi Yamamuro -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31593) Remove unnecessary streaming query progress update
Genmao Yu created SPARK-31593: - Summary: Remove unnecessary streaming query progress update Key: SPARK-31593 URL: https://issues.apache.org/jira/browse/SPARK-31593 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.5, 3.0.0 Reporter: Genmao Yu The Structured Streaming progress reporter currently reports an `empty` progress update whenever there is no new data. As designed, it should only provide progress updates every 10s (by default) when there is no new data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
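To see where these no-new-data progress updates surface, here is a hedged sketch using the public StreamingQueryListener API; it only observes the behaviour discussed above (assuming a `spark` session as in spark-shell) and is not the proposed fix:

{code:scala}
// Hedged sketch: observation only.
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // Triggers with no new data show numInputRows == 0; the ticket argues such
    // updates should be rate-limited to the no-data interval (10s by default)
    // instead of being emitted on every trigger.
    println(s"batch=${p.batchId} inputRows=${p.numInputRows}")
  }
})
{code}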
[jira] [Updated] (SPARK-31592) bufferPoolsBySize in HeapMemoryAllocator should be thread safe
[ https://issues.apache.org/jira/browse/SPARK-31592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yunbo Fan updated SPARK-31592: -- Description: Currently, bufferPoolsBySize in HeapMemoryAllocator uses a Map type whose value type is LinkedList. LinkedList is not thread safe and may hit the error below {code:java} java.util.NoSuchElementExceptionException at java.util.LinkedList.removeFirst(LinkedList.java:270) at java.util.LinkedList.remove(LinkedList.java:685) at org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57){code} was: Currently, bufferPoolsBySize in HeapMemoryAllocator uses a Map type whose value type is LinkedList. LinkedList is not thread safe and may hit the error below {code:java} java.util.NoSuchElementExceptionException at java.util.LinkedList.removeFirst(LinkedList.java:270) at java.util.LinkedList.remove(LinkedList.java:685) at org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57){code} > bufferPoolsBySize in HeapMemoryAllocator should be thread safe > -- > > Key: SPARK-31592 > URL: https://issues.apache.org/jira/browse/SPARK-31592 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Yunbo Fan >Priority: Major > > Currently, bufferPoolsBySize in HeapMemoryAllocator uses a Map type whose > value type is LinkedList. > LinkedList is not thread safe and may hit the error below > {code:java} > java.util.NoSuchElementExceptionException > at java.util.LinkedList.removeFirst(LinkedList.java:270) > at java.util.LinkedList.remove(LinkedList.java:685) > at > org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57){code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31592) bufferPoolsBySize in HeapMemoryAllocator should be thread safe
Yunbo Fan created SPARK-31592: - Summary: bufferPoolsBySize in HeapMemoryAllocator should be thread safe Key: SPARK-31592 URL: https://issues.apache.org/jira/browse/SPARK-31592 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.5 Reporter: Yunbo Fan Currently, bufferPoolsBySize in HeapMemoryAllocator uses a Map type whose value type is LinkedList. LinkedList is not thread safe and may hit the error below {code:java} java.util.NoSuchElementExceptionException at java.util.LinkedList.removeFirst(LinkedList.java:270) at java.util.LinkedList.remove(LinkedList.java:685) at org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate(HeapMemoryAllocator.java:57){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26924) Fix CRAN hack as soon as Arrow is available on CRAN
[ https://issues.apache.org/jira/browse/SPARK-26924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26924. -- Resolution: Duplicate > Fix CRAN hack as soon as Arrow is available on CRAN > --- > > Key: SPARK-26924 > URL: https://issues.apache.org/jira/browse/SPARK-26924 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Arrow optimization was added but Arrow is not available on CRAN. > So, some hacks had to be added on the SparkR side to avoid the CRAN check. For example, > see > https://github.com/apache/spark/search?q=requireNamespace1&unscoped_q=requireNamespace1 > These should be removed to properly check CRAN in SparkR. > See also ARROW-3204 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31583) grouping_id calculation should be improved
[ https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094284#comment-17094284 ] Takeshi Yamamuro commented on SPARK-31583: -- [~cpiliotis] Hi, thanks for your report! Just a question: are you proposing the two things below in this JIRA? - reordering bit positions in grouping_id corresponding to a projection list in select - flipping the current output in grouping_id > grouping_id calculation should be improved > -- > > Key: SPARK-31583 > URL: https://issues.apache.org/jira/browse/SPARK-31583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Costas Piliotis >Priority: Minor > > Unrelated to SPARK-21858 which identifies that grouping_id is determined by > exclusion from a grouping_set rather than inclusion, when performing complex > grouping_sets that are not in the order of the base select statement, > flipping the bit in the grouping_id seems to happen when the grouping set > is identified rather than when the columns are selected in the sql. I will > of course use the exclusion strategy identified in SPARK-21858 as the > baseline for this. > > {code:scala} > import spark.implicits._ > val df= Seq( > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d") > ).toDF("a","b","c","d").createOrReplaceTempView("abc") > {code} > expected to have these references in the grouping_id: > d=1 > c=2 > b=4 > a=8 > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > This returns: > {noformat} > ++++++---+---+ > |a |b |c |d |count(1)|gid|gid_bin| > ++++++---+---+ > |a |null|c |null|4 |6 |110| > |null|null|null|null|4 |15 | | > |a |null|null|d |4 |5 |101| > |a |b |null|d |4 |1 |1 | > ++++++---+---+ > {noformat} > > In other words, I would have expected the excluded values one way but I > received them excluded in the order they were first seen in the specified > grouping sets. > a,b,d included = excludes c = 2; expected gid=2. received gid=1 > a,d included = excludes b=4, c=2 expected gid=6, received gid=5 > The grouping_id that actually is expected is (a,b,d,c) > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, > bin(grouping_id(a,b,d,c)) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > columns forming grouping_id seem to be created as the grouping sets are > identified rather than by ordinal position in the parent query. > I'd like to at least point out that grouping_id is documented in many other > RDBMSs and I believe the spark project should use a policy of flipping the > bits so 1=inclusion; 0=exclusion in the grouping set. > However, many RDBMSs that do have the feature of a grouping_id do implement it > by the ordinal position recognized as fields in the select clause, rather > than allocating them as they are observed in the grouping sets. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31583) grouping_id calculation should be improved
[ https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094284#comment-17094284 ] Takeshi Yamamuro edited comment on SPARK-31583 at 4/28/20, 8:28 AM: [~cpiliotis] Hi, thanks for your report! Just to check: are you proposing the two things below in this JIRA? - reordering bit positions in grouping_id corresponding to a projection list in select - flipping the current output in grouping_id was (Author: maropu): [~cpiliotis] Hi, thanks for your report! Just a question: are you proposing the two things below in this JIRA? - reordering bit positions in grouping_id corresponding to a projection list in select - flipping the current output in grouping_id > grouping_id calculation should be improved > -- > > Key: SPARK-31583 > URL: https://issues.apache.org/jira/browse/SPARK-31583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Costas Piliotis >Priority: Minor > > Unrelated to SPARK-21858 which identifies that grouping_id is determined by > exclusion from a grouping_set rather than inclusion, when performing complex > grouping_sets that are not in the order of the base select statement, > flipping the bit in the grouping_id seems to happen when the grouping set > is identified rather than when the columns are selected in the sql. I will > of course use the exclusion strategy identified in SPARK-21858 as the > baseline for this. > > {code:scala} > import spark.implicits._ > val df= Seq( > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d") > ).toDF("a","b","c","d").createOrReplaceTempView("abc") > {code} > expected to have these references in the grouping_id: > d=1 > c=2 > b=4 > a=8 > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > This returns: > {noformat} > ++++++---+---+ > |a |b |c |d |count(1)|gid|gid_bin| > ++++++---+---+ > |a |null|c |null|4 |6 |110| > |null|null|null|null|4 |15 | | > |a |null|null|d |4 |5 |101| > |a |b |null|d |4 |1 |1 | > ++++++---+---+ > {noformat} > > In other words, I would have expected the excluded values one way but I > received them excluded in the order they were first seen in the specified > grouping sets. > a,b,d included = excludes c = 2; expected gid=2. received gid=1 > a,d included = excludes b=4, c=2 expected gid=6, received gid=5 > The grouping_id that actually is expected is (a,b,d,c) > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, > bin(grouping_id(a,b,d,c)) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > columns forming grouping_id seem to be created as the grouping sets are > identified rather than by ordinal position in the parent query. > I'd like to at least point out that grouping_id is documented in many other > RDBMSs and I believe the spark project should use a policy of flipping the > bits so 1=inclusion; 0=exclusion in the grouping set. > However, many RDBMSs that do have the feature of a grouping_id do implement it > by the ordinal position recognized as fields in the select clause, rather > than allocating them as they are observed in the grouping sets. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31573) Use fixed=TRUE where possible for internal efficiency
[ https://issues.apache.org/jira/browse/SPARK-31573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31573: - Issue Type: Bug (was: Documentation) > Use fixed=TRUE where possible for internal efficiency > - > > Key: SPARK-31573 > URL: https://issues.apache.org/jira/browse/SPARK-31573 > Project: Spark > Issue Type: Bug > Components: R >Affects Versions: 2.4.5 >Reporter: Michael Chirico >Assignee: Michael Chirico >Priority: Minor > Fix For: 3.0.0 > > > gsub('_', '', x) is more efficient if we signal there's no regex: gsub('_', > '', x, fixed = TRUE) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31573) Use fixed=TRUE where possible for internal efficiency
[ https://issues.apache.org/jira/browse/SPARK-31573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31573. -- Assignee: Michael Chirico Resolution: Fixed Fixed in https://github.com/apache/spark/pull/28367 > Use fixed=TRUE where possible for internal efficiency > - > > Key: SPARK-31573 > URL: https://issues.apache.org/jira/browse/SPARK-31573 > Project: Spark > Issue Type: Documentation > Components: R >Affects Versions: 2.4.5 >Reporter: Michael Chirico >Assignee: Michael Chirico >Priority: Minor > > gsub('_', '', x) is more efficient if we signal there's no regex: gsub('_', > '', x, fixed = TRUE) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31573) Use fixed=TRUE where possible for internal efficiency
[ https://issues.apache.org/jira/browse/SPARK-31573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31573: - Fix Version/s: 3.0.0 > Use fixed=TRUE where possible for internal efficiency > - > > Key: SPARK-31573 > URL: https://issues.apache.org/jira/browse/SPARK-31573 > Project: Spark > Issue Type: Documentation > Components: R >Affects Versions: 2.4.5 >Reporter: Michael Chirico >Assignee: Michael Chirico >Priority: Minor > Fix For: 3.0.0 > > > gsub('_', '', x) is more efficient if we signal there's no regex: gsub('_', > '', x, fixed = TRUE) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31519) Cast in having aggregate expressions returns the wrong result
[ https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31519. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28294 [https://github.com/apache/spark/pull/28294] > Cast in having aggregate expressions returns the wrong result > - > > Key: SPARK-31519 > URL: https://issues.apache.org/jira/browse/SPARK-31519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.0.0 > > > Cast in having aggregate expressions returns the wrong result. > See the below tests: > {code:java} > scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)") > res0: org.apache.spark.sql.DataFrame = [] > scala> val query = """ > | select sum(a) as b, '2020-01-01' as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+--+ > | b| fake| > +---+--+ > | 2|2020-01-01| > +---+--+ > scala> val query = """ > | select sum(a) as b, cast('2020-01-01' as date) as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---++ > | b|fake| > +---++ > +---++ > {code} > The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING > query, and Spark has a special analyzer rule ResolveAggregateFunctions to > resolve the aggregate functions and grouping columns in the Filter operator. > > It works for simple cases in a very tricky way as it relies on rule execution > order: > 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes > inside aggregate functions, but the function itself is still unresolved as > it's an UnresolvedFunction. This stops resolving the Filter operator as the > child Aggregate operator is still unresolved. > 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate > operator resolved. > 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child > is a resolved Aggregate. This rule can correctly resolve the grouping columns. > > In the example query, I put a CAST, which needs to be resolved by rule > ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step > 3 as the Aggregate operator is unresolved at that time. Then the analyzer > starts the next round and the Filter operator is resolved by ResolveReferences, > which wrongly resolves the grouping columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31519) Cast in having aggregate expressions returns the wrong result
[ https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31519: --- Assignee: Yuanjian Li > Cast in having aggregate expressions returns the wrong result > - > > Key: SPARK-31519 > URL: https://issues.apache.org/jira/browse/SPARK-31519 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > > Cast in having aggregate expressions returns the wrong result. > See the below tests: > {code:java} > scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)") > res0: org.apache.spark.sql.DataFrame = [] > scala> val query = """ > | select sum(a) as b, '2020-01-01' as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+--+ > | b| fake| > +---+--+ > | 2|2020-01-01| > +---+--+ > scala> val query = """ > | select sum(a) as b, cast('2020-01-01' as date) as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---++ > | b|fake| > +---++ > +---++ > {code} > The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING > query, and Spark has a special analyzer rule ResolveAggregateFunctions to > resolve the aggregate functions and grouping columns in the Filter operator. > > It works for simple cases in a very tricky way as it relies on rule execution > order: > 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes > inside aggregate functions, but the function itself is still unresolved as > it's an UnresolvedFunction. This stops resolving the Filter operator as the > child Aggregate operator is still unresolved. > 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate > operator resolved. > 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child > is a resolved Aggregate. This rule can correctly resolve the grouping columns. > > In the example query, I put a CAST, which needs to be resolved by rule > ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step > 3 as the Aggregate operator is unresolved at that time. Then the analyzer > starts the next round and the Filter operator is resolved by ResolveReferences, > which wrongly resolves the grouping columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30868) Throw Exception if runHive(sql) failed
[ https://issues.apache.org/jira/browse/SPARK-30868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094274#comment-17094274 ] Yuming Wang commented on SPARK-30868: - Issue resolved by pull request 27644 https://github.com/apache/spark/pull/27644 > Throw Exception if runHive(sql) failed > -- > > Key: SPARK-30868 > URL: https://issues.apache.org/jira/browse/SPARK-30868 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Jackey Lee >Assignee: Jackey Lee >Priority: Major > Fix For: 3.0.0 > > > At present, HiveClientImpl.runHive will not throw an exception when it runs > incorrectly, which causes it to fail to report error information > properly. > Example > {code:scala} > spark.sql("add jar file:///tmp/test.jar") > spark.sql("show databases").show() > {code} > /tmp/test.jar doesn't exist, thus the add jar command fails. However, this code will > run to completion without causing an application failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30868) Throw Exception if runHive(sql) failed
[ https://issues.apache.org/jira/browse/SPARK-30868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-30868. - Fix Version/s: 3.0.0 Resolution: Fixed > Throw Exception if runHive(sql) failed > -- > > Key: SPARK-30868 > URL: https://issues.apache.org/jira/browse/SPARK-30868 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Jackey Lee >Assignee: Jackey Lee >Priority: Major > Fix For: 3.0.0 > > > At present, HiveClientImpl.runHive will not throw an exception when it runs > incorrectly, which causes it to fail to report error information > properly. > Example > {code:scala} > spark.sql("add jar file:///tmp/test.jar") > spark.sql("show databases").show() > {code} > /tmp/test.jar doesn't exist, thus the add jar command fails. However, this code will > run to completion without causing an application failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30868) Throw Exception if runHive(sql) failed
[ https://issues.apache.org/jira/browse/SPARK-30868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-30868: --- Assignee: Jackey Lee > Throw Exception if runHive(sql) failed > -- > > Key: SPARK-30868 > URL: https://issues.apache.org/jira/browse/SPARK-30868 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Jackey Lee >Assignee: Jackey Lee >Priority: Major > > At present, HiveClientImpl.runHive will not throw an exception when it runs > incorrectly, which causes it to fail to report error information > properly. > Example > {code:scala} > spark.sql("add jar file:///tmp/test.jar") > spark.sql("show databases").show() > {code} > /tmp/test.jar doesn't exist, thus the add jar command fails. However, this code will > run to completion without causing an application failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31524) Add metric to the split number for skew partition when enable AQE
[ https://issues.apache.org/jira/browse/SPARK-31524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31524. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28109 [https://github.com/apache/spark/pull/28109] > Add metric to the split number for skew partition when enable AQE > -- > > Key: SPARK-31524 > URL: https://issues.apache.org/jira/browse/SPARK-31524 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.1.0 > > > Add detailed metrics for the number of splits in skewed partitions when > AQE and skew join optimization are enabled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
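As background for the metric, a rough sketch of enabling the features involved (the configuration keys exist in Spark 3.x; the tables and the skew are placeholders): with both flags on, AQE splits oversized skewed partitions at join time, and the proposed metric reports how many splits were made, visible in the SQL UI.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-skew-join-demo")
  .master("local[*]")
  // Both adaptive execution and its skew-join optimization must be enabled.
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.skewJoin.enabled", "true")
  .getOrCreate()

// Placeholder data: a heavily repeated key on the left side can trigger the
// skewed-partition handling, whose split counts surface in the SQL UI metrics.
val left  = spark.range(0, 1000000L).selectExpr("id % 10 AS key", "id AS payload")
val right = spark.range(0, 100L).selectExpr("id AS key", "id AS other")
left.join(right, "key").count()
{code}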
[jira] [Assigned] (SPARK-31524) Add metric to the split number for skew partition when enable AQE
[ https://issues.apache.org/jira/browse/SPARK-31524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31524: --- Assignee: Ke Jia > Add metric to the split number for skew partition when enable AQE > -- > > Key: SPARK-31524 > URL: https://issues.apache.org/jira/browse/SPARK-31524 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > > Add detailed metrics for the number of splits in skewed partitions when > AQE and skew join optimization are enabled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26199) Long expressions cause mutate to fail
[ https://issues.apache.org/jira/browse/SPARK-26199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094232#comment-17094232 ] Michael Chirico commented on SPARK-26199: - Just saw this. https://issues.apache.org/jira/browse/SPARK-31517 is a duplicate of this. PR to fix it is here: https://github.com/apache/spark/pull/28386 I'll tag this Jira as well. > Long expressions cause mutate to fail > - > > Key: SPARK-26199 > URL: https://issues.apache.org/jira/browse/SPARK-26199 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: João Rafael >Priority: Minor > > Calling {{mutate(df, field = expr)}} fails when expr is very long. > Example: > {code:R} > df <- mutate(df, field = ifelse( > lit(TRUE), > lit("A"), > ifelse( > lit(T), > lit("BB"), > lit("C") > ) > )) > {code} > Stack trace: > {code:R} > FATAL subscript out of bounds > at .handleSimpleError(function (obj) > { > level = sapply(class(obj), sw > at FUN(X[[i]], ...) > at lapply(seq_along(args), function(i) { > if (ns[[i]] != "") { > at lapply(seq_along(args), function(i) { > if (ns[[i]] != "") { > at mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T), lit("BBB > at #78: mutate(df, field = ifelse(lit(TRUE), lit("A"), ifelse(lit(T > {code} > The root cause is in: > [DataFrame.R#LL2182|https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L2182] > When the expression is long, {{deparse}} returns multiple lines, causing > {{args}} to have more elements than {{ns}}. The solution could be to set > {{nlines = 1}} or to collapse the lines together. > A simple workaround exists: first place the expression in a variable > and use it instead: > {code:R} > tmp <- ifelse( > lit(TRUE), > lit("A"), > ifelse( > lit(T), > lit("BB"), > lit("C") > ) > ) > df <- mutate(df, field = tmp) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31586) Replace expression TimeSub(l, r) with TimeAdd(l -r)
[ https://issues.apache.org/jira/browse/SPARK-31586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094231#comment-17094231 ] Kent Yao commented on SPARK-31586: -- Hi [~Ankitraj], the PR is ready: [https://github.com/apache/spark/pull/28381] > Replace expression TimeSub(l, r) with TimeAdd(l -r) > --- > > Key: SPARK-31586 > URL: https://issues.apache.org/jira/browse/SPARK-31586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Priority: Minor > > The implementation of TimeSub, for subtracting an interval from a timestamp, > largely duplicates TimeAdd. We can replace it with TimeAdd(l, > -r) since they are equivalent. > Suggestion from > https://github.com/apache/spark/pull/28310#discussion_r414259239 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
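The equivalence the proposal relies on can be sanity-checked at the SQL level; a small illustration (run in a spark-shell session, timestamps chosen arbitrarily) where subtracting an interval and adding its negation give the same result:

{code:scala}
// Both queries should print 2019-12-31 00:00:00: subtracting an interval
// (TimeSub today) is the same as adding the negated interval (TimeAdd).
spark.sql("SELECT timestamp'2020-01-01 00:00:00' - interval 1 day AS via_sub").show(false)
spark.sql("SELECT timestamp'2020-01-01 00:00:00' + interval -1 day AS via_add").show(false)
{code}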
[jira] [Updated] (SPARK-31589) Use `r-lib/actions/setup-r` in GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-31589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31589: -- Fix Version/s: 2.4.6 > Use `r-lib/actions/setup-r` in GitHub Action > > > Key: SPARK-31589 > URL: https://issues.apache.org/jira/browse/SPARK-31589 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.6, 3.0.0 > > > `r-lib/actions/setup-r` is a more stable and better-maintained third-party action. > I filed this issue as a `Bug` since the branch is currently broken. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31583) grouping_id calculation should be improved
[ https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-31583: - Component/s: (was: Spark Core) SQL > grouping_id calculation should be improved > -- > > Key: SPARK-31583 > URL: https://issues.apache.org/jira/browse/SPARK-31583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Costas Piliotis >Priority: Minor > > Unrelated to SPARK-21858 which identifies that grouping_id is determined by > exclusion from a grouping_set rather than inclusion, when performing complex > grouping_sets that are not in the order of the base select statement, > flipping the bit in the grouping_id seems to happen when the grouping set > is identified rather than when the columns are selected in the sql. I will > of course use the exclusion strategy identified in SPARK-21858 as the > baseline for this. > > {code:scala} > import spark.implicits._ > val df= Seq( > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d") > ).toDF("a","b","c","d").createOrReplaceTempView("abc") > {code} > expected to have these references in the grouping_id: > d=1 > c=2 > b=4 > a=8 > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > This returns: > {noformat} > ++++++---+---+ > |a |b |c |d |count(1)|gid|gid_bin| > ++++++---+---+ > |a |null|c |null|4 |6 |110| > |null|null|null|null|4 |15 | | > |a |null|null|d |4 |5 |101| > |a |b |null|d |4 |1 |1 | > ++++++---+---+ > {noformat} > > In other words, I would have expected the excluded values one way but I > received them excluded in the order they were first seen in the specified > grouping sets. > a,b,d included = excludes c = 2; expected gid=2. received gid=1 > a,d included = excludes b=4, c=2 expected gid=6, received gid=5 > The grouping_id that actually is expected is (a,b,d,c) > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, > bin(grouping_id(a,b,d,c)) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > columns forming grouping_id seem to be created as the grouping sets are > identified rather than by ordinal position in the parent query. > I'd like to at least point out that grouping_id is documented in many other > RDBMSs and I believe the spark project should use a policy of flipping the > bits so 1=inclusion; 0=exclusion in the grouping set. > However, many RDBMSs that do have the feature of a grouping_id do implement it > by the ordinal position recognized as fields in the select clause, rather > than allocating them as they are observed in the grouping sets. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31583) grouping_id calculation should be improved
[ https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-31583: - Affects Version/s: (was: 2.4.5) 3.1.0 > grouping_id calculation should be improved > -- > > Key: SPARK-31583 > URL: https://issues.apache.org/jira/browse/SPARK-31583 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Costas Piliotis >Priority: Minor > > Unrelated to SPARK-21858 which identifies that grouping_id is determined by > exclusion from a grouping_set rather than inclusion, when performing complex > grouping_sets that are not in the order of the base select statement, > flipping the bit in the grouping_id seems to happen when the grouping set > is identified rather than when the columns are selected in the sql. I will > of course use the exclusion strategy identified in SPARK-21858 as the > baseline for this. > > {code:scala} > import spark.implicits._ > val df= Seq( > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d") > ).toDF("a","b","c","d").createOrReplaceTempView("abc") > {code} > expected to have these references in the grouping_id: > d=1 > c=2 > b=4 > a=8 > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > This returns: > {noformat} > ++++++---+---+ > |a |b |c |d |count(1)|gid|gid_bin| > ++++++---+---+ > |a |null|c |null|4 |6 |110| > |null|null|null|null|4 |15 | | > |a |null|null|d |4 |5 |101| > |a |b |null|d |4 |1 |1 | > ++++++---+---+ > {noformat} > > In other words, I would have expected the excluded values one way but I > received them excluded in the order they were first seen in the specified > grouping sets. > a,b,d included = excludes c = 2; expected gid=2. received gid=1 > a,d included = excludes b=4, c=2 expected gid=6, received gid=5 > The grouping_id that actually is expected is (a,b,d,c) > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, > bin(grouping_id(a,b,d,c)) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > columns forming grouping_id seem to be created as the grouping sets are > identified rather than by ordinal position in the parent query. > I'd like to at least point out that grouping_id is documented in many other > RDBMSs and I believe the spark project should use a policy of flipping the > bits so 1=inclusion; 0=exclusion in the grouping set. > However, many RDBMSs that do have the feature of a grouping_id do implement it > by the ordinal position recognized as fields in the select clause, rather > than allocating them as they are observed in the grouping sets. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31586) Replace expression TimeSub(l, r) with TimeAdd(l -r)
[ https://issues.apache.org/jira/browse/SPARK-31586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094217#comment-17094217 ] Ankit Raj Boudh commented on SPARK-31586: - Hi Kent Yao, are you working on this issue? If not, can I start working on it? > Replace expression TimeSub(l, r) with TimeAdd(l -r) > --- > > Key: SPARK-31586 > URL: https://issues.apache.org/jira/browse/SPARK-31586 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Priority: Minor > > The implementation of TimeSub, for subtracting an interval from a timestamp, > largely duplicates TimeAdd. We can replace it with TimeAdd(l, > -r) since they are equivalent. > Suggestion from > https://github.com/apache/spark/pull/28310#discussion_r414259239 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31591) namePrefix could be null in Utils.createDirectory
[ https://issues.apache.org/jira/browse/SPARK-31591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094215#comment-17094215 ] Lantao Jin commented on SPARK-31591: https://github.com/apache/spark/pull/28385 > namePrefix could be null in Utils.createDirectory > - > > Key: SPARK-31591 > URL: https://issues.apache.org/jira/browse/SPARK-31591 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Minor > > In our production environment, we find that many shuffle files end up located in > /hadoop/2/yarn/local/usercache/b_carmel/appcache/application_1586487864336_4602/*null*-107d4e9c-d3c7-419e-9743-a21dc4eaeb3f/3a > The Utils.createDirectory() method uses a default parameter "spark" > {code} > def createDirectory(root: String, namePrefix: String = "spark"): File = { > {code} > But in some cases, the actual namePrefix is null. If the method is called > with null, the default value will not be applied. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
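The behaviour described above is plain Scala semantics rather than anything specific to Spark: a default argument only kicks in when the argument is omitted, not when null is passed explicitly. A simplified stand-in (not the real Utils.createDirectory) shows the effect:

{code:scala}
// Simplified stand-in that only keeps the signature shape of Utils.createDirectory.
def createDirectory(root: String, namePrefix: String = "spark"): String =
  s"$root/$namePrefix-<uuid>"

createDirectory("/hadoop/2/yarn/local")        // "/hadoop/2/yarn/local/spark-<uuid>"
createDirectory("/hadoop/2/yarn/local", null)  // "/hadoop/2/yarn/local/null-<uuid>"
{code}

So callers that may pass a null prefix need an explicit fallback, for example `Option(namePrefix).getOrElse("spark")` inside the method; whether the linked PR takes that route or fixes the callers is up to the patch.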
[jira] [Commented] (SPARK-31591) namePrefix could be null in Utils.createDirectory
[ https://issues.apache.org/jira/browse/SPARK-31591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094214#comment-17094214 ] Lantao Jin commented on SPARK-31591: [~Ankitraj] I have already filed a PR. > namePrefix could be null in Utils.createDirectory > - > > Key: SPARK-31591 > URL: https://issues.apache.org/jira/browse/SPARK-31591 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Minor > > In our production environment, we find that many shuffle files end up located in > /hadoop/2/yarn/local/usercache/b_carmel/appcache/application_1586487864336_4602/*null*-107d4e9c-d3c7-419e-9743-a21dc4eaeb3f/3a > The Utils.createDirectory() method uses a default parameter "spark" > {code} > def createDirectory(root: String, namePrefix: String = "spark"): File = { > {code} > But in some cases, the actual namePrefix is null. If the method is called > with null, the default value will not be applied. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org