[jira] [Commented] (SPARK-31517) SparkR::orderBy with multiple columns descending produces error
[ https://issues.apache.org/jira/browse/SPARK-31517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094172#comment-17094172 ] Michael Chirico commented on SPARK-31517: - > Error in ns[[i]] : subscript out of bounds This error is coming from mutate https://github.com/apache/spark/blob/2d3e9601b58fbe33aeedb106be7e2a1fafa2e1fd/R/pkg/R/DataFrame.R#L2294 > SparkR::orderBy with multiple columns descending produces error > --- > > Key: SPARK-31517 > URL: https://issues.apache.org/jira/browse/SPARK-31517 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.5 > Environment: Databricks Runtime 6.5 >Reporter: Ross Bowen >Priority: Major > > When specifying two columns within an `orderBy()` function, attempting to get > an ordering by both columns in descending order, an error is returned. > {code:java} > library(magrittr) > library(SparkR) > cars <- cbind(model = rownames(mtcars), mtcars) > carsDF <- createDataFrame(cars) > carsDF %>% > mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), > desc(column("mpg")), desc(column("disp"))))) %>% > head() {code} > This returns an error: > {code:java} > Error in ns[[i]] : subscript out of bounds{code} > This seems to be related to the more general issue that the following code, > excluding the use of the `desc()` function, also fails: > {code:java} > carsDF %>% > mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), > column("mpg"), column("disp")))) %>% > head(){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31583) grouping_id calculation should be improved
[ https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094153#comment-17094153 ] Wenchen Fan commented on SPARK-31583: - cc [~maropu] > grouping_id calculation should be improved > -- > > Key: SPARK-31583 > URL: https://issues.apache.org/jira/browse/SPARK-31583 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Costas Piliotis >Priority: Minor > > Unrelated to SPARK-21858 which identifies that grouping_id is determined by > exclusion from a grouping_set rather than inclusion, when performing complex > grouping_sets that are not in the order of the base select statement, > flipping the bit in the grouping_id seems to happen when the grouping set > is identified rather than when the columns are selected in the sql. I will > of course use the exclusion strategy identified in SPARK-21858 as the > baseline for this. > > {code:scala} > import spark.implicits._ > val df = Seq( > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d"), > ("a","b","c","d") > ).toDF("a","b","c","d").createOrReplaceTempView("abc") > {code} > expected to have these references in the grouping_id: > d=1 > c=2 > b=4 > a=8 > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > This returns: > {noformat} > +----+----+----+----+--------+---+-------+ > |a |b |c |d |count(1)|gid|gid_bin| > +----+----+----+----+--------+---+-------+ > |a |null|c |null|4 |6 |110| > |null|null|null|null|4 |15 |1111| > |a |null|null|d |4 |5 |101| > |a |b |null|d |4 |1 |1| > +----+----+----+----+--------+---+-------+ > {noformat} > > In other words, I would have expected the excluded values one way but I > received them excluded in the order they were first seen in the specified > grouping sets. > a,b,d included = excludes c=2; expected gid=2, received gid=1 > a,d included = excludes b=4, c=2; expected gid=6, received gid=5 > The grouping_id that actually matches the observed output is grouping_id(a,b,d,c): > {code:scala} > spark.sql(""" > select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, > bin(grouping_id(a,b,d,c)) as gid_bin > from abc > group by GROUPING SETS ( > (), > (a,b,d), > (a,c), > (a,d) > ) > """).show(false) > {code} > Columns forming the grouping_id seem to be created as the grouping sets are > identified, rather than by ordinal position in the parent query. > I'd like to at least point out that grouping_id is documented in many other > RDBMSs and I believe the spark project should use a policy of flipping the > bits so 1=inclusion; 0=exclusion in the grouping set. > However, many RDBMSs that do have the feature of a grouping_id implement it > by the ordinal position of the fields recognized in the select clause, rather > than allocating them as they are observed in the grouping sets.
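The bit-assignment discrepancy reported above can be sketched in plain Python (a hypothetical helper, not Spark's implementation): under the exclusion convention of SPARK-21858, grouping_id sets a bit for every column excluded from a grouping set, and the two behaviors differ only in the order columns are assigned bits — SELECT-clause ordinal position versus first-seen order across the grouping sets.

```python
def grouping_id(grouping_set, bit_order):
    """Exclusion-style grouping_id: a 1 bit for every column NOT in the set.

    bit_order lists columns from most- to least-significant bit.
    """
    gid = 0
    for i, col in enumerate(bit_order):
        if col not in grouping_set:                  # excluded column -> bit = 1
            gid |= 1 << (len(bit_order) - 1 - i)
    return gid

select_order = ["a", "b", "c", "d"]       # ordinal position in the SELECT: a=8, b=4, c=2, d=1
first_seen_order = ["a", "b", "d", "c"]   # order first seen in (a,b,d), (a,c), (a,d): a=8, b=4, d=2, c=1

# Grouping set (a,d): b and c are excluded.
print(grouping_id({"a", "d"}, select_order))      # 6 (0110) -- what the reporter expected
print(grouping_id({"a", "d"}, first_seen_order))  # 5 (0101) -- what Spark returned
```

Both strategies agree on the empty set (gid=15) and differ exactly on the rows flagged in the report, which is consistent with bits being allocated as the grouping sets are parsed.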
[jira] [Resolved] (SPARK-31589) Use `r-lib/actions/setup-r` in GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-31589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31589. -- Fix Version/s: 3.0.0 Assignee: Dongjoon Hyun Resolution: Fixed Fixed in https://github.com/apache/spark/pull/28382 > Use `r-lib/actions/setup-r` in GitHub Action > > > Key: SPARK-31589 > URL: https://issues.apache.org/jira/browse/SPARK-31589 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > `r-lib/actions/setup-r` is a more stable and well-maintained third-party action. > I filed this issue as a `Bug` since the branch is currently broken.
[jira] [Updated] (SPARK-31589) Use `r-lib/actions/setup-r` in GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-31589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31589: -- Description: `r-lib/actions/setup-r` is a more stable and well-maintained third-party action. I filed this issue as a `Bug` since the branch is currently broken. was:`r-lib/actions/setup-r` is a more stable and well-maintained third-party action. > Use `r-lib/actions/setup-r` in GitHub Action > > > Key: SPARK-31589 > URL: https://issues.apache.org/jira/browse/SPARK-31589 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > > `r-lib/actions/setup-r` is a more stable and well-maintained third-party action. > I filed this issue as a `Bug` since the branch is currently broken.
[jira] [Updated] (SPARK-31589) Use `r-lib/actions/setup-r` in GitHub Action
[ https://issues.apache.org/jira/browse/SPARK-31589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31589: -- Issue Type: Bug (was: Improvement) > Use `r-lib/actions/setup-r` in GitHub Action > > > Key: SPARK-31589 > URL: https://issues.apache.org/jira/browse/SPARK-31589 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > > `r-lib/actions/setup-r` is a more stable and well-maintained third-party action.
[jira] [Resolved] (SPARK-31587) R installation in Github Actions is being failed
[ https://issues.apache.org/jira/browse/SPARK-31587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31587. -- Resolution: Duplicate > R installation in Github Actions is being failed > > > Key: SPARK-31587 > URL: https://issues.apache.org/jira/browse/SPARK-31587 > Project: Spark > Issue Type: Bug > Components: Project Infra, R >Affects Versions: 2.4.5, 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Currently, R installation seems to be failing, as below: > {code} > Get:61 https://dl.bintray.com/sbt/debian Packages [4174 B] > Get:62 http://ppa.launchpad.net/apt-fast/stable/ubuntu bionic/main amd64 > Packages [532 B] > Get:63 http://ppa.launchpad.net/git-core/ppa/ubuntu bionic/main amd64 > Packages [3036 B] > Get:64 http://ppa.launchpad.net/ondrej/php/ubuntu bionic/main amd64 Packages > [52.0 kB] > Get:65 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic/main > amd64 Packages [33.9 kB] > Get:66 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic/main > Translation-en [10.1 kB] > Reading package lists... > E: The repository 'https://cloud.r-project.org/bin/linux/ubuntu > bionic-cran35/ Release' does not have a Release file. > ##[error]Process completed with exit code 100. > {code}
[jira] [Updated] (SPARK-31590) The filter used by Metadata-only queries should not have Unevaluable
[ https://issues.apache.org/jira/browse/SPARK-31590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-31590: --- Description: When using SPARK-23877, some SQL queries fail. code: {code:scala} sql("set spark.sql.optimizer.metadataOnly=true") sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET PARTITIONED BY (d ,h)") sql(""" |INSERT OVERWRITE TABLE test_tbl PARTITION(d,h) |SELECT 1,'2020-01-01','23' |UNION ALL |SELECT 2,'2020-01-02','01' |UNION ALL |SELECT 3,'2020-01-02','02' """.stripMargin) sql( s""" |SELECT d, MAX(h) AS h |FROM test_tbl |WHERE d= ( | SELECT MAX(d) AS d | FROM test_tbl |) |GROUP BY d """.stripMargin).collect() {code} Exception: {code:java} java.lang.UnsupportedOperationException: Cannot evaluate expression: scalar-subquery#48 [] ... at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.prunePartitions(PartitioningAwareFileIndex.scala:180) {code} optimizedPlan: {code:java} Aggregate [d#245], [d#245, max(h#246) AS h#243] +- Project [d#245, h#246] +- Filter (isnotnull(d#245) AND (d#245 = scalar-subquery#242 [])) : +- Aggregate [max(d#245) AS d#241] : +- LocalRelation , [d#245] +- Relation[a#244,d#245,h#246] parquet {code} was: code: {code:scala} sql("set spark.sql.optimizer.metadataOnly=true") sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET PARTITIONED BY (d ,h)") sql(""" |INSERT OVERWRITE TABLE test_tbl PARTITION(d,h) |SELECT 1,'2020-01-01','23' |UNION ALL |SELECT 2,'2020-01-02','01' |UNION ALL |SELECT 3,'2020-01-02','02' """.stripMargin) sql( s""" |SELECT d, MAX(h) AS h |FROM test_tbl |WHERE d= ( | SELECT MAX(d) AS d | FROM test_tbl |) |GROUP BY d """.stripMargin).collect() {code} Exception: {code:java} java.lang.UnsupportedOperationException: Cannot evaluate expression: scalar-subquery#48 [] ... at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.prunePartitions(PartitioningAwareFileIndex.scala:180) {code} optimizedPlan: {code:java} Aggregate [d#245], [d#245, max(h#246) AS h#243] +- Project [d#245, h#246] +- Filter (isnotnull(d#245) AND (d#245 = scalar-subquery#242 [])) : +- Aggregate [max(d#245) AS d#241] : +- LocalRelation , [d#245] +- Relation[a#244,d#245,h#246] parquet {code} > The filter used by Metadata-only queries should not have Unevaluable > > > Key: SPARK-31590 > URL: https://issues.apache.org/jira/browse/SPARK-31590 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: dzcxzl >Priority: Trivial > > When using SPARK-23877, some SQL queries fail. > code: > {code:scala} > sql("set spark.sql.optimizer.metadataOnly=true") > sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET > PARTITIONED BY (d ,h)") > sql(""" > |INSERT OVERWRITE TABLE test_tbl PARTITION(d,h) > |SELECT 1,'2020-01-01','23' > |UNION ALL > |SELECT 2,'2020-01-02','01' > |UNION ALL > |SELECT 3,'2020-01-02','02' > """.stripMargin) > sql( > s""" > |SELECT d, MAX(h) AS h > |FROM test_tbl > |WHERE d= ( > | SELECT MAX(d) AS d > | FROM test_tbl > |) > |GROUP BY d > """.stripMargin).collect() > {code} > Exception: > {code:java} > java.lang.UnsupportedOperationException: Cannot evaluate expression: > scalar-subquery#48 [] > ... > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.prunePartitions(PartitioningAwareFileIndex.scala:180) > {code} > optimizedPlan: > {code:java} > Aggregate [d#245], [d#245, max(h#246) AS h#243] > +- Project [d#245, h#246] >+- Filter (isnotnull(d#245) AND (d#245 = scalar-subquery#242 [])) > : +- Aggregate [max(d#245) AS d#241] > : +- LocalRelation , [d#245] > +- Relation[a#244,d#245,h#246] parquet > {code}
[jira] [Created] (SPARK-31590) The filter used by Metadata-only queries should not have Unevaluable
dzcxzl created SPARK-31590: -- Summary: The filter used by Metadata-only queries should not have Unevaluable Key: SPARK-31590 URL: https://issues.apache.org/jira/browse/SPARK-31590 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: dzcxzl code: {code:scala} sql("set spark.sql.optimizer.metadataOnly=true") sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET PARTITIONED BY (d ,h)") sql(""" |INSERT OVERWRITE TABLE test_tbl PARTITION(d,h) |SELECT 1,'2020-01-01','23' |UNION ALL |SELECT 2,'2020-01-02','01' |UNION ALL |SELECT 3,'2020-01-02','02' """.stripMargin) sql( s""" |SELECT d, MAX(h) AS h |FROM test_tbl |WHERE d= ( | SELECT MAX(d) AS d | FROM test_tbl |) |GROUP BY d """.stripMargin).collect() {code} Exception: {code:java} java.lang.UnsupportedOperationException: Cannot evaluate expression: scalar-subquery#48 [] ... at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.prunePartitions(PartitioningAwareFileIndex.scala:180) {code} optimizedPlan: {code:java} Aggregate [d#245], [d#245, max(h#246) AS h#243] +- Project [d#245, h#246] +- Filter (isnotnull(d#245) AND (d#245 = scalar-subquery#242 [])) : +- Aggregate [max(d#245) AS d#241] : +- LocalRelation , [d#245] +- Relation[a#244,d#245,h#246] parquet {code}
[jira] [Created] (SPARK-31589) Use `r-lib/actions/setup-r` in GitHub Action
Dongjoon Hyun created SPARK-31589: - Summary: Use `r-lib/actions/setup-r` in GitHub Action Key: SPARK-31589 URL: https://issues.apache.org/jira/browse/SPARK-31589 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 2.4.5, 3.0.0, 3.1.0 Reporter: Dongjoon Hyun `r-lib/actions/setup-r` is a more stable and well-maintained third-party action.
[jira] [Created] (SPARK-31588) merge small files may need more common setting
philipse created SPARK-31588: Summary: merge small files may need more common setting Key: SPARK-31588 URL: https://issues.apache.org/jira/browse/SPARK-31588 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.5 Environment: spark:2.4.5 hdp:2.7 Reporter: philipse Hi, Spark SQL now allows us to use repartition or coalesce to manually control small files, like the following: /*+ REPARTITION(1) */ /*+ COALESCE(1) */ But it can only be tuned case by case; we need to decide whether to use COALESCE or REPARTITION. Can we try a more common way to avoid this decision by setting a target size, as Hive did? *Good points:* 1) we also get the new partition number automatically 2) with an ON-OFF parameter provided, users can disable it if needed 3) the parameter can be set at cluster level instead of on the user side, which makes it easier to control small files 4) greatly reduces the pressure on the namenode *Not good points:* 1) It will add a new task to calculate the target numbers by collecting statistics on the output files. I don't know whether we have planned this in future. Thanks
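The core of the proposal above — derive the partition count from a cluster-wide target file size instead of hand-tuning REPARTITION(n)/COALESCE(n) per query — can be sketched in a few lines. All names and the 128 MB default below are hypothetical, not an existing Spark or Hive setting:

```python
def target_partitions(estimated_output_bytes, target_file_bytes=128 * 1024 * 1024):
    """Partition count so each output file is at most ~target_file_bytes.

    Ceiling division, with a floor of one partition for tiny outputs.
    """
    return max(1, -(-estimated_output_bytes // target_file_bytes))  # -(-a // b) == ceil(a / b)

print(target_partitions(50 * 1024 * 1024))    # 50 MB of output -> 1 file
print(target_partitions(1024 * 1024 * 1024))  # 1 GB of output  -> 8 files of ~128 MB
```

The "not good point" in the ticket is visible here: `estimated_output_bytes` has to come from somewhere, i.e. an extra statistics pass (or a size estimate from the optimizer) before the final write.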
[jira] [Updated] (SPARK-31587) R installation in Github Actions is being failed
[ https://issues.apache.org/jira/browse/SPARK-31587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31587: - Summary: R installation in Github Actions is being failed (was: Fixes the repository for downloading R in Github Actions) > R installation in Github Actions is being failed > > > Key: SPARK-31587 > URL: https://issues.apache.org/jira/browse/SPARK-31587 > Project: Spark > Issue Type: Bug > Components: Project Infra, R >Affects Versions: 2.4.5, 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Currently, R installation seems to be failing, as below: > {code} > Get:61 https://dl.bintray.com/sbt/debian Packages [4174 B] > Get:62 http://ppa.launchpad.net/apt-fast/stable/ubuntu bionic/main amd64 > Packages [532 B] > Get:63 http://ppa.launchpad.net/git-core/ppa/ubuntu bionic/main amd64 > Packages [3036 B] > Get:64 http://ppa.launchpad.net/ondrej/php/ubuntu bionic/main amd64 Packages > [52.0 kB] > Get:65 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic/main > amd64 Packages [33.9 kB] > Get:66 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic/main > Translation-en [10.1 kB] > Reading package lists... > E: The repository 'https://cloud.r-project.org/bin/linux/ubuntu > bionic-cran35/ Release' does not have a Release file. > ##[error]Process completed with exit code 100. > {code}
[jira] [Created] (SPARK-31587) Fixes the repository for downloading R in Github Actions
Hyukjin Kwon created SPARK-31587: Summary: Fixes the repository for downloading R in Github Actions Key: SPARK-31587 URL: https://issues.apache.org/jira/browse/SPARK-31587 Project: Spark Issue Type: Bug Components: Project Infra, R Affects Versions: 2.4.5, 3.0.0 Reporter: Hyukjin Kwon Assignee: Hyukjin Kwon Currently, R installation seems to be failing, as below: {code} Get:61 https://dl.bintray.com/sbt/debian Packages [4174 B] Get:62 http://ppa.launchpad.net/apt-fast/stable/ubuntu bionic/main amd64 Packages [532 B] Get:63 http://ppa.launchpad.net/git-core/ppa/ubuntu bionic/main amd64 Packages [3036 B] Get:64 http://ppa.launchpad.net/ondrej/php/ubuntu bionic/main amd64 Packages [52.0 kB] Get:65 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic/main amd64 Packages [33.9 kB] Get:66 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic/main Translation-en [10.1 kB] Reading package lists... E: The repository 'https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/ Release' does not have a Release file. ##[error]Process completed with exit code 100. {code}
[jira] [Updated] (SPARK-31585) Support Z-order curve
[ https://issues.apache.org/jira/browse/SPARK-31585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31585: Description: Z-ordering is a technique that allows you to map multidimensional data to a single dimension. We can use this feature to improve query performance. More details: https://en.wikipedia.org/wiki/Z-order_curve https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/ Benchmark result: These 2 tables are ordered and z-ordered by AUCT_END_DT, AUCT_START_DT. Filter on the AUCT_START_DT column: !zorder.png! !lexicalorder.png! Filter on the auct_end_dt column: !zorder-AUCT_END_DT.png! !lexicalorder-AUCT_END_DT.png! was: Z-ordering is a technique that allows you to map multidimensional data to a single dimension. We can use this feature to improve query performance. More details: https://en.wikipedia.org/wiki/Z-order_curve https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/ > Support Z-order curve > - > > Key: SPARK-31585 > URL: https://issues.apache.org/jira/browse/SPARK-31585 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: lexicalorder-AUCT_END_DT.png, lexicalorder.png, > zorder-AUCT_END_DT.png, zorder.png > > > Z-ordering is a technique that allows you to map multidimensional data to a > single dimension. We can use this feature to improve query performance. > More details: > https://en.wikipedia.org/wiki/Z-order_curve > https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/ > Benchmark result: > These 2 tables are ordered and z-ordered by AUCT_END_DT, AUCT_START_DT. > Filter on the AUCT_START_DT column: > !zorder.png! > !lexicalorder.png! > Filter on the auct_end_dt column: > !zorder-AUCT_END_DT.png! > !lexicalorder-AUCT_END_DT.png!
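For background on the benchmark above: z-ordering interleaves the bits of each sort key (a Morton code), so rows that are close in either dimension end up close in the one-dimensional sort order, and a range filter on either column can skip most files. A minimal 2-D sketch in Python — an illustration of the curve itself, not Spark's proposed implementation:

```python
def z_value(x, y, bits=16):
    """Morton code of (x, y): interleave the low `bits` bits of each coordinate."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x takes the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y takes the odd bit positions
    return z

# Sorting rows by z_value(col1, col2) clusters them in both dimensions at once,
# unlike a lexical ORDER BY col1, col2, which only clusters the leading column.
print(z_value(0, 0))  # 0
print(z_value(1, 1))  # 3
print(z_value(2, 3))  # 14
```

The two attachment pairs (zorder vs lexicalorder) are comparing exactly this: file-skipping under a filter on the leading versus the trailing sort column.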
[jira] [Updated] (SPARK-31585) Support Z-order curve
[ https://issues.apache.org/jira/browse/SPARK-31585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31585: Attachment: lexicalorder-AUCT_END_DT.png > Support Z-order curve > - > > Key: SPARK-31585 > URL: https://issues.apache.org/jira/browse/SPARK-31585 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: lexicalorder-AUCT_END_DT.png, lexicalorder.png, > zorder-AUCT_END_DT.png, zorder.png > > > Z-ordering is a technique that allows you to map multidimensional data to a > single dimension. We can use this feature to improve query performance. > More details: > https://en.wikipedia.org/wiki/Z-order_curve > https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/
[jira] [Created] (SPARK-31586) Replace expression TimeSub(l, r) with TimeAdd(l, -r)
Kent Yao created SPARK-31586: Summary: Replace expression TimeSub(l, r) with TimeAdd(l, -r) Key: SPARK-31586 URL: https://issues.apache.org/jira/browse/SPARK-31586 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Kent Yao The implementation of TimeSub, for the operation of subtracting an interval from a timestamp, is almost a duplicate of TimeAdd. We can replace it with TimeAdd(l, -r) since they are equivalent. Suggestion from https://github.com/apache/spark/pull/28310#discussion_r414259239
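The equivalence the ticket relies on — l - r == l + (-r) — illustrated with plain-Python datetime arithmetic. Spark's TimeAdd/TimeSub operate on Spark's own interval type, so this is only an analogy, not the Catalyst code:

```python
from datetime import datetime, timedelta

ts = datetime(2020, 4, 27, 12, 0, 0)
interval = timedelta(hours=3)

# Subtracting an interval is the same as adding its negation,
# which is why TimeSub(l, r) can be rewritten as TimeAdd(l, -r).
assert ts - interval == ts + (-interval)
print(ts + (-interval))  # 2020-04-27 09:00:00
```

Collapsing the two expressions means one code path for overflow handling, time-zone handling, and codegen instead of two near-identical ones.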
[jira] [Updated] (SPARK-31585) Support Z-order curve
[ https://issues.apache.org/jira/browse/SPARK-31585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31585: Attachment: zorder-AUCT_END_DT.png > Support Z-order curve > - > > Key: SPARK-31585 > URL: https://issues.apache.org/jira/browse/SPARK-31585 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: lexicalorder.png, zorder-AUCT_END_DT.png, zorder.png > > > Z-ordering is a technique that allows you to map multidimensional data to a > single dimension. We can use this feature to improve query performance. > More details: > https://en.wikipedia.org/wiki/Z-order_curve > https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/
[jira] [Updated] (SPARK-31585) Support Z-order curve
[ https://issues.apache.org/jira/browse/SPARK-31585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31585: Attachment: lexicalorder.png > Support Z-order curve > - > > Key: SPARK-31585 > URL: https://issues.apache.org/jira/browse/SPARK-31585 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: lexicalorder.png, zorder.png > > > Z-ordering is a technique that allows you to map multidimensional data to a > single dimension. We can use this feature to improve query performance. > More details: > https://en.wikipedia.org/wiki/Z-order_curve > https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/
[jira] [Updated] (SPARK-31585) Support Z-order curve
[ https://issues.apache.org/jira/browse/SPARK-31585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31585: Attachment: zorder.png > Support Z-order curve > - > > Key: SPARK-31585 > URL: https://issues.apache.org/jira/browse/SPARK-31585 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > Attachments: zorder.png > > > Z-ordering is a technique that allows you to map multidimensional data to a > single dimension. We can use this feature to improve query performance. > More details: > https://en.wikipedia.org/wiki/Z-order_curve > https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/
[jira] [Updated] (SPARK-31581) paste0 is always better than paste(sep="")
[ https://issues.apache.org/jira/browse/SPARK-31581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31581: - Fix Version/s: (was: 2.4.6) > paste0 is always better than paste(sep="") > -- > > Key: SPARK-31581 > URL: https://issues.apache.org/jira/browse/SPARK-31581 > Project: Spark > Issue Type: Documentation > Components: R >Affects Versions: 2.4.5 >Reporter: Michael Chirico >Assignee: Michael Chirico >Priority: Minor > Fix For: 3.0.0 > > > paste0 is available in the stated R dependency (3.1), so we should use it > instead of paste(sep="")
[jira] [Created] (SPARK-31585) Support Z-order curve
Yuming Wang created SPARK-31585: --- Summary: Support Z-order curve Key: SPARK-31585 URL: https://issues.apache.org/jira/browse/SPARK-31585 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang Z-ordering is a technique that allows you to map multidimensional data to a single dimension. We can use this feature to improve query performance. More details: https://en.wikipedia.org/wiki/Z-order_curve https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/
[jira] [Assigned] (SPARK-31578) Internal checkSchemaInArrow is inefficient
[ https://issues.apache.org/jira/browse/SPARK-31578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31578: Assignee: Michael Chirico > Internal checkSchemaInArrow is inefficient > -- > > Key: SPARK-31578 > URL: https://issues.apache.org/jira/browse/SPARK-31578 > Project: Spark > Issue Type: Documentation > Components: R >Affects Versions: 3.0.0 >Reporter: Michael Chirico >Assignee: Michael Chirico >Priority: Minor > > Current implementation is doubly inefficient: > Repeatedly doing the same (95%) sapply loop > Doing scalar == on a vector (== should be done over the whole vector for > efficiency) > See existing PR: > https://github.com/apache/spark/pull/28372
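The pattern being flagged, sketched in Python since the actual code is SparkR: a per-column loop doing scalar equality repeats work that one whole-sequence comparison does at once (in R, `all(actual == expected)` over whole vectors). The schema values below are made up for illustration:

```python
expected = ["int", "double", "string"]
good = ["int", "double", "string"]
bad = ["int", "string", "double"]

def check_per_field(actual, expected):
    # Element-at-a-time scalar == inside a loop: the style the ticket flags.
    return all(actual[i] == expected[i] for i in range(len(expected)))

def check_vectorized(actual, expected):
    # One whole-sequence comparison; in R this is all(actual == expected).
    return actual == expected

print(check_per_field(good, expected), check_vectorized(good, expected))  # True True
print(check_per_field(bad, expected), check_vectorized(bad, expected))    # False False
```

Both return the same answer; the point of the linked PR is that the vectorized form avoids re-running a near-identical sapply loop per field.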
[jira] [Resolved] (SPARK-31578) Internal checkSchemaInArrow is inefficient
[ https://issues.apache.org/jira/browse/SPARK-31578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31578. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28372 [https://github.com/apache/spark/pull/28372] > Internal checkSchemaInArrow is inefficient > -- > > Key: SPARK-31578 > URL: https://issues.apache.org/jira/browse/SPARK-31578 > Project: Spark > Issue Type: Documentation > Components: R >Affects Versions: 3.0.0 >Reporter: Michael Chirico >Assignee: Michael Chirico >Priority: Minor > Fix For: 3.0.0 > > > Current implementation is doubly inefficient: > Repeatedly doing the same (95%) sapply loop > Doing scalar == on a vector (== should be done over the whole vector for > efficiency) > See existing PR: > https://github.com/apache/spark/pull/28372
[jira] [Updated] (SPARK-31578) Internal checkSchemaInArrow is inefficient
[ https://issues.apache.org/jira/browse/SPARK-31578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31578: - Affects Version/s: (was: 2.4.5) 3.0.0 > Internal checkSchemaInArrow is inefficient > -- > > Key: SPARK-31578 > URL: https://issues.apache.org/jira/browse/SPARK-31578 > Project: Spark > Issue Type: Documentation > Components: R >Affects Versions: 3.0.0 >Reporter: Michael Chirico >Priority: Minor > > Current implementation is doubly inefficient: > Repeatedly doing the same (95%) sapply loop > Doing scalar == on a vector (== should be done over the whole vector for > efficiency) > See existing PR: > https://github.com/apache/spark/pull/28372
[jira] [Resolved] (SPARK-31572) Improve task logs at executor side
[ https://issues.apache.org/jira/browse/SPARK-31572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi resolved SPARK-31572. -- Resolution: Won't Fix We shall use debug level to improve logs. > Improve task logs at executor side > -- > > Key: SPARK-31572 > URL: https://issues.apache.org/jira/browse/SPARK-31572 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > In some places, task names between driver and executor have different formats, > which makes it harder for users to debug task-level slowness. And we can add > more logs to help with debugging.
[jira] [Resolved] (SPARK-31581) paste0 is always better than paste(sep="")
[ https://issues.apache.org/jira/browse/SPARK-31581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31581. -- Fix Version/s: 3.0.0 2.4.6 Assignee: Michael Chirico Resolution: Fixed > paste0 is always better than paste(sep="") > -- > > Key: SPARK-31581 > URL: https://issues.apache.org/jira/browse/SPARK-31581 > Project: Spark > Issue Type: Documentation > Components: R >Affects Versions: 2.4.5 >Reporter: Michael Chirico >Assignee: Michael Chirico >Priority: Minor > Fix For: 2.4.6, 3.0.0 > > > paste0 is available in the stated R dependency (3.1), so we should use it > instead of paste(sep="") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31581) paste0 is always better than paste(sep="")
[ https://issues.apache.org/jira/browse/SPARK-31581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094086#comment-17094086 ] Hyukjin Kwon commented on SPARK-31581: -- Fixed in https://github.com/apache/spark/pull/28374 > paste0 is always better than paste(sep="") > -- > > Key: SPARK-31581 > URL: https://issues.apache.org/jira/browse/SPARK-31581 > Project: Spark > Issue Type: Documentation > Components: R >Affects Versions: 2.4.5 >Reporter: Michael Chirico >Priority: Minor > > paste0 is available in the stated R dependency (3.1), so we should use it > instead of paste(sep="") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31580) Upgrade Apache ORC to 1.5.10
[ https://issues.apache.org/jira/browse/SPARK-31580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31580. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28373 [https://github.com/apache/spark/pull/28373] > Upgrade Apache ORC to 1.5.10 > > > Key: SPARK-31580 > URL: https://issues.apache.org/jira/browse/SPARK-31580 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31580) Upgrade Apache ORC to 1.5.10
[ https://issues.apache.org/jira/browse/SPARK-31580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31580: - Assignee: Dongjoon Hyun > Upgrade Apache ORC to 1.5.10 > > > Key: SPARK-31580 > URL: https://issues.apache.org/jira/browse/SPARK-31580 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31538) Backport SPARK-25338 Ensure to call super.beforeAll() and super.afterAll() in test cases
[ https://issues.apache.org/jira/browse/SPARK-31538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094081#comment-17094081 ] Hyukjin Kwon commented on SPARK-31538: -- Okay, I understand the intention, but Spark 3.0.0 is still an unreleased version. I am okay with porting this back - I just wanted to ask, out of curiosity, why you picked this one up. > Backport SPARK-25338 Ensure to call super.beforeAll() and > super.afterAll() in test cases > -- > > Key: SPARK-31538 > URL: https://issues.apache.org/jira/browse/SPARK-31538 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > Backport SPARK-25338 Ensure to call super.beforeAll() and > super.afterAll() in test cases -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31584) NullPointerException when parsing event log with InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-31584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Baohe Zhang updated SPARK-31584: Attachment: errorstack.txt > NullPointerException when parsing event log with InMemoryStore > -- > > Key: SPARK-31584 > URL: https://issues.apache.org/jira/browse/SPARK-31584 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: Baohe Zhang >Priority: Minor > Fix For: 3.0.1 > > Attachments: errorstack.txt > > > I compiled with the current branch-3.0 source and tested it in mac os. A > java.lang.NullPointerException will be thrown when below conditions are met: > # Using InMemoryStore as kvstore when parsing the event log file (e.g., when > spark.history.store.path is unset). > # At least one stage in this event log has task number greater than > spark.ui.retainedTasks (by default is 10). In this case, kvstore needs to > delete extra task records. > # The job has more than one stage, so parentToChildrenMap in > InMemoryStore.java will have more than one key. > The java.lang.NullPointerException is thrown in InMemoryStore.java :296. In > the method deleteParentIndex(). > {code:java} > private void deleteParentIndex(Object key) { > if (hasNaturalParentIndex) { > for (NaturalKeys v : parentToChildrenMap.values()) { > if (v.remove(asKey(key))) { > // `v` can be empty after removing the natural key and we can > remove it from > // `parentToChildrenMap`. However, `parentToChildrenMap` is a > ConcurrentMap and such > // checking and deleting can be slow. > // This method is to delete one object with certain key, let's > make it simple here. > break; > } > } > } > }{code} > In “if (v.remove(asKey(key)))”, if the key is not contained in v, > "v.remove(asKey(key))" will return null, and java will throw a > NullPointerException when executing "if (null)". > An exception stack trace is attached. 
> This issue can be fixed by updating if statement to > {code:java} > if (v.remove(asKey(key)) != null){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
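The map's values are Boolean, which is why the original `if (v.remove(asKey(key)))` compiles at all: auto-unboxing the null returned for an absent key is what raises the NullPointerException. A minimal Python sketch of the corrected control flow (hypothetical names; a plain dict of dicts stands in for the ConcurrentMap):

```python
# Python sketch of the fixed deleteParentIndex logic (hypothetical names;
# a dict of dicts stands in for InMemoryStore's parentToChildrenMap).

def delete_parent_index(key, parent_to_children):
    for children in parent_to_children.values():
        # dict.pop(key, None) returns None when the key is absent; the
        # explicit `is not None` check mirrors the proposed Java fix
        # `if (v.remove(asKey(key)) != null)` -- nothing null is branched on.
        if children.pop(key, None) is not None:
            # The one object with this key was found and removed; stop here,
            # matching the single-delete intent noted in the Java comment.
            break
```

Calling it with a key that appears in none of the child maps simply falls through the loop, which is exactly the case that crashed before the fix.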
[jira] [Created] (SPARK-31584) NullPointerException when parsing event log with InMemoryStore
Baohe Zhang created SPARK-31584: --- Summary: NullPointerException when parsing event log with InMemoryStore Key: SPARK-31584 URL: https://issues.apache.org/jira/browse/SPARK-31584 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.0.1 Reporter: Baohe Zhang Fix For: 3.0.1 I compiled with the current branch-3.0 source and tested it in mac os. A java.lang.NullPointerException will be thrown when below conditions are met: # Using InMemoryStore as kvstore when parsing the event log file (e.g., when spark.history.store.path is unset). # At least one stage in this event log has task number greater than spark.ui.retainedTasks (by default is 10). In this case, kvstore needs to delete extra task records. # The job has more than one stage, so parentToChildrenMap in InMemoryStore.java will have more than one key. The java.lang.NullPointerException is thrown in InMemoryStore.java :296. In the method deleteParentIndex(). {code:java} private void deleteParentIndex(Object key) { if (hasNaturalParentIndex) { for (NaturalKeys v : parentToChildrenMap.values()) { if (v.remove(asKey(key))) { // `v` can be empty after removing the natural key and we can remove it from // `parentToChildrenMap`. However, `parentToChildrenMap` is a ConcurrentMap and such // checking and deleting can be slow. // This method is to delete one object with certain key, let's make it simple here. break; } } } }{code} In “if (v.remove(asKey(key)))”, if the key is not contained in v, "v.remove(asKey(key))" will return null, and java will throw a NullPointerException when executing "if (null)". An exception stack trace is attached. This issue can be fixed by updating if statement to {code:java} if (v.remove(asKey(key)) != null){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31525) Inconsistent result of df.head(1) and df.head()
[ https://issues.apache.org/jira/browse/SPARK-31525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094028#comment-17094028 ] Holden Karau commented on SPARK-31525: -- I agree it's inconsistent; the docs are also a little misleading. I think the root cause is that we're using head as both `peek` and `take`, which is why we've got mixed metaphors. cc [~davies] who worked on this code most recently (2015) for his thoughts. > Inconsistent result of df.head(1) and df.head() > --- > > Key: SPARK-31525 > URL: https://issues.apache.org/jira/browse/SPARK-31525 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Joshua Hendinata >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > In this line > [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1339], > if you call `df.head()` on an empty dataframe, it will return *None*, > but if you call `df.head(1)` on an empty dataframe, it will return an > *empty list* instead. > This behaviour is not consistent and can create confusion, > especially when you call `len(df.head())`, which will throw an > exception for an empty dataframe -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
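The peek/take split is visible in the shape of the implementation. A simplified Python sketch of the branching (not the real PySpark code, which delegates to take underneath):

```python
def head(rows, n=None):
    # Simplified sketch of DataFrame.head's branching; `rows` stands in for
    # the DataFrame's contents (this is NOT the actual PySpark implementation).
    if n is None:
        first = rows[:1]                    # take(1)
        return first[0] if first else None  # "peek": a single row, or None
    return rows[:n]                         # "take": a possibly-empty list
```

So on an empty input, `head()` yields `None` while `head(1)` yields `[]`, which is exactly the inconsistency reported.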
[jira] [Resolved] (SPARK-31577) fix various problems when check name conflicts of CTE relations
[ https://issues.apache.org/jira/browse/SPARK-31577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31577. --- Fix Version/s: 3.0.0 Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/28371 > fix various problems when check name conflicts of CTE relations > --- > > Key: SPARK-31577 > URL: https://issues.apache.org/jira/browse/SPARK-31577 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31536) Backport SPARK-25407 Allow nested access for non-existent field for Parquet file when nested pruning is enabled
[ https://issues.apache.org/jira/browse/SPARK-31536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-31536. -- Resolution: Won't Fix > Backport SPARK-25407 Allow nested access for non-existent field for > Parquet file when nested pruning is enabled > - > > Key: SPARK-31536 > URL: https://issues.apache.org/jira/browse/SPARK-31536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > Consider backporting SPARK-25407 Allow nested access for non-existent > field for Parquet file when nested pruning is enabled to 2.4.6 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31536) Backport SPARK-25407 Allow nested access for non-existent field for Parquet file when nested pruning is enabled
[ https://issues.apache.org/jira/browse/SPARK-31536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094024#comment-17094024 ] Holden Karau commented on SPARK-31536: -- If the code path is already disabled by default then yeah let's skip the backport. > Backport SPARK-25407 Allow nested access for non-existent field for > Parquet file when nested pruning is enabled > - > > Key: SPARK-31536 > URL: https://issues.apache.org/jira/browse/SPARK-31536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > Consider backporting SPARK-25407 Allow nested access for non-existent > field for Parquet file when nested pruning is enabled to 2.4.6 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31537) Backport SPARK-25559 Remove the unsupported predicates in Parquet when possible
[ https://issues.apache.org/jira/browse/SPARK-31537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-31537. -- Resolution: Won't Fix per [~hyukjin.kwon] > Backport SPARK-25559 Remove the unsupported predicates in Parquet when > possible > > > Key: SPARK-31537 > URL: https://issues.apache.org/jira/browse/SPARK-31537 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.6 >Reporter: Holden Karau >Assignee: DB Tsai >Priority: Major > > Consider backporting SPARK-25559 Remove the unsupported predicates in > Parquet when possible to 2.4.6 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31538) Backport SPARK-25338 Ensure to call super.beforeAll() and super.afterAll() in test cases
[ https://issues.apache.org/jira/browse/SPARK-31538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094022#comment-17094022 ] Holden Karau commented on SPARK-31538: -- [~hyukjin.kwon]if the Jira is already closed affecting the fix version implies it's already fixed for that version. > Backport SPARK-25338 Ensure to call super.beforeAll() and > super.afterAll() in test cases > -- > > Key: SPARK-31538 > URL: https://issues.apache.org/jira/browse/SPARK-31538 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > Backport SPARK-25338 Ensure to call super.beforeAll() and > super.afterAll() in test cases -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31545) Backport SPARK-27676 InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles
[ https://issues.apache.org/jira/browse/SPARK-31545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094020#comment-17094020 ] Holden Karau commented on SPARK-31545: -- Sounds reasonable, I'll close this as won't fix. > Backport SPARK-27676 InMemoryFileIndex should respect > spark.sql.files.ignoreMissingFiles > -- > > Key: SPARK-31545 > URL: https://issues.apache.org/jira/browse/SPARK-31545 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > Backport SPARK-27676 InMemoryFileIndex should respect > spark.sql.files.ignoreMissingFiles > cc [~joshrosen] I think backporting this has been asked in the original > ticket, do you have any objections? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31545) Backport SPARK-27676 InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles
[ https://issues.apache.org/jira/browse/SPARK-31545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-31545. -- Resolution: Won't Fix > Backport SPARK-27676 InMemoryFileIndex should respect > spark.sql.files.ignoreMissingFiles > -- > > Key: SPARK-31545 > URL: https://issues.apache.org/jira/browse/SPARK-31545 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Major > > Backport SPARK-27676 InMemoryFileIndex should respect > spark.sql.files.ignoreMissingFiles > cc [~joshrosen] I think backporting this has been asked in the original > ticket, do you have any objections? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31583) grouping_id calculation should be improved
Costas Piliotis created SPARK-31583: --- Summary: grouping_id calculation should be improved Key: SPARK-31583 URL: https://issues.apache.org/jira/browse/SPARK-31583 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.5 Reporter: Costas Piliotis Unrelated to SPARK-21858, which identifies that grouping_id is determined by exclusion from a grouping_set rather than inclusion: when performing complex grouping_sets that are not in the order of the base select statement, flipping the bit in the grouping_id seems to happen when the grouping set is identified rather than when the columns are selected in the SQL. I will of course use the exclusion strategy identified in SPARK-21858 as the baseline for this. {code:scala} import spark.implicits._ val df = Seq( ("a","b","c","d"), ("a","b","c","d"), ("a","b","c","d"), ("a","b","c","d") ).toDF("a","b","c","d").createOrReplaceTempView("abc") {code} I expected the columns to map to these bits in the grouping_id: d=1, c=2, b=4, a=8 {code:scala} spark.sql(""" select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin from abc group by GROUPING SETS ( (), (a,b,d), (a,c), (a,d) ) """).show(false) {code} This returns: {noformat} +----+----+----+----+--------+---+-------+ |a |b |c |d |count(1)|gid|gid_bin| +----+----+----+----+--------+---+-------+ |a |null|c |null|4 |6 |110 | |null|null|null|null|4 |15 |1111 | |a |null|null|d |4 |5 |101 | |a |b |null|d |4 |1 |1 | +----+----+----+----+--------+---+-------+ {noformat} In other words, I would have expected the bits to be excluded by select-list position, but I received them excluded in the order they were first seen in the specified grouping sets. a,b,d included = excludes c = 2; expected gid=2. 
received gid=1. a,d included = excludes b=4, c=2; expected gid=6, received gid=5. The grouping_id that actually matches the observed behaviour is (a,b,d,c): {code:scala} spark.sql(""" select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, bin(grouping_id(a,b,d,c)) as gid_bin from abc group by GROUPING SETS ( (), (a,b,d), (a,c), (a,d) ) """).show(false) {code} The columns forming the grouping_id seem to be assigned bits as the grouping sets are identified, rather than by ordinal position in the parent query. I'd like to at least point out that grouping_id is documented in many other RDBMSs, and I believe the Spark project should use a policy of flipping the bits so 1=inclusion; 0=exclusion in the grouping set. However, many RDBMSs that do have a grouping_id feature implement it by the ordinal position of the fields in the select clause, rather than allocating bits as the columns are observed in the grouping sets. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
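The numbering the reporter expects (one bit per column by ordinal position in the SELECT list, with the bit set on exclusion, following the SPARK-21858 semantics) can be written down directly. A hedged Python sketch, not Spark's implementation:

```python
def grouping_id(select_order, grouping_set):
    """Expected grouping_id: one bit per column, ordered by position in the
    SELECT list (leftmost column = highest bit), with a bit set to 1 when
    the column is *excluded* from the grouping set (exclusion semantics)."""
    gid = 0
    for col in select_order:
        gid = (gid << 1) | (0 if col in grouping_set else 1)
    return gid

cols = ["a", "b", "c", "d"]  # a=8, b=4, c=2, d=1
print(grouping_id(cols, {"a", "b", "d"}))  # 2: only c is excluded
print(grouping_id(cols, {"a", "d"}))       # 6: b and c are excluded
```

Under this scheme the grouping set (a,b,d) yields gid 2 and (a,d) yields gid 6, versus the 1 and 5 the query above actually returns.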
[jira] [Resolved] (SPARK-27340) Alias on TimeWindow expression may cause watermark metadata lost
[ https://issues.apache.org/jira/browse/SPARK-27340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27340. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28326 [https://github.com/apache/spark/pull/28326] > Alias on TimeWindow expression may cause watermark metadata lost > - > > Key: SPARK-27340 > URL: https://issues.apache.org/jira/browse/SPARK-27340 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.4.0 >Reporter: Kevin Zhang >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.0.0 > > > When we use the DataFrame API to write a structured streaming query, we usually > specify a watermark on the event-time column. If we define a window on the > event-time column, the delayKey metadata of the event-time column is supposed to be > propagated to the new column generated by the time window expression. But if we > add an additional alias on the time window column, the delayKey metadata is lost. > So far I have only found that the bug affects stream-stream joins with equal > window join keys. For aggregation, the grouping expression can be > trimmed (in the CleanupAliases rule), so additional aliases are removed and the > metadata is kept. 
> Here is an example: > {code:scala} > val sparkSession = SparkSession > .builder() > .master("local") > .getOrCreate() > val rateStream = sparkSession.readStream > .format("rate") > .option("rowsPerSecond", 10) > .load() > val fooStream = rateStream > .select( > col("value").as("fooId"), > col("timestamp").as("fooTime") > ) > .withWatermark("fooTime", "2 seconds") > .select($"fooId", $"fooTime", window($"fooTime", "2 > seconds").alias("fooWindow")) > val barStream = rateStream > .where(col("value") % 2 === 0) > .select( > col("value").as("barId"), > col("timestamp").as("barTime") > ) > .withWatermark("barTime", "2 seconds") > .select($"barId", $"barTime", window($"barTime", "2 > seconds").alias("barWindow")) > val joinedDf = fooStream > .join( > barStream, > $"fooId" === $"barId" && > fooStream.col("fooWindow") === barStream.col("barWindow"), > joinType = "LeftOuter" > ) > val query = joinedDf > .writeStream > .format("console") > .option("truncate", 100) > .trigger(Trigger.ProcessingTime("5 seconds")) > .start() > query.awaitTermination() > {code} > this program will end with an exception, and from the analyzed plan we can > see there is no delayKey metadata on 'fooWindow' > {code:java} > org.apache.spark.sql.AnalysisException: Stream-stream outer join between two > streaming DataFrame/Datasets is not supported without a watermark in the join > keys, or a watermark on the nullable side and an appropriate range condition;; > Join LeftOuter, ((fooId#4L = barId#14L) && (fooWindow#9 = barWindow#19)) > :- Project [fooId#4L, fooTime#5-T2000ms, window#10-T2000ms AS fooWindow#9] > : +- Filter isnotnull(fooTime#5-T2000ms) > : +- Project [named_struct(start, precisetimestampconversion(CASE > WHEN (cast(CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, > TimestampType, LongType) - 0) as double) / cast(200 as double))) as > double) = (cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, > LongType) - 0) as double) / cast(200 as double))) THEN > 
(CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, > LongType) - 0) as double) / cast(200 as double))) + cast(1 as bigint)) > ELSE CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, > LongType) - 0) as double) / cast(200 as double))) END + cast(0 as > bigint)) - cast(1 as bigint)) * 200) + 0), LongType, TimestampType), end, > precisetimestampconversion((CASE WHEN > (cast(CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, > TimestampType, LongType) - 0) as double) / cast(200 as double))) as > double) = (cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, > LongType) - 0) as double) / cast(200 as double))) THEN > (CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, > LongType) - 0) as double) / cast(200 as double))) + cast(1 as bigint)) > ELSE CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, > LongType) - 0) as double) / cast(200 as double))) END + cast(0 as > bigint)) - cast(1 as bigint)) * 200) + 0) + 200), LongType, > TimestampType)) AS window#10-T2000ms, fooId#4L, fooTime#5-T2000ms] > :+- EventTimeWatermark fooTime#5: timestamp, interval 2 seconds > : +- Project [value#1L
[jira] [Assigned] (SPARK-27340) Alias on TimeWindow expression may cause watermark metadata lost
[ https://issues.apache.org/jira/browse/SPARK-27340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-27340: - Assignee: Yuanjian Li > Alias on TimeWindow expression may cause watermark metadata lost > - > > Key: SPARK-27340 > URL: https://issues.apache.org/jira/browse/SPARK-27340 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.4.0 >Reporter: Kevin Zhang >Assignee: Yuanjian Li >Priority: Major > > When we use the DataFrame API to write a structured streaming query, we usually > specify a watermark on the event-time column. If we define a window on the > event-time column, the delayKey metadata of the event-time column is supposed to be > propagated to the new column generated by the time window expression. But if we > add an additional alias on the time window column, the delayKey metadata is lost. > So far I have only found that the bug affects stream-stream joins with equal > window join keys. For aggregation, the grouping expression can be > trimmed (in the CleanupAliases rule), so additional aliases are removed and the > metadata is kept. 
> Here is an example: > {code:scala} > val sparkSession = SparkSession > .builder() > .master("local") > .getOrCreate() > val rateStream = sparkSession.readStream > .format("rate") > .option("rowsPerSecond", 10) > .load() > val fooStream = rateStream > .select( > col("value").as("fooId"), > col("timestamp").as("fooTime") > ) > .withWatermark("fooTime", "2 seconds") > .select($"fooId", $"fooTime", window($"fooTime", "2 > seconds").alias("fooWindow")) > val barStream = rateStream > .where(col("value") % 2 === 0) > .select( > col("value").as("barId"), > col("timestamp").as("barTime") > ) > .withWatermark("barTime", "2 seconds") > .select($"barId", $"barTime", window($"barTime", "2 > seconds").alias("barWindow")) > val joinedDf = fooStream > .join( > barStream, > $"fooId" === $"barId" && > fooStream.col("fooWindow") === barStream.col("barWindow"), > joinType = "LeftOuter" > ) > val query = joinedDf > .writeStream > .format("console") > .option("truncate", 100) > .trigger(Trigger.ProcessingTime("5 seconds")) > .start() > query.awaitTermination() > {code} > this program will end with an exception, and from the analyzed plan we can > see there is no delayKey metadata on 'fooWindow' > {code:java} > org.apache.spark.sql.AnalysisException: Stream-stream outer join between two > streaming DataFrame/Datasets is not supported without a watermark in the join > keys, or a watermark on the nullable side and an appropriate range condition;; > Join LeftOuter, ((fooId#4L = barId#14L) && (fooWindow#9 = barWindow#19)) > :- Project [fooId#4L, fooTime#5-T2000ms, window#10-T2000ms AS fooWindow#9] > : +- Filter isnotnull(fooTime#5-T2000ms) > : +- Project [named_struct(start, precisetimestampconversion(CASE > WHEN (cast(CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, > TimestampType, LongType) - 0) as double) / cast(200 as double))) as > double) = (cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, > LongType) - 0) as double) / cast(200 as double))) THEN > 
(CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, > LongType) - 0) as double) / cast(200 as double))) + cast(1 as bigint)) > ELSE CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, > LongType) - 0) as double) / cast(200 as double))) END + cast(0 as > bigint)) - cast(1 as bigint)) * 200) + 0), LongType, TimestampType), end, > precisetimestampconversion((CASE WHEN > (cast(CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, > TimestampType, LongType) - 0) as double) / cast(200 as double))) as > double) = (cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, > LongType) - 0) as double) / cast(200 as double))) THEN > (CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, > LongType) - 0) as double) / cast(200 as double))) + cast(1 as bigint)) > ELSE CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, > LongType) - 0) as double) / cast(200 as double))) END + cast(0 as > bigint)) - cast(1 as bigint)) * 200) + 0) + 200), LongType, > TimestampType)) AS window#10-T2000ms, fooId#4L, fooTime#5-T2000ms] > :+- EventTimeWatermark fooTime#5: timestamp, interval 2 seconds > : +- Project [value#1L AS fooId#4L, timestamp#0 AS fooTime#5] > : +- StreamingRelationV2 >
[jira] [Created] (SPARK-31582) Be able to not populate Hadoop classpath
DB Tsai created SPARK-31582: --- Summary: Be able to not populate Hadoop classpath Key: SPARK-31582 URL: https://issues.apache.org/jira/browse/SPARK-31582 Project: Spark Issue Type: New Feature Components: YARN Affects Versions: 2.4.5 Reporter: DB Tsai The Spark YARN client populates the Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath`. However, for a Spark build with embedded Hadoop, this can result in jar conflicts because the Spark distribution may contain a different version of the Hadoop jars. We are adding a new YARN configuration to not populate the Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
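If implemented as described, the switch would be a single boolean property. The entry below is illustrative only; the ticket does not pin down the final configuration key:

```
# Illustrative spark-defaults.conf entry -- the property name here is an
# assumption, not something fixed by this ticket. When disabled, the YARN
# client would stop adding yarn.application.classpath and
# mapreduce.application.classpath entries to the container classpath.
spark.yarn.populateHadoopClasspath false
```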
[jira] [Commented] (SPARK-26631) Issue while reading Parquet data from Hadoop Archive files (.har)
[ https://issues.apache.org/jira/browse/SPARK-26631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093781#comment-17093781 ] Tien Dat commented on SPARK-26631: -- Dear experts, this also happens on Spark 2.4.0. Could you please give an update on this issue, or at least the reason it has not been tackled? > Issue while reading Parquet data from Hadoop Archive files (.har) > - > > Key: SPARK-26631 > URL: https://issues.apache.org/jira/browse/SPARK-26631 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Sathish >Priority: Minor > > While reading a Parquet file from a Hadoop Archive file, Spark fails with the > exception below > > {code:java} > scala> val hardf = > sqlContext.read.parquet("har:///tmp/testarchive.har/userdata1.parquet") > org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. > It must be specified manually.; at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622) at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606) ... > 49 elided > {code} > > Whereas the same parquet file can be read normally without any issues: > {code:java} > scala> val df = > sqlContext.read.parquet("hdfs:///tmp/testparquet/userdata1.parquet") > df: org.apache.spark.sql.DataFrame = [registration_dttm: timestamp, id: int > ... 
11 more fields]{code} > > +Here are the steps to reproduce the issue+ > > a) hadoop fs -mkdir /tmp/testparquet > b) Get sample parquet data and rename the file to userdata1.parquet > wget > [https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet?raw=true] > c) hadoop fs -put userdata1.parquet /tmp/testparquet > d) hadoop archive -archiveName testarchive.har -p /tmp/testparquet /tmp > e) We should be able to see the file inside the har file > hadoop fs -ls har:///tmp/testarchive.har > f) Launch spark2 / spark shell > g) > {code:java} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > val df = > sqlContext.read.parquet("har:///tmp/testarchive.har/userdata1.parquet"){code} > Is there anything I am missing here? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31580) Upgrade Apache ORC to 1.5.10
[ https://issues.apache.org/jira/browse/SPARK-31580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31580: -- Issue Type: Bug (was: Improvement) > Upgrade Apache ORC to 1.5.10 > > > Key: SPARK-31580 > URL: https://issues.apache.org/jira/browse/SPARK-31580 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31580) Upgrade Apache ORC to 1.5.10
[ https://issues.apache.org/jira/browse/SPARK-31580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31580: -- Affects Version/s: (was: 3.1.0) 3.0.0 > Upgrade Apache ORC to 1.5.10 > > > Key: SPARK-31580 > URL: https://issues.apache.org/jira/browse/SPARK-31580 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31581) paste0 is always better than paste(sep="")
Michael Chirico created SPARK-31581: --- Summary: paste0 is always better than paste(sep="") Key: SPARK-31581 URL: https://issues.apache.org/jira/browse/SPARK-31581 Project: Spark Issue Type: Documentation Components: R Affects Versions: 2.4.5 Reporter: Michael Chirico paste0 is available in the stated R dependency (3.1), so we should use it instead of paste(sep="") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31580) Upgrade Apache ORC to 1.5.10
Dongjoon Hyun created SPARK-31580: - Summary: Upgrade Apache ORC to 1.5.10 Key: SPARK-31580 URL: https://issues.apache.org/jira/browse/SPARK-31580 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 3.1.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
Maxim Gekk created SPARK-31579: -- Summary: Replace floorDiv by / in localRebaseGregorianToJulianDays() Key: SPARK-31579 URL: https://issues.apache.org/jira/browse/SPARK-31579 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0, but this needs to be checked for all available time zones over the years [0001, 2100] with a step of 1 hour or smaller. If this hypothesis is confirmed, floorDiv can be replaced by /, which should improve the performance of RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
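The proposed replacement is only safe when the remainder is zero. A minimal Python sketch (emulating Java's `Math.floorDiv` and truncating `/` semantics, since the ticket's code is Scala/Java) illustrates why the exhaustive check matters:

```python
MILLIS_PER_DAY = 86_400_000

def floor_div(a, b):
    """Java's Math.floorDiv: rounds toward negative infinity."""
    return a // b  # Python's // already floors

def truncating_div(a, b):
    """Java's `/` operator on longs: rounds toward zero."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q

# When the dividend is an exact multiple of MILLIS_PER_DAY (the ticket's
# hypothesis), both divisions agree, even for negative timestamps:
assert floor_div(-3 * MILLIS_PER_DAY, MILLIS_PER_DAY) == -3
assert truncating_div(-3 * MILLIS_PER_DAY, MILLIS_PER_DAY) == -3

# Otherwise they diverge for negative dividends, which would corrupt
# pre-1970 dates if the hypothesis turned out to be false:
assert floor_div(-1, MILLIS_PER_DAY) == -1
assert truncating_div(-1, MILLIS_PER_DAY) == 0
```

So the swap is a pure win only once the remainder-is-zero property is verified across all time zones in the stated range.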
[jira] [Created] (SPARK-31578) Internal checkSchemaInArrow is inefficient
Michael Chirico created SPARK-31578: --- Summary: Internal checkSchemaInArrow is inefficient Key: SPARK-31578 URL: https://issues.apache.org/jira/browse/SPARK-31578 Project: Spark Issue Type: Documentation Components: R Affects Versions: 2.4.5 Reporter: Michael Chirico The current implementation is doubly inefficient: (1) it repeatedly runs the same (95%-identical) sapply loop, and (2) it applies a scalar == to a vector (== should be applied over the whole vector for efficiency). See the existing PR: https://github.com/apache/spark/pull/28372 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
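The inefficiency pattern is language-agnostic. A hypothetical Python sketch (the actual R fix is in the linked PR; the function and field names here are invented for illustration) shows the idea: derive each type list once and compare the whole vectors, instead of re-deriving them inside a per-element loop with scalar comparisons:

```python
def field_types(schema):
    # stand-in for the repeated work: extracting the type of every field
    return [f["type"] for f in schema]

# Inefficient: re-derives the full type lists on every iteration and
# compares one scalar at a time (the sapply-loop shape from the ticket).
def check_slow(schema_a, schema_b):
    for i in range(len(schema_a)):
        if field_types(schema_a)[i] != field_types(schema_b)[i]:
            return False
    return True

# Better: derive each type list once, then compare the vectors in one shot.
def check_fast(schema_a, schema_b):
    return field_types(schema_a) == field_types(schema_b)

schema_a = [{"name": "mpg", "type": "double"}, {"name": "cyl", "type": "double"}]
schema_b = [{"name": "mpg", "type": "double"}, {"name": "cyl", "type": "integer"}]
assert check_slow(schema_a, schema_b) == check_fast(schema_a, schema_b) == False
```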
[jira] [Resolved] (SPARK-31569) Add links to subsections in SQL Reference main page
[ https://issues.apache.org/jira/browse/SPARK-31569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31569. -- Fix Version/s: 3.0.0 Assignee: Huaxin Gao Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28360 > Add links to subsections in SQL Reference main page > --- > > Key: SPARK-31569 > URL: https://issues.apache.org/jira/browse/SPARK-31569 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > Add links to subsections in SQL Reference main page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31577) fix various problems when check name conflicts of CTE relations
Wenchen Fan created SPARK-31577: --- Summary: fix various problems when check name conflicts of CTE relations Key: SPARK-31577 URL: https://issues.apache.org/jira/browse/SPARK-31577 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31576) Unable to return Hive data into Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED
[ https://issues.apache.org/jira/browse/SPARK-31576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuzhang updated SPARK-31576: - Description: I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. Unfortunately, when I try to query data that resides in every column I get the following error: Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'test.aname': (possible column names are: aname, score, banji) at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335) at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199) 1) On Hive create a simple table,its name is "test",it have three column(aname,score,banji),their type both are "String" 2)important code: object HiveDialect extends JdbcDialect { override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")|| url.contains("hive2") override def quoteIdentifier(colName: String): String = s"`$colName`" } --- object callOffRun { def main(args: Array[String]): Unit = { val spark = SparkSession.builder().enableHiveSupport().getOrCreate() JdbcDialects.registerDialect(HiveDialect) val props = new Properties() props.put("driver","org.apache.hive.jdbc.HiveDriver") props.put("user","username") props.put("password","password") props.put("fetchsize","20") val table=spark.read .jdbc("jdbc:hive2://:1","test",props) table.show() } } 3)spark-submit ,After running,it have error Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'test.aname': (possible column names are: aname, score, banji) 4)table.count() have result 5) I try some method to print result,They all reported the same error was: I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
Unfortunately, when I try to query data that resides in every column I get the following error: Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'test.aname': (possible column names are: aname, score, banji) at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335) at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199) 1) On Hive create a simple table,its name is "test",it have three column(aname,score,banji),their type both are "String" 2)important code: object HiveDialect extends JdbcDialect { override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")|| url.contains("hive2") override def quoteIdentifier(colName: String): String = s"`$colName`" } --- object callOffRun { def main(args: Array[String]): Unit = { val spark = SparkSession.builder().enableHiveSupport().getOrCreate() JdbcDialects.registerDialect(HiveDialect) val props = new Properties() props.put("driver","org.apache.hive.jdbc.HiveDriver") props.put("user","username") props.put("password","password") props.put("fetchsize","20") val table=spark.read .jdbc("jdbc:hive2://:1","test",props) table.show() } } 3)spark-submit ,After running,it have error Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'test.aname': (possible column names are: aname, score, banji) 4)table.count() have result 5) I try some method to print result,They all reported the same error > Unable to return Hive data into Spark via Hive JDBC driver Caused by: > org.apache.hive.service.cli.HiveSQLException: Error while compiling > statement: FAILED > > > Key:
[jira] [Created] (SPARK-31576) Unable to return Hive data into Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED
liuzhang created SPARK-31576: Summary: Unable to return Hive data into Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED Key: SPARK-31576 URL: https://issues.apache.org/jira/browse/SPARK-31576 Project: Spark Issue Type: Bug Components: Spark Shell, Spark Submit Affects Versions: 2.3.1 Environment: hdp 3.0,hadoop 3.1.1,spark 2.3.1 Reporter: liuzhang I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. Unfortunately, when I try to query data that resides in every column I get the following error: Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'test.aname': (possible column names are: aname, score, banji) at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335) at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199) 1) On Hive create a simple table,its name is "test",it have three column(aname,score,banji),their type both are "String" 2)important code: object HiveDialect extends JdbcDialect { override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")|| url.contains("hive2") override def quoteIdentifier(colName: String): String = s"`$colName`" } --- object callOffRun { def main(args: Array[String]): Unit = { val spark = SparkSession.builder().enableHiveSupport().getOrCreate() JdbcDialects.registerDialect(HiveDialect) val props = new Properties() props.put("driver","org.apache.hive.jdbc.HiveDriver") props.put("user","username") props.put("password","password") props.put("fetchsize","20") val table=spark.read .jdbc("jdbc:hive2://:1","test",props) table.show() } } 3)spark-submit ,After running,it have error Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or 
column reference 'test.aname': (possible column names are: aname, score, banji) 4) table.count() does return a result. 5) I tried several ways to print the result; they all reported the same error. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
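The workaround in the report hinges on the `quoteIdentifier` override. A minimal Python sketch (hypothetical helper names, mirroring the Scala `HiveDialect` in the report; this is an illustration, not Spark's generated SQL) shows what backtick-quoting contributes to the query the dialect builds:

```python
def quote_identifier(col_name):
    # mirrors `quoteIdentifier` from the reported Scala HiveDialect:
    # wrap the column name in backticks so Hive treats it as an identifier
    return f"`{col_name}`"

def build_select(columns, table):
    # hypothetical query builder, for illustration only
    quoted = ", ".join(quote_identifier(c) for c in columns)
    return f"SELECT {quoted} FROM {table}"

query = build_select(["aname", "score", "banji"], "test")
assert query == "SELECT `aname`, `score`, `banji` FROM test"
```

The reported error suggests the name reaching Hive is the dotted form 'test.aname'; quoting a dotted name as a single identifier would still be rejected, which is consistent with the dialect override alone not resolving the failure.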
[jira] [Comment Edited] (SPARK-31527) date add/subtract interval only allow those day precision in ansi mode
[ https://issues.apache.org/jira/browse/SPARK-31527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093462#comment-17093462 ] Kent Yao edited comment on SPARK-31527 at 4/27/20, 12:44 PM: - Added a followup PR for benchmark [https://github.com/apache/spark/pull/28369] was (Author: qin yao): Add a followup PR for benchmark [https://github.com/apache/spark/pull/28369] > date add/subtract interval only allow those day precision in ansi mode > -- > > Key: SPARK-31527 > URL: https://issues.apache.org/jira/browse/SPARK-31527 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > Under ANSI mode, we should not allow date add interval with hours, minutes... > microseconds. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31527) date add/subtract interval only allow those day precision in ansi mode
[ https://issues.apache.org/jira/browse/SPARK-31527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093462#comment-17093462 ] Kent Yao commented on SPARK-31527: -- Add a followup PR for benchmark [https://github.com/apache/spark/pull/28369] > date add/subtract interval only allow those day precision in ansi mode > -- > > Key: SPARK-31527 > URL: https://issues.apache.org/jira/browse/SPARK-31527 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > Under ANSI mode, we should not allow date add interval with hours, minutes... > microseconds. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
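The rationale for rejecting sub-day intervals can be seen with plain Python dates (a stand-in for the SQL DATE type, not Spark's implementation): adding a sub-day interval to a date silently drops the sub-day component, which is exactly the ambiguity ANSI mode refuses to allow:

```python
from datetime import date, timedelta

d = date(2020, 1, 1)

# A whole-day interval is well defined on a date:
assert d + timedelta(days=1) == date(2020, 1, 2)

# A sub-day interval is silently truncated -- the 23 hours vanish,
# so under ANSI semantics the operation should be an error instead:
assert d + timedelta(hours=23) == date(2020, 1, 1)
```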
[jira] [Created] (SPARK-31575) Synchronise global JVM security configuration modification
Gabor Somogyi created SPARK-31575: - Summary: Synchronise global JVM security configuration modification Key: SPARK-31575 URL: https://issues.apache.org/jira/browse/SPARK-31575 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Gabor Somogyi -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31529) Remove extra whitespaces in the formatted explain
[ https://issues.apache.org/jira/browse/SPARK-31529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31529. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28315 [https://github.com/apache/spark/pull/28315] > Remove extra whitespaces in the formatted explain > - > > Key: SPARK-31529 > URL: https://issues.apache.org/jira/browse/SPARK-31529 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > The formatted explain includes extra whitespace. Even the number of > spaces is different between master and branch-3.0, which leads to failing > explain tests if we backport to branch-3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31529) Remove extra whitespaces in the formatted explain
[ https://issues.apache.org/jira/browse/SPARK-31529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31529: --- Assignee: wuyi > Remove extra whitespaces in the formatted explain > - > > Key: SPARK-31529 > URL: https://issues.apache.org/jira/browse/SPARK-31529 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > The formatted explain includes extra whitespace. Even the number of > spaces is different between master and branch-3.0, which leads to failing > explain tests if we backport to branch-3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 胡振宇 updated SPARK-14850: Comment: was deleted (was: /* code is for Spark 1.6.1 */ object Example { def main(args: Array[String]) { val conf = new SparkConf().setAppName("Example") val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) import sqlContext.implicits._ val count = sqlContext.sparkContext.parallelize(0 until 1e4.toInt, 1).map { i => (i, Vectors.dense(Array.fill(1e6.toInt)(1.0))) }.toDF().rdd.count() // at this step toDF can be used on Spark 1.6.1 } } so I am not able to test the simple serialization example ) > VectorUDT/MatrixUDT should take primitive arrays without boxing > --- > > Key: SPARK-14850 > URL: https://issues.apache.org/jira/browse/SPARK-14850 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.0.0 > > > In SPARK-9390, we switched to use GenericArrayData to store indices and > values in vector/matrix UDTs. However, GenericArrayData is not specialized > for primitive types. This might hurt MLlib performance badly. We should > consider either specialize GenericArrayData or use a different container. > cc: [~cloud_fan] [~yhuai] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 胡振宇 updated SPARK-14850: Comment: was deleted (was: I tried to run your code on Spark 1.6.1 but I found that "toDF" cannot be used in this example. Here is my code: object Example { def main(args: Array[String]) { case class Test(num: Int, vector: Vector) val conf = new SparkConf().setAppName("Example") val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) import sqlContext.implicits._ val temp = sqlContext.sparkContext.parallelize(0 until 1e4.toInt, 1).map(i => Test(i, Vectors.dense(Array.fill(1e6.toInt)(1.0)))).toDF() // at this step toDF cannot be used } } I also tried sc.parallelize(0 until 1e4.toInt, 1).map { i => (i, Vectors.dense(Array.fill(1e6.toInt)(1.0))) }.toDF.rdd.count() I even used SparkContext directly but toDF still cannot be used. Do you have a solution to run the example on Spark 1.6.1? Thank you ) > VectorUDT/MatrixUDT should take primitive arrays without boxing > --- > > Key: SPARK-14850 > URL: https://issues.apache.org/jira/browse/SPARK-14850 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.0.0 > > > In SPARK-9390, we switched to use GenericArrayData to store indices and > values in vector/matrix UDTs. However, GenericArrayData is not specialized > for primitive types. This might hurt MLlib performance badly. We should > consider either specialize GenericArrayData or use a different container. > cc: [~cloud_fan] [~yhuai] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17604) Support purging aged file entry for FileStreamSource metadata log
[ https://issues.apache.org/jira/browse/SPARK-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-17604: - Affects Version/s: 3.1.0 Labels: (was: bulk-closed) Priority: Major (was: Minor) > Support purging aged file entry for FileStreamSource metadata log > - > > Key: SPARK-17604 > URL: https://issues.apache.org/jira/browse/SPARK-17604 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Saisai Shao >Priority: Major > > Currently, with SPARK-15698, the FileStreamSource metadata log is compacted > periodically (every 10 batches by default), which means a compacted batch file > contains all file entries processed so far. As time passes, the compacted > batch file accumulates into a relatively large file. > With SPARK-17165, {{FileStreamSource}} no longer tracks aged file entries, > but the log still keeps the full records; this is unnecessary and quite > time-consuming during recovery. So here we propose to also add file entry > purging ability to the {{FileStreamSource}} metadata log. > This is pending on SPARK-15698. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
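The compaction-plus-purging idea described in SPARK-17604 can be sketched in pure Python. This is a hypothetical model, not Spark's actual FileStreamSource code; the entry structure and the retention cutoff are assumptions for illustration only:

```python
# Hypothetical model of a file-source metadata log: each batch appends file
# entries, and a periodic compaction rewrites history into one compact file.
def compact(batches):
    # Without purging, the compact file keeps every entry ever seen.
    return [entry for batch in batches for entry in batch]

def compact_with_purge(batches, min_timestamp):
    # With purging, aged entries (older than the retention cutoff) are
    # dropped, keeping the compact file small and recovery fast.
    return [e for b in batches for e in b if e["timestamp"] >= min_timestamp]

batches = [
    [{"path": "f1", "timestamp": 100}],
    [{"path": "f2", "timestamp": 200}],
    [{"path": "f3", "timestamp": 300}],
]
full = compact(batches)                                   # keeps all 3 entries
purged = compact_with_purge(batches, min_timestamp=200)   # drops the aged f1
```

The sketch only shows why an unpurged compact log grows without bound; the real proposal also has to preserve entries the source may still need for deduplication.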
[jira] [Assigned] (SPARK-31550) nondeterministic configurations with general meanings in sql configuration doc
[ https://issues.apache.org/jira/browse/SPARK-31550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31550: Assignee: Kent Yao > nondeterministic configurations with general meanings in sql configuration doc > -- > > Key: SPARK-31550 > URL: https://issues.apache.org/jira/browse/SPARK-31550 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > spark.sql.session.timeZone > spark.sql.warehouse.dir > > these 2 configs are nondeterministic and vary with environments -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17604) Support purging aged file entry for FileStreamSource metadata log
[ https://issues.apache.org/jira/browse/SPARK-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reopened SPARK-17604: -- Reopening this, as an end user reported this on the user mailing list recently. https://lists.apache.org/thread.html/r897771f5526d10d0b13da9177a6b7d2e37b22823c839cceea457%40%3Cuser.spark.apache.org%3E > Support purging aged file entry for FileStreamSource metadata log > - > > Key: SPARK-17604 > URL: https://issues.apache.org/jira/browse/SPARK-17604 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Saisai Shao >Priority: Minor > Labels: bulk-closed > > Currently, with SPARK-15698, the FileStreamSource metadata log is compacted > periodically (every 10 batches by default), which means a compacted batch file > contains all file entries processed so far. As time passes, the compacted > batch file accumulates into a relatively large file. > With SPARK-17165, {{FileStreamSource}} no longer tracks aged file entries, > but the log still keeps the full records; this is unnecessary and quite > time-consuming during recovery. So here we propose to also add file entry > purging ability to the {{FileStreamSource}} metadata log. > This is pending on SPARK-15698. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31550) nondeterministic configurations with general meanings in sql configuration doc
[ https://issues.apache.org/jira/browse/SPARK-31550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31550. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28322 [https://github.com/apache/spark/pull/28322] > nondeterministic configurations with general meanings in sql configuration doc > -- > > Key: SPARK-31550 > URL: https://issues.apache.org/jira/browse/SPARK-31550 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > spark.sql.session.timeZone > spark.sql.warehouse.dir > > these 2 configs are nondeterministic and vary with environments -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31574) Schema evolution in spark while using the storage format as parquet
sharad Gupta created SPARK-31574: Summary: Schema evolution in spark while using the storage format as parquet Key: SPARK-31574 URL: https://issues.apache.org/jira/browse/SPARK-31574 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: sharad Gupta Hi Team, Use case: Suppose there is a table T1 with a column C1 whose datatype is int in schema version 1. While first onboarding table T1, I wrote a couple of parquet files with schema version 1, with parquet as the underlying file format. In schema version 2, the datatype of column C1 changed from int to string, and new data is written with schema version 2 in parquet. So some parquet files are written with schema version 1 and some with schema version 2. Problem statement: 1. We are not able to execute the below command from Spark SQL: ```ALTER TABLE T1 CHANGE C1 C1 string``` 2. So as a workaround I went to Hive and altered the table to change the datatype, because that is supported in Hive, and then tried to read the data in Spark. 
So it is giving me error ``` Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44) at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51) at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:372) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)``` 3. 
Suspecting that this happens because the underlying parquet file was written with integer type while we are reading from a table whose column has been changed to string type. How you can reproduce this: Spark SQL: 1. Create a table from Spark SQL with one column with datatype int, stored as parquet. 2. Put some data into the table. 3. You can see the data if you select from the table. Hive: 1. Change the datatype from int to string with an ALTER command. 2. Try to read the data; you will be able to read it here even after changing the datatype. Spark SQL: 1. Try to read the data from here; now you will see the error. Now the question is how to handle schema evolution in Spark while using parquet as the storage format. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
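The failure mode in SPARK-31574 can be sketched in pure Python. This is a hypothetical model (real parquet readers decode typed column chunks, not dicts), but the shape of the problem is the same: each file carries the schema it was written with, and a reader that trusts only the table schema chokes on older files:

```python
def read_column(file_value, file_type, table_type):
    # A reader that assumes the table schema matches the file schema fails
    # when an old file (written as int) is read through a string column --
    # the analogue of the PlainIntegerDictionary decodeToBinary error above.
    if file_type != table_type:
        raise TypeError(f"cannot decode {file_type} data as {table_type}")
    return file_value

# One file written under schema v2 (string) and one under v1 (int):
files = [("1", "string"), (1, "int")]
errors = []
for value, file_type in files:
    try:
        read_column(value, file_type, table_type="string")
    except TypeError as e:
        errors.append(str(e))
```

Only the v1 file fails, which matches the report: rows written after the Hive ALTER read fine, while the pre-ALTER files raise.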
[jira] [Resolved] (SPARK-31568) R: gapply documentation could be clearer about what the func argument is
[ https://issues.apache.org/jira/browse/SPARK-31568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31568. -- Fix Version/s: 3.0.0 2.4.6 Assignee: Michael Chirico Resolution: Fixed Fixed in https://github.com/apache/spark/pull/28350 > R: gapply documentation could be clearer about what the func argument is > > > Key: SPARK-31568 > URL: https://issues.apache.org/jira/browse/SPARK-31568 > Project: Spark > Issue Type: Documentation > Components: R >Affects Versions: 2.4.5 >Reporter: Michael Chirico >Assignee: Michael Chirico >Priority: Minor > Fix For: 2.4.6, 3.0.0 > > > copied from pre-existing GH PR: > https://github.com/apache/spark/pull/28350 > Spent a long time this weekend trying to figure out just what exactly key is > in gapply's func. I had assumed it would be a named list, but apparently not > -- the examples are working because schema is applying the name and the names > of the output data.frame don't matter. > As near as I can tell the description I've added is correct, namely, that key > is an unnamed list. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
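The point about `key` in SPARK-31568 can be illustrated with a plain-Python analogue of gapply (hypothetical helper, not SparkR code): the key handed to func is a bare tuple/list of grouping values with no names attached, and the output column names come from the schema, not from func:

```python
from itertools import groupby
from operator import itemgetter

def gapply_like(rows, key_cols, func):
    # The key passed to func is an unnamed value (or tuple of values),
    # mirroring how SparkR's gapply passes func an unnamed list.
    keyed = sorted(rows, key=itemgetter(*key_cols))
    out = []
    for key, group in groupby(keyed, key=itemgetter(*key_cols)):
        out.extend(func(key, list(group)))
    return out

rows = [{"cyl": 4, "mpg": 30}, {"cyl": 4, "mpg": 32}, {"cyl": 6, "mpg": 20}]
# func receives the raw key value; any naming happens downstream ("schema").
result = gapply_like(rows, ("cyl",), lambda key, g: [(key, len(g))])
```

Because func never sees column names on the key, examples "work" only because the schema applies names afterwards, which is exactly what the documentation fix spells out.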
[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors
[ https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093151#comment-17093151 ] Gabor Somogyi commented on SPARK-12312: --- Not sure what you mean by renewal. Renewal of what? * When the keytab is invalid because the password has changed, then the same must happen just like in all other use-cases (re-distribute the keytab file, which will be picked up properly when a new connection is initiated) * When the TGT is in question, the solution re-obtains the TGT automatically each and every time a new connection is created I'm not aware of any other thing which can be renewed. > JDBC connection to Kerberos secured databases fails on remote executors > --- > > Key: SPARK-12312 > URL: https://issues.apache.org/jira/browse/SPARK-12312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 2.4.2 >Reporter: nabacg >Priority: Minor > > When loading DataFrames from JDBC datasource with Kerberos authentication, > remote executors (yarn-client/cluster etc. modes) fail to establish a > connection due to lack of Kerberos ticket or ability to generate it. > This is a real issue when trying to ingest data from kerberized data sources > (SQL Server, Oracle) in enterprise environment where exposing simple > authentication access is not an option due to IT policy issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31573) Use fixed=TRUE where possible for internal efficiency
Michael Chirico created SPARK-31573: --- Summary: Use fixed=TRUE where possible for internal efficiency Key: SPARK-31573 URL: https://issues.apache.org/jira/browse/SPARK-31573 Project: Spark Issue Type: Documentation Components: R Affects Versions: 2.4.5 Reporter: Michael Chirico gsub('_', '', x) is more efficient if we signal there's no regex: gsub('_', '', x, fixed = TRUE) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
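The same distinction SPARK-31573 draws for R's `gsub(..., fixed = TRUE)` exists in Python as the difference between `str.replace` (literal) and `re.sub` (regex) -- a rough cross-language analogy, shown here because the digest's examples span several languages:

```python
import re

s = "spark_sql_config"

# Literal replacement: no regex engine involved; cheaper, and safe even if
# the pattern contains metacharacters like '.' or '['.
literal = s.replace("_", "")

# Regex replacement: the pattern is compiled and interpreted as a regex,
# which is pure overhead when the pattern is a fixed string.
regex = re.sub("_", "", s)
```

Both produce the same result for a fixed pattern; signalling "no regex" just skips the machinery, which is the whole point of `fixed = TRUE`.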
[jira] [Commented] (SPARK-31448) Difference in Storage Levels used in cache() and persist() for pyspark dataframes
[ https://issues.apache.org/jira/browse/SPARK-31448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093079#comment-17093079 ] Abhishek Dixit commented on SPARK-31448: Any update on this? > Difference in Storage Levels used in cache() and persist() for pyspark > dataframes > - > > Key: SPARK-31448 > URL: https://issues.apache.org/jira/browse/SPARK-31448 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Abhishek Dixit >Priority: Major > > There is a difference in default storage level *MEMORY_AND_DISK* in pyspark > and scala. > *Scala*: StorageLevel(true, true, false, true) > *Pyspark:* StorageLevel(True, True, False, False) > > *Problem Description:* > Calling *df.cache()* for pyspark dataframe directly invokes Scala method > cache() and Storage Level used is StorageLevel(true, true, false, true). > But calling *df.persist()* for pyspark dataframe sets the > newStorageLevel=StorageLevel(true, true, false, false) inside pyspark and > then invokes Scala function persist(newStorageLevel). > *Possible Fix:* > Invoke pyspark function persist inside pyspark function cache instead of > calling the scala function directly. > I can raise a PR for this fix if someone can confirm that this is a bug and > the possible fix is the correct approach. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
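The mismatch reported in SPARK-31448 can be made concrete by modeling the four StorageLevel flags as a plain tuple. This is a sketch of the values reported in the issue, not PySpark's actual StorageLevel class:

```python
from collections import namedtuple

# Flags in Spark's order: useDisk, useMemory, useOffHeap, deserialized.
StorageLevel = namedtuple("StorageLevel", "use_disk use_memory use_off_heap deserialized")

# MEMORY_AND_DISK defaults as reported in this issue:
scala_default = StorageLevel(True, True, False, True)     # what df.cache() hits via Scala
pyspark_default = StorageLevel(True, True, False, False)  # what df.persist() sets in Python
```

The only disagreement is the `deserialized` flag, so the same dataframe is cached serialized or deserialized depending on which API the user called -- which is why routing `cache()` through the Python `persist()` would make the two paths consistent.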
[jira] [Created] (SPARK-31572) Improve task logs at executor side
wuyi created SPARK-31572: Summary: Improve task logs at executor side Key: SPARK-31572 URL: https://issues.apache.org/jira/browse/SPARK-31572 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 3.0.0 Reporter: wuyi In some places, task names have different formats between the driver and the executor, which makes it harder for users to debug task-level slowness. We can also add more logs to help with debugging. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24194) HadoopFsRelation cannot overwrite a path that is also being read from
[ https://issues.apache.org/jira/browse/SPARK-24194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093055#comment-17093055 ] philipse commented on SPARK-24194: -- Hi, is the issue closed? Can I try it in a production env? Thanks > HadoopFsRelation cannot overwrite a path that is also being read from > - > > Key: SPARK-24194 > URL: https://issues.apache.org/jira/browse/SPARK-24194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 > Environment: spark master >Reporter: yangz >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > When > {code:java} > INSERT OVERWRITE TABLE territory_count_compare select * from > territory_count_compare where shop_count!=real_shop_count > {code} > and territory_count_compare is a parquet table, there will be an error: > Cannot overwrite a path that is also being read from > > And the file MetastoreDataSourceSuite.scala has a test case: > > > {code:java} > table(tableName).write.mode(SaveMode.Overwrite).insertInto(tableName) > {code} > > But when the table territory_count_compare is a common hive table, there is > no error. > So I think the reason is that when inserting overwrite into a HadoopFs relation with > a static partition, it first deletes the partition in the output; but the deletion should > happen at the time the job is committed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
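The delete-at-start versus delete-at-commit distinction behind SPARK-24194 can be sketched with plain files. The helper name is hypothetical and Spark's actual commit protocol is far more involved; the point is only the ordering hazard:

```python
import os
import tempfile

def overwrite_staged(path, transform):
    # Read first, write to a staging file, then swap at commit time --
    # so the input survives until the job commits, even when the input
    # path and the output path are the same.
    with open(path) as f:
        data = transform(f.read())
    staging = path + ".staging"
    with open(staging, "w") as f:
        f.write(data)
    os.replace(staging, path)  # commit: atomically replace the old output

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "part-00000")
    with open(p, "w") as f:
        f.write("shop_count!=real_shop_count\n")
    # Deleting the output up front (what the naive path does) would destroy
    # the input here, since input path == output path.
    overwrite_staged(p, str.upper)
    with open(p) as f:
        result = f.read()
```

Deferring the destructive step to commit is what makes self-overwrite safe for plain Hive tables, and is the behavior the reporter asks for on the HadoopFsRelation path.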
[jira] [Created] (SPARK-31571) don't use stop(paste to build R errors
Michael Chirico created SPARK-31571: --- Summary: don't use stop(paste to build R errors Key: SPARK-31571 URL: https://issues.apache.org/jira/browse/SPARK-31571 Project: Spark Issue Type: Documentation Components: R Affects Versions: 2.4.5 Reporter: Michael Chirico I notice for example this: stop(paste0("Arrow optimization does not support 'dapplyCollect' yet. Please disable ", "Arrow optimization or use 'collect' and 'dapply' APIs instead.")) paste0 is totally unnecessary here -- stop itself accepts ... (varargs) and joins them with a '' separator, i.e., the above is equivalent to: stop("Arrow optimization does not support 'dapplyCollect' yet. Please disable ", "Arrow optimization or use 'collect' and 'dapply' APIs instead.") More generally, for portability, this makes it more difficult for user-contributed translations because the standard set of tools for doing this (namely tools::update_pkg_po('.')) would fail to capture these messages as candidates for translation. In fact, it's completely preferable IMO to keep the entire stop("") message as a single string -- I've found that breaking the string across multiple lines makes translation across different languages with different grammars quite difficult. I understand there are lint style constraints, however, so I wouldn't press on that for now. If formatting is needed, I recommend using stop(gettextf(...)) instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
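The translation argument in SPARK-31571 has a direct Python analogue: gettext-style extraction tools scan source for whole string literals, so a message glued together at call time is invisible to them, while a single literal (Python concatenates adjacent literals at compile time, much as R's stop joins its varargs) plus a placeholder stays extractable. A sketch, with a stub standing in for a real gettext catalog:

```python
# Stub standing in for gettext.gettext; a real catalog would map the whole
# literal below to a translated template. The stub is an assumption here --
# no actual .po catalog is involved.
def _(message):
    return message

api = "dapplyCollect"
# One extractable literal with a placeholder, not fragments concatenated at
# call time; adjacent literals are joined by the compiler, so the extractor
# still sees a single message.
msg = _("Arrow optimization does not support '%s' yet. "
        "Please disable Arrow optimization or use 'collect' and 'dapply' "
        "APIs instead.") % api
```

This mirrors the stop(gettextf(...)) recommendation: keep the whole message one translatable template and substitute values into it, rather than building the sentence from pieces.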
[jira] [Assigned] (SPARK-31485) Barrier stage can hang if only partial tasks launched
[ https://issues.apache.org/jira/browse/SPARK-31485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31485: --- Assignee: wuyi > Barrier stage can hang if only partial tasks launched > - > > Key: SPARK-31485 > URL: https://issues.apache.org/jira/browse/SPARK-31485 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > The issue can be reproduced by following test: > > {code:java} > initLocalClusterSparkContext(2) > val rdd0 = sc.parallelize(Seq(0, 1, 2, 3), 2) > val dep = new OneToOneDependency[Int](rdd0) > val rdd = new MyRDD(sc, 2, List(dep), > Seq(Seq("executor_h_0"),Seq("executor_h_0"))) > rdd.barrier().mapPartitions { iter => > BarrierTaskContext.get().barrier() > iter > }.collect() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31485) Barrier stage can hang if only partial tasks launched
[ https://issues.apache.org/jira/browse/SPARK-31485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31485. - Fix Version/s: 2.4.6 Resolution: Fixed Issue resolved by pull request 28357 [https://github.com/apache/spark/pull/28357] > Barrier stage can hang if only partial tasks launched > - > > Key: SPARK-31485 > URL: https://issues.apache.org/jira/browse/SPARK-31485 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 2.4.6 > > > The issue can be reproduced by following test: > > {code:java} > initLocalClusterSparkContext(2) > val rdd0 = sc.parallelize(Seq(0, 1, 2, 3), 2) > val dep = new OneToOneDependency[Int](rdd0) > val rdd = new MyRDD(sc, 2, List(dep), > Seq(Seq("executor_h_0"),Seq("executor_h_0"))) > rdd.barrier().mapPartitions { iter => > BarrierTaskContext.get().barrier() > iter > }.collect() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
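The hang in SPARK-31485 can be reproduced in miniature with a threading barrier -- a pure-Python analogy, not Spark code: a barrier sized for two tasks where only one task is ever launched leaves the launched task waiting forever; here the wait is bounded with a timeout so the sketch terminates:

```python
import threading

# Barrier sized for 2 tasks, but only 1 task is "launched" (i.e. calls wait),
# mirroring a barrier stage where the second task never gets a slot on the
# required executor.
barrier = threading.Barrier(parties=2)
hung = False
try:
    # The lone task waits for a peer that never arrives; with no timeout this
    # blocks forever, which is the reported hang.
    barrier.wait(timeout=0.2)
except threading.BrokenBarrierError:
    hung = True
```

The fix on the Spark side is to avoid launching a barrier stage at all unless slots for every task are available, so no task ever enters the barrier alone.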