[jira] [Commented] (SPARK-29503) MapObjects doesn't copy Unsafe data when nested under Safe data
[ https://issues.apache.org/jira/browse/SPARK-29503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955090#comment-16955090 ] Jungtaek Lim commented on SPARK-29503: -- Thanks for reporting the issue with such detailed information! I've submitted a PR based on your observation. Please take a look. > MapObjects doesn't copy Unsafe data when nested under Safe data > --- > > Key: SPARK-29503 > URL: https://issues.apache.org/jira/browse/SPARK-29503 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 3.0.0 >Reporter: Aaron Lewis >Priority: Major > Labels: correctness > > In order for MapObjects to operate safely, it checks whether the result of > the mapping function is an Unsafe type (UnsafeRow, UnsafeArrayData, > UnsafeMapData) and performs a copy before writing it into MapObjects' output > array. This protects against expressions which reuse the same native > memory buffer to represent their results across evaluations; if the copy weren't > there, all results would point to the same native buffer and would > represent the last result written to the buffer. However, MapObjects misses > this needed copy if the Unsafe data is nested below some safe structure, for > instance a GenericArrayData whose elements are all UnsafeRows. In this > scenario, all elements of the GenericArrayData will point to the same > native UnsafeRow buffer, which will hold the last value written to it. > > Right now, this bug seems to occur only when a `ProjectExec` goes down the > `execute` path, as opposed to WholeStageCodegen's `produce` and `consume` > path. 
> 
> Example Reproduction Code:
> {code:scala}
> import org.apache.spark.sql.catalyst.expressions.objects.MapObjects
> import org.apache.spark.sql.catalyst.expressions.CreateArray
> import org.apache.spark.sql.catalyst.expressions.Expression
> import org.apache.spark.sql.functions.{array, struct}
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.types.ArrayType
> 
> // For the purpose of demonstration, we need to disable WholeStage codegen
> spark.conf.set("spark.sql.codegen.wholeStage", "false")
> 
> val exampleDS = spark.sparkContext.parallelize(Seq(Seq(1, 2, 3))).toDF("items")
> 
> // Trivial example: Nest unsafe struct inside safe array
> // items: Seq[Int] => items.map{item => Seq(Struct(item))}
> val result = exampleDS.select(
>   new Column(MapObjects(
>     {item: Expression => array(struct(new Column(item))).expr},
>     $"items".expr,
>     exampleDS.schema("items").dataType.asInstanceOf[ArrayType].elementType
>   )) as "items"
> )
> result.show(10, false)
> {code}
> 
> Actual Output:
> {code:java}
> +---------------------------------------------------------+
> |items                                                    |
> +---------------------------------------------------------+
> |[WrappedArray([3]), WrappedArray([3]), WrappedArray([3])]|
> +---------------------------------------------------------+
> {code}
> 
> Expected Output:
> {code:java}
> +---------------------------------------------------------+
> |items                                                    |
> +---------------------------------------------------------+
> |[WrappedArray([1]), WrappedArray([2]), WrappedArray([3])]|
> +---------------------------------------------------------+
> {code}
> 
> We've confirmed that the bug exists on version 2.1.1 as well as on master
> (which I assume corresponds to version 3.0.0?)
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
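Conceptually, the bug above is plain buffer aliasing. The following minimal Python sketch (purely illustrative; none of these names are Spark APIs) mimics an expression that reuses one result buffer across evaluations: storing raw references reproduces the "all rows equal the last value" symptom from the Actual Output, while copying before appending yields the Expected Output.

```python
# A producer that reuses a single mutable buffer across evaluations,
# the way an Unsafe-producing expression reuses its native buffer.
def make_producer():
    buf = [None]  # one shared buffer, overwritten on every call

    def produce(x):
        buf[0] = x
        return buf  # always returns the SAME list object

    return produce

def map_objects(items, copy_results):
    """Map `produce` over `items`, either copying each result (safe)
    or storing the raw reference (the bug)."""
    produce = make_producer()
    out = []
    for x in items:
        result = produce(x)
        # The copy is the crucial step MapObjects misses for nested data.
        out.append(list(result) if copy_results else result)
    return out

print(map_objects([1, 2, 3], copy_results=False))  # [[3], [3], [3]]
print(map_objects([1, 2, 3], copy_results=True))   # [[1], [2], [3]]
```

In MapObjects' terms, `list(result)` plays the role of the copy that must also happen when the Unsafe result is nested inside a safe container such as GenericArrayData.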
[jira] [Created] (SPARK-29517) TRUNCATE TABLE should look up catalog/table like v2 commands
L. C. Hsieh created SPARK-29517: --- Summary: TRUNCATE TABLE should look up catalog/table like v2 commands Key: SPARK-29517 URL: https://issues.apache.org/jira/browse/SPARK-29517 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: L. C. Hsieh TRUNCATE TABLE should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29517) TRUNCATE TABLE should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-29517: --- Assignee: L. C. Hsieh > TRUNCATE TABLE should look up catalog/table like v2 commands > > > Key: SPARK-29517 > URL: https://issues.apache.org/jira/browse/SPARK-29517 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > TRUNCATE TABLE should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29512) REPAIR TABLE should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-29512. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26168 [https://github.com/apache/spark/pull/26168] > REPAIR TABLE should look up catalog/table like v2 commands > -- > > Key: SPARK-29512 > URL: https://issues.apache.org/jira/browse/SPARK-29512 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > REPAIR TABLE should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29512) REPAIR TABLE should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-29512: --- Assignee: Terry Kim > REPAIR TABLE should look up catalog/table like v2 commands > -- > > Key: SPARK-29512 > URL: https://issues.apache.org/jira/browse/SPARK-29512 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > REPAIR TABLE should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29516) Test ThriftServerQueryTestSuite asynchronously
Yuming Wang created SPARK-29516: --- Summary: Test ThriftServerQueryTestSuite asynchronously Key: SPARK-29516 URL: https://issues.apache.org/jira/browse/SPARK-29516 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang spark.sql.hive.thriftServer.async -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955061#comment-16955061 ] zhao bo commented on SPARK-29106: - Thanks [~shaneknapp]. It's great that you will share the full jenkins configuration test code with us. ;) We are very happy with the test result of the first periodic arm test job. Building a powerful ARM testing architecture is still hard work for us; our team plans to integrate more, higher-performance ARM VMs into the community to support PullRequest-triggered testing jobs, and more work to improve test execution to meet the PullRequest trigger requirements is still ahead of us. Now let's see the test result. > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > > Add arm test jobs to amplab jenkins for spark. > So far we have made two periodic arm test jobs for spark in OpenLab: one is > based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), > the other is based on a new branch which we made on 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrating the arm test with amplab > jenkins. > About the k8s test on arm, we have tested it, see > [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it > later. > And we plan to test on other stable branches too, and we can integrate them into > amplab when they are ready. 
> We have offered an arm instance and sent the info to shane knapp; thanks > shane for adding the first arm job to amplab jenkins :) > The other important thing is about leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > and we can see there is no arm64 support. So we built an arm64-supporting > release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like > 'property'/'profile' to choose the correct jar package on arm or x86 platforms, > because spark depends on some hadoop packages like hadoop-hdfs, and those packages > depend on leveldbjni-all-1.8 too, unless hadoop releases with a new arm-supporting > leveldbjni jar. For now we download the leveldbjni-all-1.8 of > openlabtesting and 'mvn install' it when testing spark on arm. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
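For context on the 'property'/'profile' idea the description mentions, a per-platform Maven profile would ordinarily look like the hypothetical sketch below (the property name and activation are illustrative, not from Spark's pom.xml). As the description explains, this alone does not help Spark, because hadoop's transitive dependencies still pin org.fusesource.leveldbjni:leveldbjni-all:1.8.

```xml
<!-- Hypothetical sketch: select a platform-specific leveldbjni groupId
     via an OS-arch-activated profile. Not part of Spark's actual pom.xml. -->
<profiles>
  <profile>
    <id>arm64</id>
    <activation>
      <os>
        <arch>aarch64</arch>
      </os>
    </activation>
    <properties>
      <leveldbjni.group>org.openlabtesting.leveldbjni</leveldbjni.group>
    </properties>
  </profile>
</profiles>
<!-- A dependency could then reference ${leveldbjni.group}, but transitive
     hadoop dependencies on org.fusesource.leveldbjni would remain. -->
```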
[jira] [Updated] (SPARK-29494) ArrayOutOfBoundsException when converting from string to timestamp
[ https://issues.apache.org/jira/browse/SPARK-29494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29494: - Fix Version/s: (was: 2.4.5) > ArrayOutOfBoundsException when converting from string to timestamp > -- > > Key: SPARK-29494 > URL: https://issues.apache.org/jira/browse/SPARK-29494 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rahul Shivu Mahadev >Assignee: Rahul Shivu Mahadev >Priority: Minor > Fix For: 3.0.0 > > > In a couple of scenarios while converting from String to Timestamp, > `DateTimeUtils.stringToTimestamp` throws an array out of bounds exception if > there are trailing spaces or a ':'. This method is required to > return `None` when the format of the string is incorrect. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29502) typed interval expression should fail for invalid format
[ https://issues.apache.org/jira/browse/SPARK-29502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-29502. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26151 [https://github.com/apache/spark/pull/26151] > typed interval expression should fail for invalid format > > > Key: SPARK-29502 > URL: https://issues.apache.org/jira/browse/SPARK-29502 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29466) Show `Duration` for running drivers in Standalone master web UI
[ https://issues.apache.org/jira/browse/SPARK-29466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29466. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26113 [https://github.com/apache/spark/pull/26113] > Show `Duration` for running drivers in Standalone master web UI > --- > > Key: SPARK-29466 > URL: https://issues.apache.org/jira/browse/SPARK-29466 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > > This issue aims to add a new column for `Duration` for running drivers table > in `Standalone` master web UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29466) Show `Duration` for running drivers in Standalone master web UI
[ https://issues.apache.org/jira/browse/SPARK-29466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29466: - Assignee: Dongjoon Hyun > Show `Duration` for running drivers in Standalone master web UI > --- > > Key: SPARK-29466 > URL: https://issues.apache.org/jira/browse/SPARK-29466 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > > This issue aims to add a new column for `Duration` for running drivers table > in `Standalone` master web UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29494) ArrayOutOfBoundsException when converting from string to timestamp
[ https://issues.apache.org/jira/browse/SPARK-29494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29494: Assignee: Rahul Shivu Mahadev > ArrayOutOfBoundsException when converting from string to timestamp > -- > > Key: SPARK-29494 > URL: https://issues.apache.org/jira/browse/SPARK-29494 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rahul Shivu Mahadev >Assignee: Rahul Shivu Mahadev >Priority: Minor > > In a couple of scenarios while converting from String to Timestamp, > `DateTimeUtils.stringToTimestamp` throws an array out of bounds exception if > there are trailing spaces or a ':'. This method is required to > return `None` when the format of the string is incorrect. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29494) ArrayOutOfBoundsException when converting from string to timestamp
[ https://issues.apache.org/jira/browse/SPARK-29494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29494. -- Fix Version/s: 3.0.0 2.4.5 Resolution: Fixed Issue resolved by pull request 26143 [https://github.com/apache/spark/pull/26143] > ArrayOutOfBoundsException when converting from string to timestamp > -- > > Key: SPARK-29494 > URL: https://issues.apache.org/jira/browse/SPARK-29494 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rahul Shivu Mahadev >Assignee: Rahul Shivu Mahadev >Priority: Minor > Fix For: 2.4.5, 3.0.0 > > > In a couple of scenarios while converting from String to Timestamp, > `DateTimeUtils.stringToTimestamp` throws an array out of bounds exception if > there are trailing spaces or a ':'. This method is required to > return `None` when the format of the string is incorrect. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29494) ArrayOutOfBoundsException when converting from string to timestamp
[ https://issues.apache.org/jira/browse/SPARK-29494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29494: - Priority: Minor (was: Major) > ArrayOutOfBoundsException when converting from string to timestamp > -- > > Key: SPARK-29494 > URL: https://issues.apache.org/jira/browse/SPARK-29494 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rahul Shivu Mahadev >Priority: Minor > > In a couple of scenarios while converting from String to Timestamp, > `DateTimeUtils.stringToTimestamp` throws an array out of bounds exception if > there are trailing spaces or a ':'. This method is required to > return `None` when the format of the string is incorrect. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
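The contract described in SPARK-29494 (return `None` on malformed input rather than let an index run past the end of the string) can be illustrated with a small, hypothetical Python sketch. This is not Spark's actual `DateTimeUtils.stringToTimestamp` implementation; it only demonstrates the expected return-None-on-bad-format behavior for inputs with trailing spaces or a dangling ':'.

```python
# Illustrative only: a parser in the spirit of stringToTimestamp that
# returns None for malformed input instead of raising.
from datetime import datetime
from typing import Optional

def string_to_timestamp(s: str) -> Optional[datetime]:
    try:
        return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Trailing spaces, a dangling ':', or any other malformed input
        # must yield None rather than an exception.
        return None

print(string_to_timestamp("2019-10-18 12:30:00"))   # → 2019-10-18 12:30:00
print(string_to_timestamp("2019-10-18 12:30:00:"))  # → None (dangling ':')
print(string_to_timestamp("2019-10-18 12:30:00 "))  # → None (trailing space)
```

The fix in PR 26143 enforces the same principle inside Spark: bounds problems on such inputs surface as a `None` result, not an ArrayIndexOutOfBoundsException.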
[jira] [Assigned] (SPARK-29515) MapStatuses SerDeser Benchmark
[ https://issues.apache.org/jira/browse/SPARK-29515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-29515: --- Assignee: DB Tsai > MapStatuses SerDeser Benchmark > -- > > Key: SPARK-29515 > URL: https://issues.apache.org/jira/browse/SPARK-29515 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29515) MapStatuses SerDeser Benchmark
[ https://issues.apache.org/jira/browse/SPARK-29515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-29515. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26169 [https://github.com/apache/spark/pull/26169] > MapStatuses SerDeser Benchmark > -- > > Key: SPARK-29515 > URL: https://issues.apache.org/jira/browse/SPARK-29515 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29515) MapStatuses SerDeser Benchmark
[ https://issues.apache.org/jira/browse/SPARK-29515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-29515: Affects Version/s: (was: 3.0.0) 2.4.4 > MapStatuses SerDeser Benchmark > -- > > Key: SPARK-29515 > URL: https://issues.apache.org/jira/browse/SPARK-29515 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: DB Tsai >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29515) MapStatuses SerDeser Benchmark
DB Tsai created SPARK-29515: --- Summary: MapStatuses SerDeser Benchmark Key: SPARK-29515 URL: https://issues.apache.org/jira/browse/SPARK-29515 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 3.0.0 Reporter: DB Tsai -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954912#comment-16954912 ] Shane Knapp commented on SPARK-29106: - also, i will be exploring the purchase of an ARM server for our cluster. the VM is just not going to be enough for our purposes. this won't happen immediately, so we'll use the VM until then. > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954904#comment-16954904 ] Shane Knapp commented on SPARK-29106: - i'm actually not going to use the script – the testing code will be in the jenkins job config: [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/] once i get the build config sorted and working as expected i'll be sure to give you all a copy. :) > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29503) MapObjects doesn't copy Unsafe data when nested under Safe data
[ https://issues.apache.org/jira/browse/SPARK-29503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-29503: Description: In order for MapObjects to operate safely, it checks to see if the result of the mapping function is an Unsafe type (UnsafeRow, UnsafeArrayData, UnsafeMapData) and performs a copy before writing it into MapObjects' output array. This is to protect against expressions which re-use the same native memory buffer to represent its result across evaluations; if the copy wasn't here, all results would be pointing to the same native buffer and would represent the last result written to the buffer. However, MapObjects misses this needed copy if the Unsafe data is nested below some safe structure, for instance a GenericArrrayData whose elements are all UnsafeRows. In this scenario, all elements of the GenericArrayData will be pointing to the same native UnsafeRow buffer which will hold the last value written to it. Right now, this bug seems to only occur when a `ProjectExec` goes down the `execute` path, as opposed to WholeStageCodegen's `produce` and `consume` path. 
Example Reproduction Code: {code:scala} import org.apache.spark.sql.catalyst.expressions.objects.MapObjects import org.apache.spark.sql.catalyst.expressions.CreateArray import org.apache.spark.sql.catalyst.expressions.Expression import org.apache.spark.sql.functions.{array, struct} import org.apache.spark.sql.Column import org.apache.spark.sql.types.ArrayType // For the purpose of demonstration, we need to disable WholeStage codegen spark.conf.set("spark.sql.codegen.wholeStage", "false") val exampleDS = spark.sparkContext.parallelize(Seq(Seq(1, 2, 3))).toDF("items") // Trivial example: Nest unsafe struct inside safe array // items: Seq[Int] => items.map{item => Seq(Struct(item))} val result = exampleDS.select( new Column(MapObjects( {item: Expression => array(struct(new Column(item))).expr}, $"items".expr, exampleDS.schema("items").dataType.asInstanceOf[ArrayType].elementType )) as "items" ) result.show(10, false) {code} Actual Output: {code:java} +-+ |items| +-+ |[WrappedArray([3]), WrappedArray([3]), WrappedArray([3])]| +-+ {code} Expected Output: {code:java} +-+ |items| +-+ |[WrappedArray([1]), WrappedArray([2]), WrappedArray([3])]| +-+ {code} We've confirmed that the bug exists on version 2.1.1 as well as on master (which I assume corresponds to version 3.0.0?) was: *strong text*In order for MapObjects to operate safely, it checks to see if the result of the mapping function is an Unsafe type (UnsafeRow, UnsafeArrayData, UnsafeMapData) and performs a copy before writing it into MapObjects' output array. This is to protect against expressions which re-use the same native memory buffer to represent its result across evaluations; if the copy wasn't here, all results would be pointing to the same native buffer and would represent the last result written to the buffer. However, MapObjects misses this needed copy if the Unsafe data is nested below some safe structure, for instance a GenericArrrayData whose elements are all UnsafeRows. 
In this scenario, all elements of the GenericArrayData will be pointing to the same native UnsafeRow buffer which will hold the last value written to it. Right now, this bug seems to only occur when a `ProjectExec` goes down the `execute` path, as opposed to WholeStageCodegen's `produce` and `consume` path. Example Reproduction Code: {code:scala} import org.apache.spark.sql.catalyst.expressions.objects.MapObjects import org.apache.spark.sql.catalyst.expressions.CreateArray import org.apache.spark.sql.catalyst.expressions.Expression import org.apache.spark.sql.functions.{array, struct} import org.apache.spark.sql.Column import org.apache.spark.sql.types.ArrayType // For the purpose of demonstration, we need to disable WholeStage codegen spark.conf.set("spark.sql.codegen.wholeStage", "false") val exampleDS = spark.sparkContext.parallelize(Seq(Seq(1, 2, 3))).toDF("items") // Trivial example: Nest unsafe struct inside safe array // items: Seq[Int] => items.map{item => Seq(Struct(item))} val result = exampleDS.select( new Column(MapObjects( {item: Expression => array(struct(new Column(item))).expr}, $"items".expr, exampleDS.schema("items").dataType.asInstanceOf[ArrayType].elementType )) as "items" ) result.show(10, false) {code} Actual Output: {code:java
[jira] [Updated] (SPARK-29503) MapObjects doesn't copy Unsafe data when nested under Safe data
[ https://issues.apache.org/jira/browse/SPARK-29503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-29503: Description: *strong text*In order for MapObjects to operate safely, it checks to see if the result of the mapping function is an Unsafe type (UnsafeRow, UnsafeArrayData, UnsafeMapData) and performs a copy before writing it into MapObjects' output array. This is to protect against expressions which re-use the same native memory buffer to represent its result across evaluations; if the copy wasn't here, all results would be pointing to the same native buffer and would represent the last result written to the buffer. However, MapObjects misses this needed copy if the Unsafe data is nested below some safe structure, for instance a GenericArrrayData whose elements are all UnsafeRows. In this scenario, all elements of the GenericArrayData will be pointing to the same native UnsafeRow buffer which will hold the last value written to it. Right now, this bug seems to only occur when a `ProjectExec` goes down the `execute` path, as opposed to WholeStageCodegen's `produce` and `consume` path. 
Example Reproduction Code:
{code:scala}
import org.apache.spark.sql.catalyst.expressions.objects.MapObjects
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.functions.{array, struct}
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.ArrayType

// For the purpose of demonstration, we need to disable WholeStage codegen
spark.conf.set("spark.sql.codegen.wholeStage", "false")

val exampleDS = spark.sparkContext.parallelize(Seq(Seq(1, 2, 3))).toDF("items")

// Trivial example: nest an unsafe struct inside a safe array
// items: Seq[Int] => items.map{item => Seq(Struct(item))}
val result = exampleDS.select(
  new Column(MapObjects(
    { item: Expression => array(struct(new Column(item))).expr },
    $"items".expr,
    exampleDS.schema("items").dataType.asInstanceOf[ArrayType].elementType
  )) as "items"
)
result.show(10, false)
{code}
Actual Output:
{code:java}
+---------------------------------------------------------+
|items                                                    |
+---------------------------------------------------------+
|[WrappedArray([3]), WrappedArray([3]), WrappedArray([3])]|
+---------------------------------------------------------+
{code}
Expected Output:
{code:java}
+---------------------------------------------------------+
|items                                                    |
+---------------------------------------------------------+
|[WrappedArray([1]), WrappedArray([2]), WrappedArray([3])]|
+---------------------------------------------------------+
{code}
We've confirmed that the bug exists on version 2.1.1 as well as on master (which I assume corresponds to version 3.0.0?)

was: In order for MapObjects to operate safely, it checks to see if the result of the mapping function is an Unsafe type (UnsafeRow, UnsafeArrayData, UnsafeMapData) and performs a copy before writing it into MapObjects' output array. This is to protect against expressions which re-use the same native memory buffer to represent its result across evaluations; if the copy wasn't here, all results would be pointing to the same native buffer and would represent the last result written to the buffer. However, MapObjects misses this needed copy if the Unsafe data is nested below some safe structure, for instance a GenericArrayData whose elements are all UnsafeRows.
In this scenario, all elements of the GenericArrayData will be pointing to the same native UnsafeRow buffer which will hold the last value written to it. Right now, this bug seems to only occur when a `ProjectExec` goes down the `execute` path, as opposed to WholeStageCodegen's `produce` and `consume` path.
Example Reproduction Code:
{code:scala}
import org.apache.spark.sql.catalyst.expressions.objects.MapObjects
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.functions.{array, struct}
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.ArrayType

// For the purpose of demonstration, we need to disable WholeStage codegen
spark.conf.set("spark.sql.codegen.wholeStage", "false")

val exampleDS = spark.sparkContext.parallelize(Seq(Seq(1, 2, 3))).toDF("items")

// Trivial example: nest an unsafe struct inside a safe array
// items: Seq[Int] => items.map{item => Seq(Struct(item))}
val result = exampleDS.select(
  new Column(MapObjects(
    { item: Expression => array(struct(new Column(item))).expr },
    $"items".expr,
    exampleDS.schema("items").dataType.asInstanceOf[ArrayType].elementType
  )) as "items"
)
result.show(10, false)
{code}
Actual Output:
{code:java}
+---------------------------------------------------------+
|items                                                    |
+---------------------------------------------------------+
|[WrappedArray([3]), WrappedArray([3]), WrappedArray([3])]|
+---------------------------------------------------------+
{code}
Expected Output:
{code:java}
+---------------------------------------------------------+
|items                                                    |
+---------------------------------------------------------+
|[WrappedArray([1]), WrappedArray([2]), WrappedArray([3])]|
+---------------------------------------------------------+
{code}
We've confirmed that the bug exists on version 2.1.1 as well as on master (which I assume corresponds to version 3.0.0?)
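The aliasing problem described above can be illustrated outside of Spark. The sketch below uses a hypothetical `ReusableBuffer` class (not a real Spark type) standing in for an UnsafeRow-style buffer that an expression reuses across evaluations; without a defensive copy per element, every element of the collected result aliases the same buffer and observes the last value written:

```scala
// Hypothetical stand-in for an UnsafeRow-style buffer that an
// expression reuses across evaluations (not a real Spark class).
final class ReusableBuffer {
  var value: Int = 0
  def copy(): ReusableBuffer = {
    val c = new ReusableBuffer
    c.value = value
    c
  }
}

val buf = new ReusableBuffer

// Without copying: every element aliases `buf`, so all elements
// end up showing the last value written (the reported bug).
val aliased = Seq(1, 2, 3).map { i => buf.value = i; buf }

// With a defensive copy per element: each element keeps its own value,
// which is what MapObjects is supposed to guarantee even when the
// Unsafe data is nested under safe data.
val copied = Seq(1, 2, 3).map { i => buf.value = i; buf.copy() }
```

Here `aliased.map(_.value)` yields `Seq(3, 3, 3)` while `copied.map(_.value)` yields `Seq(1, 2, 3)`, mirroring the actual vs. expected output in the report.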
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954879#comment-16954879 ] zhao bo commented on SPARK-29106: - Also, if possible and you have time, could you please help us improve the test script and make it follow a better test process? For example, you mentioned installing test deps before testing some modules. Thanks very much, @shane
> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
> Issue Type: Test
> Components: Tests
> Affects Versions: 3.0.0
> Reporter: huangtianhua
> Priority: Minor
>
> Add ARM test jobs to amplab Jenkins for Spark.
> So far we have set up two periodic ARM test jobs for Spark in OpenLab: one is based on master with Hadoop 2.7 (similar to the QA test on amplab Jenkins), the other is based on a branch we cut on 09-09, see [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] and [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64].
> We only have to care about the first one when integrating the ARM tests with amplab Jenkins.
> As for the k8s test on ARM, we have tried it, see [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it later.
> We also plan to test other stable branches, and we can integrate them into amplab when they are ready.
> We have offered an ARM instance and sent the info to Shane Knapp; thanks Shane for adding the first ARM job to amplab Jenkins :)
> The other important thing is the leveldbjni issue [https://github.com/fusesource/leveldbjni/issues/80]. Spark depends on leveldbjni-all-1.8 [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], which has no arm64 support. So we built an arm64-supporting release of leveldbjni, see [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8]. But we can't modify the Spark pom.xml directly with something like a 'property'/'profile' to choose the correct jar on the ARM or x86 platform, because Spark depends on some Hadoop packages such as hadoop-hdfs that depend on leveldbjni-all-1.8 too, unless Hadoop releases a new ARM-supporting leveldbjni jar. For now we download the leveldbjni-all-1.8 from openlabtesting and 'mvn install' it when testing Spark on ARM.
> PS: The issues found and fixed:
> SPARK-28770 [https://github.com/apache/spark/pull/25673]
> SPARK-28519 [https://github.com/apache/spark/pull/25279]
> SPARK-28433 [https://github.com/apache/spark/pull/25186]
> SPARK-28467 [https://github.com/apache/spark/pull/25864]
> SPARK-29286 [https://github.com/apache/spark/pull/26021]
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29076) Generalize the PVTestSuite to no longer need the minikube tag
[ https://issues.apache.org/jira/browse/SPARK-29076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29076: -- Priority: Major (was: Trivial) > Generalize the PVTestSuite to no longer need the minikube tag > - > > Key: SPARK-29076 > URL: https://issues.apache.org/jira/browse/SPARK-29076 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.0.0 >Reporter: Holden Karau >Priority: Major > > Currently the PVTestSuite has the MiniKube test tag applied so it can be > skipped for non-minikube tests. It should be somewhat easily generalizable to > at least other local k8s test envs, however as written it depends on being > able to mount a local folder as a PV so may take more work to generalize to > arbitrary k8s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954842#comment-16954842 ] zhao bo commented on SPARK-29106: - Thanks @shane. Correct, the test dependencies for PySpark and SparkR were installed when we tested the demo on the VM after your email. For now we can focus on the Maven tests; the PySpark and SparkR tests just show that both of them can succeed on the VM. If you see anything that could be improved, feel free to point it out and we will do our best to address it. And it would be great if the first periodic job could join your Jenkins env soon.
> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
> Issue Type: Test
> Components: Tests
> Affects Versions: 3.0.0
> Reporter: huangtianhua
> Priority: Minor
>
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29514) String function: string_to_array
Kent Yao created SPARK-29514: Summary: String function: string_to_array Key: SPARK-29514 URL: https://issues.apache.org/jira/browse/SPARK-29514 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao
||Function||Return Type||Description||Example||Result||
|{{string_to_array(text, text [, text])}}|{{text[]}}|splits string into array elements using supplied delimiter and optional null string|{{string_to_array('xx~^~yy~^~zz', '~^~', 'yy')}}|{{ {xx,NULL,zz} }}|
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
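The PostgreSQL semantics in the table above can be modeled in a few lines of plain Scala. This is a sketch of the intended behavior, not Spark's or PostgreSQL's implementation: split on the delimiter as a literal (keeping trailing empty fields), and map any field equal to the optional null string to null:

```scala
import java.util.regex.Pattern

// Model of string_to_array(text, text [, text]): split on a literal
// delimiter, keep trailing empty fields (limit -1), and replace fields
// matching the optional null string with null.
def stringToArray(s: String, delimiter: String,
                  nullString: Option[String] = None): Array[String] =
  s.split(Pattern.quote(delimiter), -1).map { field =>
    if (nullString.contains(field)) null else field
  }
```

For the table's example, `stringToArray("xx~^~yy~^~zz", "~^~", Some("yy"))` produces `Array("xx", null, "zz")`, matching `{xx,NULL,zz}`.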
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954809#comment-16954809 ] Shane Knapp commented on SPARK-29106: - we're definitely going to have an issue w/both the R and python tests, as it looks like none of the testing deps have been installed. we use anaconda python to manage our bare metal, so i'll have to see if i can make things work w/virtualenv. R, well, that's always a can of worms best left untouched.
> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
> Issue Type: Test
> Components: Tests
> Affects Versions: 3.0.0
> Reporter: huangtianhua
> Priority: Minor
>
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29507) Support ALTER TABLE SET OWNER command
[ https://issues.apache.org/jira/browse/SPARK-29507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954805#comment-16954805 ] Kent Yao commented on SPARK-29507: -- I am working on it > Support ALTER TABLE SET OWNER command > - > > Key: SPARK-29507 > URL: https://issues.apache.org/jira/browse/SPARK-29507 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Major > > see https://jira.apache.org/jira/browse/HIVE-18762 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27812: -- Fix Version/s: 2.4.5
> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.3, 2.4.4
> Reporter: Henry Yu
> Assignee: Igor Calabria
> Priority: Major
> Fix For: 2.4.5, 3.0.0
>
> I tried spark-submit to k8s in cluster mode. The driver pod failed to exit because of an OkHttp WebSocket non-daemon thread.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28420) Date/Time Functions: date_part for intervals
[ https://issues.apache.org/jira/browse/SPARK-28420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28420. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25981 [https://github.com/apache/spark/pull/25981]
> Date/Time Functions: date_part for intervals
>
> Key: SPARK-28420
> URL: https://issues.apache.org/jira/browse/SPARK-28420
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
> Fix For: 3.0.0
>
> ||Function||Return Type||Description||Example||Result||
> |{{date_part(text, interval)}}|{{double precision}}|Get subfield (equivalent to {{extract}}); see [Section 9.9.1|https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT]|{{date_part('month', interval '2 years 3 months')}}|{{3}}|
> We can replace it with {{extract(field from timestamp)}}.
> https://www.postgresql.org/docs/11/functions-datetime.html
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
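For the year/month component of an interval, PostgreSQL effectively stores one total-month count, so `date_part` on that component reduces to integer division and remainder. The Scala sketch below models that semantics (an assumption about the behavior being ported, not Spark's actual code):

```scala
// Model of date_part on the year/month component of an interval,
// where the interval's years and months are stored as a single
// total-month count (e.g. '2 years 3 months' => 27 months).
def datePartOfInterval(field: String, totalMonths: Int): Int = field match {
  case "year"  => totalMonths / 12  // whole years
  case "month" => totalMonths % 12  // remaining months within the year
  case other   => throw new IllegalArgumentException(s"unsupported field: $other")
}
```

This reproduces the table's example: `date_part('month', interval '2 years 3 months')` is `3`, and `date_part('year', ...)` on the same interval is `2`.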
[jira] [Assigned] (SPARK-28420) Date/Time Functions: date_part for intervals
[ https://issues.apache.org/jira/browse/SPARK-28420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-28420: --- Assignee: Maxim Gekk (was: Yuming Wang)
> Date/Time Functions: date_part for intervals
>
> Key: SPARK-28420
> URL: https://issues.apache.org/jira/browse/SPARK-28420
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yuming Wang
> Assignee: Maxim Gekk
> Priority: Major
> Fix For: 3.0.0
>
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28420) Date/Time Functions: date_part for intervals
[ https://issues.apache.org/jira/browse/SPARK-28420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-28420: --- Assignee: Yuming Wang
> Date/Time Functions: date_part for intervals
>
> Key: SPARK-28420
> URL: https://issues.apache.org/jira/browse/SPARK-28420
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
>
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954705#comment-16954705 ] Shane Knapp commented on SPARK-29106: - re: real time logging -- yeah i noticed that. :) i'll look at that script and play around w/it today.
> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
> Issue Type: Test
> Components: Tests
> Affects Versions: 3.0.0
> Reporter: huangtianhua
> Priority: Minor
>
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29481) all the commands should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954688#comment-16954688 ] L. C. Hsieh commented on SPARK-29481: - Thanks for pinging me. Will spend some time on this over the weekend.
> all the commands should look up catalog/table like v2 commands
> --
>
> Key: SPARK-29481
> URL: https://issues.apache.org/jira/browse/SPARK-29481
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Wenchen Fan
> Priority: Major
>
> The newly added v2 commands support multiple catalogs and respect the current catalog/namespace. However, this is not true for the old v1 commands.
> This leads to very confusing behaviors, for example:
> {code}
> USE my_catalog
> DESC t          // succeeds and describes the table t from my_catalog
> ANALYZE TABLE t // reports table not found, as there is no table t in the session catalog
> {code}
> We should make sure all the commands have the same behavior regarding table resolution.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29513) REFRESH TABLE should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954675#comment-16954675 ] Terry Kim commented on SPARK-29513: --- I am working on this. > REFRESH TABLE should look up catalog/table like v2 commands > --- > > Key: SPARK-29513 > URL: https://issues.apache.org/jira/browse/SPARK-29513 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Priority: Major > > REFRESH TABLE should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29513) REFRESH TABLE should look up catalog/table like v2 commands
Terry Kim created SPARK-29513: - Summary: REFRESH TABLE should look up catalog/table like v2 commands Key: SPARK-29513 URL: https://issues.apache.org/jira/browse/SPARK-29513 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Terry Kim REFRESH TABLE should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29512) REPAIR TABLE should look up catalog/table like v2 commands
Terry Kim created SPARK-29512: - Summary: REPAIR TABLE should look up catalog/table like v2 commands Key: SPARK-29512 URL: https://issues.apache.org/jira/browse/SPARK-29512 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Terry Kim REPAIR TABLE should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29512) REPAIR TABLE should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954672#comment-16954672 ] Terry Kim commented on SPARK-29512: --- I am working on this. > REPAIR TABLE should look up catalog/table like v2 commands > -- > > Key: SPARK-29512 > URL: https://issues.apache.org/jira/browse/SPARK-29512 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Priority: Major > > REPAIR TABLE should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29014) DataSourceV2: Clean up current, default, and session catalog uses
[ https://issues.apache.org/jira/browse/SPARK-29014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29014. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26120 [https://github.com/apache/spark/pull/26120]
> DataSourceV2: Clean up current, default, and session catalog uses
> -
>
> Key: SPARK-29014
> URL: https://issues.apache.org/jira/browse/SPARK-29014
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Ryan Blue
> Assignee: Terry Kim
> Priority: Blocker
> Fix For: 3.0.0
>
> Catalog tracking in DSv2 has evolved since the initial changes went in. We need to make sure that handling is consistent across plans using the latest rules:
> * The _current_ catalog should be used when no catalog is specified
> * The _default_ catalog is the catalog _current_ is initialized to
> * If the _default_ catalog is not set, then it is the built-in Spark session catalog, which will be called `spark_catalog` (This is the v2 session catalog)
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
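The three catalog rules listed in the issue can be captured in a few lines of plain Scala. The class and method names below are illustrative simplifications, not Spark's actual API:

```scala
// Illustrative model of the DSv2 catalog rules (not Spark's real classes).
class SimpleCatalogManager(defaultCatalog: Option[String]) {
  // Rule 3: when no default catalog is set, it is the built-in
  // Spark session catalog, called "spark_catalog".
  val sessionCatalog: String = "spark_catalog"
  // Rule 2: the current catalog is initialized to the default catalog.
  var current: String = defaultCatalog.getOrElse(sessionCatalog)
  // Rule 1: the current catalog is used when no catalog is specified.
  def resolve(specified: Option[String]): String = specified.getOrElse(current)
}
```

For example, with no default configured, an unqualified name resolves against `spark_catalog`; with a default of `my_catalog`, it resolves against `my_catalog`; an explicitly specified catalog always wins.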
[jira] [Assigned] (SPARK-29014) DataSourceV2: Clean up current, default, and session catalog uses
[ https://issues.apache.org/jira/browse/SPARK-29014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29014: --- Assignee: Terry Kim
> DataSourceV2: Clean up current, default, and session catalog uses
> -
>
> Key: SPARK-29014
> URL: https://issues.apache.org/jira/browse/SPARK-29014
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Ryan Blue
> Assignee: Terry Kim
> Priority: Blocker
>
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29511) DataSourceV2: Support CREATE NAMESPACE
Terry Kim created SPARK-29511: - Summary: DataSourceV2: Support CREATE NAMESPACE Key: SPARK-29511 URL: https://issues.apache.org/jira/browse/SPARK-29511 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Terry Kim CREATE NAMESPACE needs to support v2 catalogs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29481) all the commands should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954661#comment-16954661 ] Wenchen Fan commented on SPARK-29481: - I would really appreciate it if more people could help with this ticket. There are still many commands that need to be handled, e.g. REPAIR TABLE, REFRESH TABLE, ADD PARTITION, etc. We can look at `SparkSqlAstBuilder` and pick the commands that need to resolve a table. https://github.com/apache/spark/pull/26129 is a good example of how to add a v2 command:
1. Create a statement plan for the command.
2. Update `SqlBase.g4` and `AstBuilder`: use a multi-part table name for the command and create the statement plan after parsing.
3. Create a logical plan and physical plan for the command, if the command can be implemented via the v2 APIs (e.g. REFRESH TABLE).
4. Update `ResolveCatalogs` to convert the statement plan to the logical plan, if we created such a logical plan in step 3.
5. Update `ResolveSessionCatalog` to convert the statement plan to the old v1 command plan.
6. Add tests in `DDLSuite` and `DataSourceV2SQLSuite`.
Please create a sub-task of this ticket if you want to work on one command. cc [~imback82] [~rdblue] [~dongjoon] [~viirya]
> all the commands should look up catalog/table like v2 commands
> --
>
> Key: SPARK-29481
> URL: https://issues.apache.org/jira/browse/SPARK-29481
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Wenchen Fan
> Priority: Major
>
> This leads to very confusing behaviors, for example > {code} > USE my_catalog > DESC t // success and describe the table t from my_catalog > ANALYZE TABLE t // report table not found as there is no table t in the > session catalog > {code} > We should make sure all the commands have the same behavior regarding table > resolution -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
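The six-step recipe in the comment above can be sketched in miniature. The Scala below is a hypothetical, heavily simplified illustration of steps 1, 3, and 4; the class names (`AnalyzeTableStatement`, `AnalyzeTable`) and the `resolve` rule are stand-ins, not Spark's actual internals. The idea: the parser produces a statement plan carrying a raw multi-part name, and a `ResolveCatalogs`-style rule turns it into a catalog-aware logical plan, which is exactly what makes `USE my_catalog; ANALYZE TABLE t` resolve `t` against `my_catalog` instead of the session catalog.

```scala
// Step 1: statement plan produced by the parser (AstBuilder) - it only
// carries the multi-part name, with no catalog resolution yet.
sealed trait LogicalPlan
case class AnalyzeTableStatement(tableName: Seq[String]) extends LogicalPlan

// Step 3: catalog-aware logical plan for the v2 code path.
case class AnalyzeTable(catalog: String, ident: Seq[String]) extends LogicalPlan

// Step 4: a ResolveCatalogs-style rule. If the first name part matches a
// registered catalog, peel it off; otherwise fall back to the current catalog.
def resolve(plan: LogicalPlan, catalogs: Set[String], current: String): LogicalPlan =
  plan match {
    case AnalyzeTableStatement(parts) if parts.length > 1 && catalogs(parts.head) =>
      AnalyzeTable(parts.head, parts.tail)
    case AnalyzeTableStatement(parts) =>
      AnalyzeTable(current, parts)
    case other => other
  }
```

With this shape, every command that resolves a table goes through the same rule, which is what gives all commands consistent lookup behavior.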
[jira] [Updated] (SPARK-29510) JobGroup ID is not set for jobs submitted from Spark-SQL and Spark-Shell
[ https://issues.apache.org/jira/browse/SPARK-29510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK KUMAR GUPTA updated SPARK-29510: - Attachment: JobGroup2.png JobGroup3.png JobGroup1.png > JobGroup ID is not set for jobs submitted from Spark-SQL and Spark-Shell > > > Key: SPARK-29510 > URL: https://issues.apache.org/jira/browse/SPARK-29510 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL > Affects Versions: 3.0.0 > Reporter: ABHISHEK KUMAR GUPTA > Priority: Major > Attachments: JobGroup1.png, JobGroup2.png, JobGroup3.png > > > When a user submits jobs from spark-shell or Spark SQL, the job group ID is not set (UI screenshots attached), but for jobs submitted from beeline the job group ID is set. > Steps: > create table customer(id int, name String, CName String, address String, city String, pin int, country String); > insert into customer values(1,'Alfred','Maria','Obere Str 57','Berlin',12209,'Germany'); > insert into customer values(2,'Ana','trujilo','Adva de la','Maxico D.F.',05021,'Maxico'); > insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 2312','Maxico D.F.',05023,'Maxico'); -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29510) JobGroup ID is not set for jobs submitted from Spark-SQL and Spark-Shell
[ https://issues.apache.org/jira/browse/SPARK-29510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954645#comment-16954645 ] Ankit Raj Boudh commented on SPARK-29510: - I will start working on this issue. > JobGroup ID is not set for jobs submitted from Spark-SQL and Spark-Shell > > > Key: SPARK-29510 > URL: https://issues.apache.org/jira/browse/SPARK-29510 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL > Affects Versions: 3.0.0 > Reporter: ABHISHEK KUMAR GUPTA > Priority: Major > > When a user submits jobs from spark-shell or Spark SQL, the job group ID is not set (UI screenshots attached), but for jobs submitted from beeline the job group ID is set. > Steps: > create table customer(id int, name String, CName String, address String, city String, pin int, country String); > insert into customer values(1,'Alfred','Maria','Obere Str 57','Berlin',12209,'Germany'); > insert into customer values(2,'Ana','trujilo','Adva de la','Maxico D.F.',05021,'Maxico'); > insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 2312','Maxico D.F.',05023,'Maxico'); -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29510) JobGroup ID is not set for jobs submitted from Spark-SQL and Spark-Shell
ABHISHEK KUMAR GUPTA created SPARK-29510: Summary: JobGroup ID is not set for jobs submitted from Spark-SQL and Spark-Shell Key: SPARK-29510 URL: https://issues.apache.org/jira/browse/SPARK-29510 Project: Spark Issue Type: Bug Components: Spark Shell, SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA When a user submits jobs from spark-shell or Spark SQL, the job group ID is not set (UI screenshots attached), but for jobs submitted from beeline the job group ID is set. Steps: create table customer(id int, name String, CName String, address String, city String, pin int, country String); insert into customer values(1,'Alfred','Maria','Obere Str 57','Berlin',12209,'Germany'); insert into customer values(2,'Ana','trujilo','Adva de la','Maxico D.F.',05021,'Maxico'); insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 2312','Maxico D.F.',05023,'Maxico'); -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
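For context on the mechanism involved: Spark tracks the job group as a thread-local "local property" on the driver (the real keys are `spark.jobGroup.id` and `spark.job.description`, set via `SparkContext.setJobGroup`); the Thrift server path used by beeline sets it, while the shell paths apparently do not. The toy class below (not Spark's `SparkContext`; the class name is hypothetical) sketches that thread-local property mechanism in isolation:

```scala
// Simplified sketch of Spark's thread-local "local properties" used to
// carry the job group. Each thread sees only the properties it set itself,
// which is why a submission path that never calls setJobGroup produces
// jobs with no group ID in the UI.
class LocalProps {
  private val props = new ThreadLocal[Map[String, String]] {
    override def initialValue(): Map[String, String] = Map.empty
  }
  // Mirrors the shape of SparkContext.setJobGroup(groupId, description).
  def setJobGroup(groupId: String, description: String): Unit =
    props.set(props.get + ("spark.jobGroup.id" -> groupId) + ("spark.job.description" -> description))
  def jobGroupId: Option[String] = props.get.get("spark.jobGroup.id")
}
```

Until setJobGroup is called on the submitting thread, `jobGroupId` stays empty, matching the behavior reported for spark-shell and Spark-SQL.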
[jira] [Created] (SPARK-29509) Deduplicate code blocks in Kafka data source
Jungtaek Lim created SPARK-29509: Summary: Deduplicate code blocks in Kafka data source Key: SPARK-29509 URL: https://issues.apache.org/jira/browse/SPARK-29509 Project: Spark Issue Type: Task Components: SQL, Structured Streaming Affects Versions: 3.0.0 Reporter: Jungtaek Lim There are a bunch of methods in the Kafka data source that have repeated lines within a method. In particular, they're tied to the number of fields in the writer schema, so every new field increases the amount of redundant code. This issue tracks the effort to deduplicate them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29508) Implicitly cast strings in datetime arithmetic operations
[ https://issues.apache.org/jira/browse/SPARK-29508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954579#comment-16954579 ] Maxim Gekk commented on SPARK-29508: I am working on it > Implicitly cast strings in datetime arithmetic operations > - > > Key: SPARK-29508 > URL: https://issues.apache.org/jira/browse/SPARK-29508 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Priority: Minor > > To improve Spark SQL UX, strings can be cast to the `INTERVAL` or `TIMESTAMP` > types in the cases: > # Cast string to interval in interval - string > # Cast string to interval in datetime + string or string + datetime > # Cast string to timestamp in datetime - string or string - datetime -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29482) ANALYZE TABLE should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-29482. Resolution: Fixed The issue is resolved in https://github.com/apache/spark/pull/26129 > ANALYZE TABLE should look up catalog/table like v2 commands > > > Key: SPARK-29482 > URL: https://issues.apache.org/jira/browse/SPARK-29482 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29508) Implicitly cast strings in datetime arithmetic operations
Maxim Gekk created SPARK-29508: -- Summary: Implicitly cast strings in datetime arithmetic operations Key: SPARK-29508 URL: https://issues.apache.org/jira/browse/SPARK-29508 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: Maxim Gekk To improve Spark SQL UX, strings can be cast to the `INTERVAL` or `TIMESTAMP` types in the cases: # Cast string to interval in interval - string # Cast string to interval in datetime + string or string + datetime # Cast string to timestamp in datetime - string or string - datetime -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
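The three coercion cases listed for SPARK-29508 can be illustrated outside of Spark. The sketch below is not Spark code: it uses `java.time` and a deliberately tiny interval parser (it only understands "N hours"; Spark's interval grammar is far richer) to show the intended type coercions, where the string operand is cast to whatever type the datetime operation needs:

```scala
import java.time.{Duration, Instant}

// Toy stand-in for casting a string to an interval ("2 hours" -> 2h).
def parseInterval(s: String): Duration = {
  val Array(n, unit) = s.trim.split("\\s+")
  require(unit.startsWith("hour"), s"unsupported unit in this sketch: $unit")
  Duration.ofHours(n.toLong)
}

// Case 2: datetime + string  =>  cast the string to an interval, then add.
def plusInterval(ts: Instant, interval: String): Instant =
  ts.plus(parseInterval(interval))

// Case 3: datetime - string  =>  cast the string to a timestamp, then
// subtract, yielding an interval.
def minusTimestamp(ts: Instant, other: String): Duration =
  Duration.between(Instant.parse(other), ts)
```

The point of the improvement is that users could write `ts + '2 hours'` or `ts - '2019-10-18 00:00:00'` directly, with the cast inserted implicitly by the analyzer.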
[jira] [Resolved] (SPARK-29478) Improve tooltip information for AggregatedLogs Tab
[ https://issues.apache.org/jira/browse/SPARK-29478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK KUMAR GUPTA resolved SPARK-29478. -- Resolution: Invalid It is not supported by Open Source community > Improve tooltip information for AggregatedLogs Tab > -- > > Key: SPARK-29478 > URL: https://issues.apache.org/jira/browse/SPARK-29478 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29489) ml.evaluation support log-loss
[ https://issues.apache.org/jira/browse/SPARK-29489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-29489: Assignee: zhengruifeng > ml.evaluation support log-loss > -- > > Key: SPARK-29489 > URL: https://issues.apache.org/jira/browse/SPARK-29489 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark > Affects Versions: 3.0.0 > Reporter: zhengruifeng > Assignee: zhengruifeng > Priority: Major > > {color:#5a6e5a}log-loss (aka logistic loss or cross-entropy loss) is one of the most widely used metrics in classification tasks. It is already implemented in popular libraries like sklearn.{color} > {color:#5a6e5a}However, it is missing from Spark so far.{color} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29489) ml.evaluation support log-loss
[ https://issues.apache.org/jira/browse/SPARK-29489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-29489. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26135 [https://github.com/apache/spark/pull/26135] > ml.evaluation support log-loss > -- > > Key: SPARK-29489 > URL: https://issues.apache.org/jira/browse/SPARK-29489 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark > Affects Versions: 3.0.0 > Reporter: zhengruifeng > Assignee: zhengruifeng > Priority: Major > Fix For: 3.0.0 > > > {color:#5a6e5a}log-loss (aka logistic loss or cross-entropy loss) is one of the most widely used metrics in classification tasks. It is already implemented in popular libraries like sklearn.{color} > {color:#5a6e5a}However, it is missing from Spark so far.{color} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954447#comment-16954447 ] zhao bo commented on SPARK-29106: - The reason for introducing the shell script is that I found that if we call the ansible script directly, it won't log information in real time. ;) > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests > Affects Versions: 3.0.0 > Reporter: huangtianhua > Priority: Minor > > Add arm test jobs to amplab jenkins for spark. > So far we have made two periodic arm test jobs for spark in OpenLab: one is based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), the other is based on a new branch we made on 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrating the arm test with amplab jenkins. > About the k8s test on arm, we have tested it, see > [https://github.com/theopenlab/spark/pull/17], and maybe we can integrate it later. > We also plan to test other stable branches, and we can integrate them into amplab when they are ready. > We have offered an arm instance and sent the info to shane knapp; thanks shane for adding the first arm job to amplab jenkins :) > The other important thing is about leveldbjni > [https://github.com/fusesource/leveldbjni|https://github.com/fusesource/leveldbjni/issues/80]: > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > and we can see there is no arm64 support. So we built an arm64-supporting release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like 'property'/'profile' to choose the correct jar package on arm or x86 platforms, because spark depends on some hadoop packages like hadoop-hdfs, and those packages depend on leveldbjni-all-1.8 too, unless hadoop releases a new arm-supporting leveldbjni jar. For now we download the leveldbjni-all-1.8 of openlabtesting and 'mvn install' it when arm testing spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954445#comment-16954445 ] zhao bo commented on SPARK-29106: - Hi, I created a pretty simple shell script in /home/jenkins/ansible_test_scripts/, named "sample_shell_test.sh". You can run the script directly after setting a "SPARK_HOME" env. [~shaneknapp], how about we try this script with jenkins? > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests > Affects Versions: 3.0.0 > Reporter: huangtianhua > Priority: Minor > > Add arm test jobs to amplab jenkins for spark. > So far we have made two periodic arm test jobs for spark in OpenLab: one is based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), the other is based on a new branch we made on 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrating the arm test with amplab jenkins. > About the k8s test on arm, we have tested it, see > [https://github.com/theopenlab/spark/pull/17], and maybe we can integrate it later. > We also plan to test other stable branches, and we can integrate them into amplab when they are ready. > We have offered an arm instance and sent the info to shane knapp; thanks shane for adding the first arm job to amplab jenkins :) > The other important thing is about leveldbjni > [https://github.com/fusesource/leveldbjni|https://github.com/fusesource/leveldbjni/issues/80]: > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > and we can see there is no arm64 support. So we built an arm64-supporting release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modify the spark pom.xml directly with something like 'property'/'profile' to choose the correct jar package on arm or x86 platforms, because spark depends on some hadoop packages like hadoop-hdfs, and those packages depend on leveldbjni-all-1.8 too, unless hadoop releases a new arm-supporting leveldbjni jar. For now we download the leveldbjni-all-1.8 of openlabtesting and 'mvn install' it when arm testing spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29262) DataFrameWriter insertIntoPartition function
[ https://issues.apache.org/jira/browse/SPARK-29262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954424#comment-16954424 ] feiwang commented on SPARK-29262: - I'll try to implement it. > DataFrameWriter insertIntoPartition function > > > Key: SPARK-29262 > URL: https://issues.apache.org/jira/browse/SPARK-29262 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 3.0.0 > Reporter: feiwang > Priority: Minor > > insertIntoPartition would be a useful function. > In SQL, the corresponding syntax is: > {code:java} > insert overwrite table tbl_a partition(p1=v1,p2=v2,...,pn=vn) select ... > {code} > In the example above, I specify all the partition key values, so it must be a static partition overwrite, regardless of whether dynamic partition overwrite is enabled. > If we enable dynamic partition overwrite, the SQL below will only overwrite the matching partitions, not the whole table. > If we disable dynamic partition overwrite, it will overwrite the whole table. > {code:java} > insert overwrite table tbl_a partition(p1,p2,...,pn) select ... > {code} > As of now, DataFrame does not support overwriting a specific partition. > This means that, for a partitioned table, if we insert overwrite via DataFrame with dynamic partition overwrite disabled, it will always overwrite the whole table. > So, we should support insertIntoPartition in DataFrameWriter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
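The overwrite semantics described in SPARK-29262 can be captured in a few lines. The sketch below is purely hypothetical (no `insertIntoPartition` exists in Spark, and `PartitionSpec`/`overwriteScope` are invented names for illustration): a partition spec maps each partition column to an optional value, and a fully static spec (every key given a value) pins exactly one partition, making the overwrite scope independent of the dynamic-partition-overwrite setting.

```scala
// Hypothetical model of a partition spec: Some(value) = static key,
// None = dynamic key to be filled from the data.
case class PartitionSpec(spec: Map[String, Option[String]]) {
  def isFullyStatic: Boolean = spec.values.forall(_.isDefined)
}

// Overwrite scope per the issue description:
// - fully static spec: exactly that partition, regardless of the flag
// - dynamic keys + dynamic overwrite enabled: only matched partitions
// - dynamic keys + dynamic overwrite disabled: the whole table
def overwriteScope(spec: PartitionSpec, dynamicOverwrite: Boolean): String =
  if (spec.isFullyStatic) "single-partition"
  else if (dynamicOverwrite) "matched-partitions"
  else "whole-table"
```

The "whole-table" branch is the one the DataFrame API is stuck in today when dynamic partition overwrite is disabled, which is the motivation for the proposal.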
[jira] [Resolved] (SPARK-28120) RocksDB state storage
[ https://issues.apache.org/jira/browse/SPARK-28120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vikram Agrawal resolved SPARK-28120. Resolution: Later The implementation will be submitted to https://spark-packages.org. > RocksDB state storage > - > > Key: SPARK-28120 > URL: https://issues.apache.org/jira/browse/SPARK-28120 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming > Affects Versions: 3.0.0 > Reporter: Vikram Agrawal > Priority: Major > > SPARK-13809 introduced a framework for state management for computing streaming aggregates. The default implementation was an in-memory hashmap that was backed up to an HDFS-compliant file system at the end of every micro-batch. > The current implementation suffers from performance and latency issues. It uses executor JVM memory to store the states, so the state store size is limited by the size of the executor memory. Also, executor JVM memory is shared between state storage and other task operations, so the size of the state storage impacts the performance of task execution. > Moreover, GC pauses, executor failures, and OOM issues are common as the state storage grows, which increases the overall latency of a micro-batch. > RocksDB is an embedded DB that can provide major performance improvements; other major streaming frameworks use RocksDB as their default state storage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28022) k8s pod affinity to achieve cloud native friendly autoscaling
[ https://issues.apache.org/jira/browse/SPARK-28022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954398#comment-16954398 ] Jiaxin Shan commented on SPARK-28022: - I don't quite understand the use case here. It sounds like you want to place your executors as close together as possible. Kubernetes has native support for node affinity and pod affinity; they're a little different, even though both can make your pods sit close together at some level. # Node selector or node affinity: the k8s scheduler puts your application on a subset of the node pool. The problem is that if you have a large pool of certain nodes, it won't bin-pack inside the target node group. In a cloud environment, if you have the autoscaler enabled, it will ensure resources are utilized. # Pod affinity: the k8s scheduler will try to find a qualifying pod and put the following pods on the same node. My question is: can either of these address your use case? > k8s pod affinity to achieve cloud native friendly autoscaling > -- > > Key: SPARK-28022 > URL: https://issues.apache.org/jira/browse/SPARK-28022 > Project: Spark > Issue Type: New Feature > Components: Kubernetes > Affects Versions: 3.0.0 > Reporter: Henry Yu > Priority: Major > > Hi, in order to achieve cloud-native friendly autoscaling, I propose adding a pod affinity feature. > Traditionally, when we use spark on a fixed-size yarn cluster, it makes sense to spread containers across every node. > With cloud-native resource management, we want to release a node when we no longer need it. > The pod affinity feature aims to place all pods of a certain application on some nodes instead of all nodes. > By the way, using a pod template is not a good choice here; adding the application id to the pod affinity term at submit time is more robust. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29295. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25979 [https://github.com/apache/spark/pull/25979] > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> > false.toString) { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(1), Row(2))) > } > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) > { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(2))) > } > } > 
} > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29295: --- Assignee: L. C. Hsieh > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: L. C. Hsieh >Priority: Major > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> > false.toString) { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(1), Row(2))) > } > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) > { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(2))) > } > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29507) Support ALTER TABLE SET OWNER command
Kent Yao created SPARK-29507: Summary: Support ALTER TABLE SET OWNER command Key: SPARK-29507 URL: https://issues.apache.org/jira/browse/SPARK-29507 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao see https://jira.apache.org/jira/browse/HIVE-18762 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29405) Alter table / Insert statements should not change a table's ownership
[ https://issues.apache.org/jira/browse/SPARK-29405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29405: --- Assignee: Kent Yao > Alter table / Insert statements should not change a table's ownership > - > > Key: SPARK-29405 > URL: https://issues.apache.org/jira/browse/SPARK-29405 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.4, 2.4.4 > Reporter: Kent Yao > Assignee: Kent Yao > Priority: Major > > When executing an 'insert into/overwrite ...' DML, or an 'alter table set tblproperties ...' DDL, spark would change the ownership of the table to the one who runs the spark application. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29405) Alter table / Insert statements should not change a table's ownership
[ https://issues.apache.org/jira/browse/SPARK-29405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29405. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26068 [https://github.com/apache/spark/pull/26068] > Alter table / Insert statements should not change a table's ownership > - > > Key: SPARK-29405 > URL: https://issues.apache.org/jira/browse/SPARK-29405 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.4, 2.4.4 > Reporter: Kent Yao > Assignee: Kent Yao > Priority: Major > Fix For: 3.0.0 > > > When executing an 'insert into/overwrite ...' DML, or an 'alter table set tblproperties ...' DDL, spark would change the ownership of the table to the one who runs the spark application. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29444) Add configuration to support JacksonGenerator to keep fields with null values
[ https://issues.apache.org/jira/browse/SPARK-29444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29444: --- Assignee: Jackey Lee > Add configuration to support JacksonGenerator to keep fields with null values > > > Key: SPARK-29444 > URL: https://issues.apache.org/jira/browse/SPARK-29444 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Jackey Lee > Assignee: Jackey Lee > Priority: Major > Fix For: 3.0.0 > > > Dataset.toJSON will lose some columns when field data is null, and it may be better to keep null data in some scenarios. For example, sparkmagic, which is widely used in jupyter with livy, uses toJSON to get SQL results; with the current behavior, sparkmagic may return empty results, which confuses users. > Adding a config may be the best choice: by default it retains the current semantics, and fields with null values are kept only when the configuration is changed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29444) Add configuration to support JacksonGenerator to keep fields with null values
[ https://issues.apache.org/jira/browse/SPARK-29444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29444. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26098 [https://github.com/apache/spark/pull/26098] > Add configuration to support JacksonGenerator to keep fields with null values > > > Key: SPARK-29444 > URL: https://issues.apache.org/jira/browse/SPARK-29444 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Jackey Lee > Priority: Major > Fix For: 3.0.0 > > > Dataset.toJSON will lose some columns when field data is null, and it may be better to keep null data in some scenarios. For example, sparkmagic, which is widely used in jupyter with livy, uses toJSON to get SQL results; with the current behavior, sparkmagic may return empty results, which confuses users. > Adding a config may be the best choice: by default it retains the current semantics, and fields with null values are kept only when the configuration is changed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
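The behavioral difference behind SPARK-29444 is easy to show with a toy. The function below is not Spark's JacksonGenerator, just an illustration of the two modes the config would switch between: with `keepNulls = false` (today's behavior) null fields are dropped from the JSON output, and with `keepNulls = true` they are emitted as explicit `null` values.

```scala
// Toy JSON emitter over (column, optional value) pairs. A Seq of pairs is
// used (rather than a Map) so field order is deterministic.
def toJson(row: Seq[(String, Option[Any])], keepNulls: Boolean): String = {
  val fields = row.flatMap {
    case (k, Some(s: String))   => Some(s"\"$k\":\"$s\"")  // quote string values
    case (k, Some(v))           => Some(s"\"$k\":$v")      // numbers etc. as-is
    case (k, None) if keepNulls => Some(s"\"$k\":null")    // proposed opt-in behavior
    case _                      => None                    // current behavior: drop
  }
  fields.mkString("{", ",", "}")
}
```

A row whose fields are all null serializes to `{}` in the first mode, which is why a consumer like sparkmagic can see "empty" results.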
[jira] [Resolved] (SPARK-29092) EXPLAIN FORMATTED does not work well with DPP
[ https://issues.apache.org/jira/browse/SPARK-29092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29092. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26042 [https://github.com/apache/spark/pull/26042] > EXPLAIN FORMATTED does not work well with DPP > - > > Key: SPARK-29092 > URL: https://issues.apache.org/jira/browse/SPARK-29092 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Dilip Biswal >Priority: Major > Fix For: 3.0.0 > > > > {code:java} > withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true", > SQLConf.DYNAMIC_PARTITION_PRUNING_REUSE_BROADCAST.key -> "false") { > withTable("df1", "df2") { > spark.range(1000) > .select(col("id"), col("id").as("k")) > .write > .partitionBy("k") > .format(tableFormat) > .mode("overwrite") > .saveAsTable("df1") > spark.range(100) > .select(col("id"), col("id").as("k")) > .write > .partitionBy("k") > .format(tableFormat) > .mode("overwrite") > .saveAsTable("df2") > sql("EXPLAIN FORMATTED SELECT df1.id, df2.k FROM df1 JOIN df2 ON df1.k = > df2.k AND df2.id < 2") > .show(false) > sql("EXPLAIN EXTENDED SELECT df1.id, df2.k FROM df1 JOIN df2 ON df1.k = > df2.k AND df2.id < 2") > .show(false) > } > } > {code} > The output of EXPLAIN EXTENDED is expected. 
> {code:java} > == Physical Plan == > *(2) Project [id#2721L, k#2724L] > +- *(2) BroadcastHashJoin [k#2722L], [k#2724L], Inner, BuildRight >:- *(2) ColumnarToRow >: +- FileScan parquet default.df1[id#2721L,k#2722L] Batched: true, > DataFilters: [], Format: Parquet, Location: > PrunedInMemoryFileIndex[file:/Users/lixiao/IdeaProjects/spark/sql/core/spark-warehouse/org.apache..., > PartitionFilters: [isnotnull(k#2722L), dynamicpruningexpression(k#2722L IN > subquery2741)], PushedFilters: [], ReadSchema: struct >:+- Subquery subquery2741, [id=#358] >: +- *(2) HashAggregate(keys=[k#2724L], functions=[], > output=[k#2724L#2740L]) >: +- Exchange hashpartitioning(k#2724L, 5), true, [id=#354] >: +- *(1) HashAggregate(keys=[k#2724L], functions=[], > output=[k#2724L]) >:+- *(1) Project [k#2724L] >: +- *(1) Filter (isnotnull(id#2723L) AND (id#2723L > < 2)) >: +- *(1) ColumnarToRow >: +- FileScan parquet > default.df2[id#2723L,k#2724L] Batched: true, DataFilters: > [isnotnull(id#2723L), (id#2723L < 2)], Format: Parquet, Location: > PrunedInMemoryFileIndex[file:/Users/lixiao/IdeaProjects/spark/sql/core/spark-warehouse/org.apache..., > PartitionFilters: [isnotnull(k#2724L)], PushedFilters: [IsNotNull(id), > LessThan(id,2)], ReadSchema: struct >+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > true])), [id=#379] > +- *(1) Project [k#2724L] > +- *(1) Filter (isnotnull(id#2723L) AND (id#2723L < 2)) > +- *(1) ColumnarToRow >+- FileScan parquet default.df2[id#2723L,k#2724L] Batched: > true, DataFilters: [isnotnull(id#2723L), (id#2723L < 2)], Format: Parquet, > Location: > PrunedInMemoryFileIndex[file:/Users/lixiao/IdeaProjects/spark/sql/core/spark-warehouse/org.apache..., > PartitionFilters: [isnotnull(k#2724L)], PushedFilters: [IsNotNull(id), > LessThan(id,2)], ReadSchema: struct > {code} > However, the output of FileScan node of EXPLAIN FORMATTED does not show the > effect of DPP > {code:java} > * Project (9) > +- * BroadcastHashJoin Inner BuildRight (8) >:- * 
ColumnarToRow (2) >: +- Scan parquet default.df1 (1) >+- BroadcastExchange (7) > +- * Project (6) > +- * Filter (5) > +- * ColumnarToRow (4) >+- Scan parquet default.df2 (3) > (1) Scan parquet default.df1 > Output: [id#2716L, k#2717L] > {code}
[jira] [Assigned] (SPARK-29092) EXPLAIN FORMATTED does not work well with DPP
[ https://issues.apache.org/jira/browse/SPARK-29092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29092: --- Assignee: Dilip Biswal > EXPLAIN FORMATTED does not work well with DPP > - > > Key: SPARK-29092 > URL: https://issues.apache.org/jira/browse/SPARK-29092 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Dilip Biswal >Priority: Major > > > {code:java} > withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true", > SQLConf.DYNAMIC_PARTITION_PRUNING_REUSE_BROADCAST.key -> "false") { > withTable("df1", "df2") { > spark.range(1000) > .select(col("id"), col("id").as("k")) > .write > .partitionBy("k") > .format(tableFormat) > .mode("overwrite") > .saveAsTable("df1") > spark.range(100) > .select(col("id"), col("id").as("k")) > .write > .partitionBy("k") > .format(tableFormat) > .mode("overwrite") > .saveAsTable("df2") > sql("EXPLAIN FORMATTED SELECT df1.id, df2.k FROM df1 JOIN df2 ON df1.k = > df2.k AND df2.id < 2") > .show(false) > sql("EXPLAIN EXTENDED SELECT df1.id, df2.k FROM df1 JOIN df2 ON df1.k = > df2.k AND df2.id < 2") > .show(false) > } > } > {code} > The output of EXPLAIN EXTENDED is expected. 
> {code:java} > == Physical Plan == > *(2) Project [id#2721L, k#2724L] > +- *(2) BroadcastHashJoin [k#2722L], [k#2724L], Inner, BuildRight >:- *(2) ColumnarToRow >: +- FileScan parquet default.df1[id#2721L,k#2722L] Batched: true, > DataFilters: [], Format: Parquet, Location: > PrunedInMemoryFileIndex[file:/Users/lixiao/IdeaProjects/spark/sql/core/spark-warehouse/org.apache..., > PartitionFilters: [isnotnull(k#2722L), dynamicpruningexpression(k#2722L IN > subquery2741)], PushedFilters: [], ReadSchema: struct >:+- Subquery subquery2741, [id=#358] >: +- *(2) HashAggregate(keys=[k#2724L], functions=[], > output=[k#2724L#2740L]) >: +- Exchange hashpartitioning(k#2724L, 5), true, [id=#354] >: +- *(1) HashAggregate(keys=[k#2724L], functions=[], > output=[k#2724L]) >:+- *(1) Project [k#2724L] >: +- *(1) Filter (isnotnull(id#2723L) AND (id#2723L > < 2)) >: +- *(1) ColumnarToRow >: +- FileScan parquet > default.df2[id#2723L,k#2724L] Batched: true, DataFilters: > [isnotnull(id#2723L), (id#2723L < 2)], Format: Parquet, Location: > PrunedInMemoryFileIndex[file:/Users/lixiao/IdeaProjects/spark/sql/core/spark-warehouse/org.apache..., > PartitionFilters: [isnotnull(k#2724L)], PushedFilters: [IsNotNull(id), > LessThan(id,2)], ReadSchema: struct >+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > true])), [id=#379] > +- *(1) Project [k#2724L] > +- *(1) Filter (isnotnull(id#2723L) AND (id#2723L < 2)) > +- *(1) ColumnarToRow >+- FileScan parquet default.df2[id#2723L,k#2724L] Batched: > true, DataFilters: [isnotnull(id#2723L), (id#2723L < 2)], Format: Parquet, > Location: > PrunedInMemoryFileIndex[file:/Users/lixiao/IdeaProjects/spark/sql/core/spark-warehouse/org.apache..., > PartitionFilters: [isnotnull(k#2724L)], PushedFilters: [IsNotNull(id), > LessThan(id,2)], ReadSchema: struct > {code} > However, the output of FileScan node of EXPLAIN FORMATTED does not show the > effect of DPP > {code:java} > * Project (9) > +- * BroadcastHashJoin Inner BuildRight (8) >:- * 
ColumnarToRow (2) >: +- Scan parquet default.df1 (1) >+- BroadcastExchange (7) > +- * Project (6) > +- * Filter (5) > +- * ColumnarToRow (4) >+- Scan parquet default.df2 (3) > (1) Scan parquet default.df1 > Output: [id#2716L, k#2717L] > {code}
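The structure of EXPLAIN FORMATTED output, and why the bug matters, can be sketched in miniature: the command prints a compact tree where each operator gets a number, followed by a detail section per numbered operator, and the complaint is that the FileScan detail entry omits the dynamic-pruning PartitionFilters that EXPLAIN EXTENDED shows. A toy printer under assumed names (hypothetical `Node`/`explain_formatted`, not Spark's code):

```python
# Minimal sketch of EXPLAIN FORMATTED-style output: a numbered operator tree
# plus a per-operator detail section. Classes and names are hypothetical.
class Node:
    def __init__(self, name, children=(), details=None):
        self.name, self.children, self.details = name, list(children), details or {}

def explain_formatted(root):
    ids, order = {}, []
    def number(n):                       # assign ids bottom-up, leaves first
        for c in n.children:
            number(c)
        ids[n] = len(ids) + 1
        order.append(n)
    number(root)
    def tree(n, indent=0):               # compact header tree: "Name (id)"
        lines = [" " * indent + f"{n.name} ({ids[n]})"]
        for c in n.children:
            lines += tree(c, indent + 3)
        return lines
    # Detail section: this is where the DPP PartitionFilters must appear
    # for the output to be useful, which is the gap this issue reports.
    details = [f"({ids[n]}) {n.name}: {n.details}" for n in order if n.details]
    return "\n".join(tree(root) + details)

scan = Node("Scan parquet default.df1",
            details={"PartitionFilters": "dynamicpruningexpression(k IN subquery)"})
print(explain_formatted(Node("Project", [scan])))
```

In the fixed behaviour, the `(1) Scan parquet default.df1` detail block would carry the PartitionFilters line, mirroring what EXPLAIN EXTENDED already prints inline.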
[jira] [Commented] (SPARK-29505) desc extended is case sensitive
[ https://issues.apache.org/jira/browse/SPARK-29505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954330#comment-16954330 ] Shivu Sondur commented on SPARK-29505: -- I am checking this issue > desc extended is case sensitive > -- > > Key: SPARK-29505 > URL: https://issues.apache.org/jira/browse/SPARK-29505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > create table customer(id int, name String, *CName String*, address String, > city String, pin int, country String); > insert into customer values(1,'Alfred','Maria','Obere Str > 57','Berlin',12209,'Germany'); > insert into customer values(2,'Ana','trujilo','Adva de la','Maxico > D.F.',05021,'Maxico'); > insert into customer values(3,'Antonio','Antonio Moreno','Mataderos > 2312','Maxico D.F.',05023,'Maxico'); > analyze table customer compute statistics for columns cname; – *Success( > Though cname is not as CName)* > desc extended customer cname; – Failed > jdbc:hive2://10.18.19.208:23040/default> desc extended customer *cname;* > +-+-+ > | info_name | info_value | > +-+-+ > | col_name | cname | > | data_type | string | > | comment | NULL | > | min | NULL | > | max | NULL | > | num_nulls | NULL | > | distinct_count | NULL | > | avg_col_len | NULL | > | max_col_len | NULL | > | histogram | NULL | > +-+-- > > But > desc extended customer CName; – SUCCESS > 0: jdbc:hive2://10.18.19.208:23040/default> desc extended customer *CName;* > +-+-+ > | info_name | info_value | > +-+-+ > | col_name | CName | > | data_type | string | > | comment | NULL | > | min | NULL | > | max | NULL | > | num_nulls | 0 | > | distinct_count | 3 | > | avg_col_len | 9 | > | max_col_len | 14 | > | histogram | NULL | > +-+-+ > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
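The expected behaviour here is that column lookup honours case-insensitive resolution (Spark's `spark.sql.caseSensitive` defaults to false), so `cname` and `CName` should resolve to the same column in both ANALYZE TABLE and `desc extended`. A minimal sketch of such resolution (the `resolve_column` helper is hypothetical, not Spark's analyzer):

```python
# Sketch of case-insensitive column resolution, the behaviour
# `desc extended <table> <column>` should arguably share with ANALYZE TABLE.
# `resolve_column` is a hypothetical helper, not Spark code.
def resolve_column(schema, name, case_sensitive=False):
    """Return the canonical column name from `schema` matching `name`, or None."""
    for col in schema:
        if col == name or (not case_sensitive and col.lower() == name.lower()):
            return col
    return None

schema = ["id", "name", "CName", "address", "city", "pin", "country"]
print(resolve_column(schema, "cname"))                       # CName
print(resolve_column(schema, "cname", case_sensitive=True))  # None
```

Note the lookup returns the *canonical* name from the schema, so downstream code (e.g. the column-statistics store) keys on one spelling regardless of how the user typed it.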
[jira] [Created] (SPARK-29506) Use dynamicPartitionOverwrite in FileCommitProtocol when insert into hive table
L. C. Hsieh created SPARK-29506: --- Summary: Use dynamicPartitionOverwrite in FileCommitProtocol when insert into hive table Key: SPARK-29506 URL: https://issues.apache.org/jira/browse/SPARK-29506 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh When doing an insert overwrite into a Hive table, enable dynamicPartitionOverwrite when initializing FileCommitProtocol. HadoopMapReduceCommitProtocol uses FileOutputCommitter to commit job output files. FileOutputCommitter recursively calls FileSystem.listStatus on the partition directories and commits the job output leaf files one by one, which is inefficient when dynamically overwriting many partitions and files. With dynamicPartitionOverwrite enabled, HadoopMapReduceCommitProtocol instead writes to a staging directory and commits whole partition directories rather than leaf files.
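The efficiency argument can be sketched as a toy model: committing per leaf file costs one operation per file (after a recursive listing), while the dynamic-partition-overwrite path costs one directory rename per touched partition. Paths and helper names below are hypothetical; the real logic lives in HadoopMapReduceCommitProtocol and FileOutputCommitter:

```python
# Toy model of the two commit strategies contrasted in SPARK-29506.
# Hypothetical paths/helpers; real code goes through FileOutputCommitter.
def commit_per_leaf_file(staged):
    """Old path: visit every staged leaf file and move each one individually."""
    return [f"rename {f} -> final/{part}/{f.split('/')[-1]}"
            for part, files in sorted(staged.items()) for f in files]

def commit_per_partition_dir(staged):
    """dynamicPartitionOverwrite path: one rename per dynamic partition dir."""
    return [f"rename staging/{part} -> final/{part}" for part in sorted(staged)]

staged = {"k=1": ["staging/k=1/part-0", "staging/k=1/part-1"],
          "k=2": ["staging/k=2/part-0"]}
print(len(commit_per_leaf_file(staged)))      # 3 operations, one per file
print(len(commit_per_partition_dir(staged)))  # 2 operations, one per partition
```

With many partitions each holding many files, the per-partition commit replaces O(files) renames (plus the recursive listing needed to find them) with O(partitions) directory renames.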
[jira] [Created] (SPARK-29505) desc extended is case sensitive
ABHISHEK KUMAR GUPTA created SPARK-29505: Summary: desc extended is case sensitive Key: SPARK-29505 URL: https://issues.apache.org/jira/browse/SPARK-29505 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA create table customer(id int, name String, *CName String*, address String, city String, pin int, country String); insert into customer values(1,'Alfred','Maria','Obere Str 57','Berlin',12209,'Germany'); insert into customer values(2,'Ana','trujilo','Adva de la','Maxico D.F.',05021,'Maxico'); insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 2312','Maxico D.F.',05023,'Maxico'); analyze table customer compute statistics for columns cname; – *Success( Though cname is not as CName)* desc extended customer cname; – Failed jdbc:hive2://10.18.19.208:23040/default> desc extended customer *cname;* +-+-+ | info_name | info_value | +-+-+ | col_name | cname | | data_type | string | | comment | NULL | | min | NULL | | max | NULL | | num_nulls | NULL | | distinct_count | NULL | | avg_col_len | NULL | | max_col_len | NULL | | histogram | NULL | +-+-- But desc extended customer CName; – SUCCESS 0: jdbc:hive2://10.18.19.208:23040/default> desc extended customer *CName;* +-+-+ | info_name | info_value | +-+-+ | col_name | CName | | data_type | string | | comment | NULL | | min | NULL | | max | NULL | | num_nulls | 0 | | distinct_count | 3 | | avg_col_len | 9 | | max_col_len | 14 | | histogram | NULL | +-+-+ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org