[jira] [Created] (SPARK-31050) Disable flaky KafkaDelegationTokenSuite
wuyi created SPARK-31050: Summary: Disable flaky KafkaDelegationTokenSuite Key: SPARK-31050 URL: https://issues.apache.org/jira/browse/SPARK-31050 Project: Spark Issue Type: Bug Components: SQL, Structured Streaming Affects Versions: 3.0.0 Environment: Disable flaky KafkaDelegationTokenSuite since it's too flaky. Reporter: wuyi -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30886) Deprecate LTRIM, RTRIM, and two-parameter TRIM functions
[ https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30886: -- Target Version/s: 3.0.0 > Deprecate LTRIM, RTRIM, and two-parameter TRIM functions > > > Key: SPARK-30886 > URL: https://issues.apache.org/jira/browse/SPARK-30886 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > Apache Spark community decided to keep the existing esoteric two-parameter > use cases with a proper warning. This JIRA aims to show warning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30886) Deprecate LTRIM, RTRIM, and two-parameter TRIM functions
[ https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30886: -- Issue Type: Bug (was: Task) > Deprecate LTRIM, RTRIM, and two-parameter TRIM functions > > > Key: SPARK-30886 > URL: https://issues.apache.org/jira/browse/SPARK-30886 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > Apache Spark community decided to keep the existing esoteric two-parameter > use cases with a proper warning. This JIRA aims to show warning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30886) Deprecate LTRIM, RTRIM, and two-parameter TRIM functions
[ https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30886: -- Summary: Deprecate LTRIM, RTRIM, and two-parameter TRIM functions (was: Warn two-parameter TRIM/LTRIM/RTRIM functions) > Deprecate LTRIM, RTRIM, and two-parameter TRIM functions > > > Key: SPARK-30886 > URL: https://issues.apache.org/jira/browse/SPARK-30886 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > Apache Spark community decided to keep the existing esoteric two-parameter > use cases with a proper warning. This JIRA aims to show warning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
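For readers unfamiliar with the forms being deprecated: the two-parameter variants trim a whole *set* of characters from one or both ends of a string, not a substring. A minimal plain-Python sketch of that character-set semantics (stdlib only, not Spark code; `trim_both`/`trim_left`/`trim_right` are illustrative names, and nothing here asserts Spark's actual argument order, which is part of what made these forms esoteric):

```python
# Plain-Python sketch of SQL TRIM(BOTH trim_chars FROM src) semantics:
# every leading/trailing character that appears in trim_chars is removed.
# These helper names are illustrative, not Spark APIs.
def trim_both(src, trim_chars):
    return src.strip(trim_chars)

def trim_left(src, trim_chars):
    return src.lstrip(trim_chars)

def trim_right(src, trim_chars):
    return src.rstrip(trim_chars)

trim_both("xxSpark SQLxx", "x")   # 'Spark SQL'
trim_left("xySpark SQL", "yx")    # 'Spark SQL' (order within the set is irrelevant)
```

Because the characters are treated as a set rather than a prefix/suffix string, results can surprise users, which is consistent with the decision above to keep the behavior but emit a warning.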
[jira] [Updated] (SPARK-30541) Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite
[ https://issues.apache.org/jira/browse/SPARK-30541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-30541: - Priority: Blocker (was: Major) > Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite > --- > > Key: SPARK-30541 > URL: https://issues.apache.org/jira/browse/SPARK-30541 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Blocker > > The test suite has been failing intermittently as of now: > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116862/testReport/] > > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.(It is not a test it > is a sbt.testing.SuiteSelector) > > {noformat} > Error Details > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 3939 times over > 1.000122353532 minutes. Last failure message: KeeperErrorCode = > AuthFailed for /brokers/ids. > Stack Trace > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 3939 times over > 1.000122353532 minutes. Last failure message: KeeperErrorCode = > AuthFailed for /brokers/ids. 
> at > org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432) > at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439) > at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391) > at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479) > at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:337) > at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:336) > at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:292) > at > org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:58) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: sbt.ForkMain$ForkError: > org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = > AuthFailed for /brokers/ids > at org.apache.zookeeper.KeeperException.create(KeeperException.java:130) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) > at > kafka.zookeeper.AsyncResponse.resultException(ZooKeeperClient.scala:554) > at kafka.zk.KafkaZkClient.getChildren(KafkaZkClient.scala:719) 
> at kafka.zk.KafkaZkClient.getSortedBrokerList(KafkaZkClient.scala:455) > at > kafka.zk.KafkaZkClient.getAllBrokersInCluster(KafkaZkClient.scala:404) > at > org.apache.spark.sql.kafka010.KafkaTestUtils.$anonfun$setup$3(KafkaTestUtils.scala:293) > at > org.scalatest.concurrent.Eventually.makeAValiantAttempt$1(Eventually.scala:395) > at > org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:409) > ... 20 more > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31049) Support nested adjacent generators, e.g., explode(explode(v))
Takeshi Yamamuro created SPARK-31049: Summary: Support nested adjacent generators, e.g., explode(explode(v)) Key: SPARK-31049 URL: https://issues.apache.org/jira/browse/SPARK-31049 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0, 2.4.6 Reporter: Takeshi Yamamuro In the master, we currently don't support any nested generators, but I think supporting limited nested cases is somewhat useful for users, e.g., explode(explode(v)). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
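The requested behavior can be sketched outside Spark: explode() turns one row per array into one row per element, so applying it twice would flatten an array<array<T>> column down to T. A stdlib Python sketch under that assumption (not Spark's implementation):

```python
# Stdlib sketch of explode() generator semantics (not Spark's implementation):
# one input row per array in, one output row per element out.
def explode(rows):
    return [elem for arr in rows for elem in arr]

nested = [[[1, 2], [3]], [[4]]]   # a column of array<array<int>> values
once = explode(nested)             # [[1, 2], [3], [4]]   i.e. explode(v)
twice = explode(once)              # [1, 2, 3, 4]         i.e. explode(explode(v))
```

This is the "limited nested case" the ticket describes: each inner generator's output feeds directly into the outer one.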
[jira] [Created] (SPARK-31048) alter hive column datatype is not supported
Sunil Aryal created SPARK-31048: --- Summary: alter hive column datatype is not supported Key: SPARK-31048 URL: https://issues.apache.org/jira/browse/SPARK-31048 Project: Spark Issue Type: Bug Components: Spark Shell, SQL Affects Versions: 2.2.2 Environment: spark sql with hive metadata store. Reporter: Sunil Aryal

describe tb2;
Getting log thread is interrupted, since query is done!
+--------------------------+------------+----------+
|         col_name         | data_type  | comment  |
+--------------------------+------------+----------+
| fn                       | int        | NULL     |
| ln                       | string     | NULL     |
| age                      | int        | NULL     |
| # Partition Information  |            |          |
| # col_name               | data_type  | comment  |
| age                      | int        | NULL     |
+--------------------------+------------+----------+
6 rows selected (0.213 seconds)

alter table tb2 change fn fn bigint;
Getting log thread is interrupted, since query is done!
Error: org.apache.spark.sql.AnalysisException: ALTER TABLE CHANGE COLUMN is not supported for changing column 'fn' with type 'IntegerType' to 'fn' with type 'LongType'; (state=,code=0)
java.sql.SQLException: org.apache.spark.sql.AnalysisException: ALTER TABLE CHANGE COLUMN is not supported for changing column 'fn' with type 'IntegerType' to 'fn' with type 'LongType';
 at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296)
 at org.apache.hive.beeline.Commands.execute(Commands.java:848)
 at org.apache.hive.beeline.Commands.sql(Commands.java:713)
 at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
 at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
 at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
 at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
 at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
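For context on the error above: Spark 2.x rejects any type change in ALTER TABLE CHANGE COLUMN, even the lossless int to bigint widening attempted here. Engines that do allow such changes typically restrict them to widenings that cannot lose data. A hypothetical sketch of that kind of check (the table below is illustrative, in the spirit of a safe-upcast rule; it is not Spark's actual behavior, which rejects all changes):

```python
# Hypothetical lossless-widening table; NOT Spark's actual rule
# (Spark 2.x rejects any type change in ALTER TABLE CHANGE COLUMN).
SAFE_WIDENINGS = {
    "tinyint":  {"smallint", "int", "bigint", "double", "decimal"},
    "smallint": {"int", "bigint", "double", "decimal"},
    "int":      {"bigint", "double", "decimal"},
    "bigint":   {"decimal"},
    "float":    {"double"},
}

def is_safe_widening(from_type, to_type):
    # True only when every value of from_type is representable in to_type.
    return to_type in SAFE_WIDENINGS.get(from_type, set())

is_safe_widening("int", "bigint")   # True: every int fits in a bigint
is_safe_widening("bigint", "int")   # False: would truncate
```

In practice, the reported int-to-bigint case would be "safe" under such a rule, which is why it reads as surprising that Spark rejects it.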
[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count
[ https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051844#comment-17051844 ] Hyukjin Kwon commented on SPARK-29058: -- Yeah, so there's kind of tradeoff. If we should have the same results, we should parse and convert everything always which is pretty costly. Workaround itself seems simple enough though. > Reading csv file with DROPMALFORMED showing incorrect record count > -- > > Key: SPARK-29058 > URL: https://issues.apache.org/jira/browse/SPARK-29058 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Suchintak Patnaik >Priority: Minor > > The spark sql csv reader is dropping malformed records as expected, but the > record count is showing as incorrect. > Consider this file (fruit.csv) > {code} > apple,red,1,3 > banana,yellow,2,4.56 > orange,orange,3,5 > {code} > Defining schema as follows: > {code} > schema = "Fruit string,color string,price int,quantity int" > {code} > Notice that the "quantity" field is defined as integer type, but the 2nd row > in the file contains a floating point value, hence it is a corrupt record. > {code} > >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema) > >>> df.show() > +--+--+-++ > | Fruit| color|price|quantity| > +--+--+-++ > | apple| red|1| 3| > |orange|orange|3| 5| > +--+--+-++ > >>> df.count() > 3 > {code} > Malformed record is getting dropped as expected, but incorrect record count > is getting displayed. > Here the df.count() should give value as 2 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count
[ https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051839#comment-17051839 ] Suchintak Patnaik commented on SPARK-29058: --- [~hyukjin.kwon] I agree with you on this. However, the DataFrame is created without the second, malformed row, as can be seen from df.show():
{code}
>>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
>>> df.show()
+------+------+-----+--------+
| Fruit| color|price|quantity|
+------+------+-----+--------+
| apple|   red|    1|       3|
|orange|orange|    3|       5|
+------+------+-----+--------+
{code}
So, ideally, it should return the correct row count as well. What do you say?
[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count
[ https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051827#comment-17051827 ] Hyukjin Kwon commented on SPARK-29058: -- Yeah, so essentially it doesn't need to parse and convert anything. That's why it doesn't treat the second row as malformed.
[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count
[ https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051820#comment-17051820 ] Suchintak Patnaik commented on SPARK-29058: --- [~hyukjin.kwon] How does column pruning work here? count() does not need any columns to perform the count; it just returns the row count.
[jira] [Resolved] (SPARK-30913) Add version information to the configuration of Tests.scala
[ https://issues.apache.org/jira/browse/SPARK-30913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30913. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27783 [https://github.com/apache/spark/pull/27783] > Add version information to the configuration of Tests.scala > --- > > Key: SPARK-30913 > URL: https://issues.apache.org/jira/browse/SPARK-30913 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > core/src/main/scala/org/apache/spark/internal/config/Tests.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30889) Add version information to the configuration of Worker
[ https://issues.apache.org/jira/browse/SPARK-30889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30889. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27783 [https://github.com/apache/spark/pull/27783] > Add version information to the configuration of Worker > -- > > Key: SPARK-30889 > URL: https://issues.apache.org/jira/browse/SPARK-30889 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > core/src/main/scala/org/apache/spark/internal/config/Worker.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31047) Improve file listing for ViewFileSystem
[ https://issues.apache.org/jira/browse/SPARK-31047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-31047: --- Component/s: (was: Input/Output) SQL > Improve file listing for ViewFileSystem > --- > > Key: SPARK-31047 > URL: https://issues.apache.org/jira/browse/SPARK-31047 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Manu Zhang >Priority: Minor > > https://issues.apache.org/jira/browse/SPARK-27801 has improved file listing > for DistributedFileSystem, where {{InMemoryFileIndex.listLeafFiles}} makes > use of DistributedFileSystem's one single {{listLocatedStatus}} to namenode. > This ticket intends to improve the case where ViewFileSystem is used to > manage multiple DistributedFileSystems. It has also overridden the > {{listLocatedStatus}} method by delegating to the filesystem it resolves to, > e.g. DistributedFileSystem. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30980) Issue not resolved of Caught Hive MetaException attempting to get partition metadata by filter from Hive
[ https://issues.apache.org/jira/browse/SPARK-30980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051767#comment-17051767 ] Hyukjin Kwon commented on SPARK-30980: -- [~coderbond007], can you provide reproducible example? are you able to reproduce in your local too? > Issue not resolved of Caught Hive MetaException attempting to get partition > metadata by filter from Hive > > > Key: SPARK-30980 > URL: https://issues.apache.org/jira/browse/SPARK-30980 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.2 > Environment: 2.4.0-CDH6.3.1 (which I guess points to Spark Version > 2.4.2) >Reporter: Pradyumn Agrawal >Priority: Major > > I am querying on table created in Hive. Getting repetitive exception of > failing to query data with following stacktrace. > > {code:java} > // code placeholder > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. You can set the Spark configuration > setting spark.sql.hive.manageFilesourcePartitions to false to work around > this problem, however this will result in degraded performance. Please report > a bug: https://issues.apache.org/jira/browse/SPARKjava.lang.RuntimeException: > Caught Hive MetaException attempting to get partition metadata by filter from > Hive. You can set the Spark configuration setting > spark.sql.hive.manageFilesourcePartitions to false to work around this > problem, however this will result in degraded performance. 
Please report a > bug: https://issues.apache.org/jira/browse/SPARK at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1258) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1251) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1251) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957) > at > org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326) > at >
[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count
[ https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051765#comment-17051765 ] Hyukjin Kwon commented on SPARK-29058: -- It parses what it needs internally via column pruning. It's kind of a feature. {code} spark.read.csv(path="tmp.csv",mode="DROPMALFORMED",schema=schema).rdd.count() {code} This workaround seems feasible enough, I guess?
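The count mismatch follows directly from the column pruning described in this thread: when a query needs no columns, the typed fields are never converted, so the conversion failure that would mark a row malformed never happens. A stdlib simulation of that interaction (not Spark's parser; the helper and data names are illustrative):

```python
import csv, io

DATA = "apple,red,1,3\nbanana,yellow,2,4.56\norange,orange,3,5\n"
TYPES = [str, str, int, int]  # schema: Fruit, color, price, quantity

def read_rows(needed_cols):
    # Simulates DROPMALFORMED with column pruning: only the requested
    # columns are converted, so conversion errors in untouched columns
    # go unnoticed and the row is kept.
    rows = []
    for rec in csv.reader(io.StringIO(DATA)):
        try:
            rows.append([TYPES[i](rec[i]) for i in needed_cols])
        except ValueError:
            continue  # malformed for one of the columns we actually touched
    return rows

len(read_rows(needed_cols=[]))            # 3: a count() touches no columns
len(read_rows(needed_cols=[0, 1, 2, 3]))  # 2: a show() converts every column
```

The `.rdd.count()` workaround quoted in the comment presumably works because converting to an RDD forces every column to be parsed, matching the `needed_cols=[0, 1, 2, 3]` case in this sketch.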
[jira] [Created] (SPARK-31047) Improve file listing for ViewFileSystem
Manu Zhang created SPARK-31047: -- Summary: Improve file listing for ViewFileSystem Key: SPARK-31047 URL: https://issues.apache.org/jira/browse/SPARK-31047 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 3.1.0 Reporter: Manu Zhang https://issues.apache.org/jira/browse/SPARK-27801 has improved file listing for DistributedFileSystem, where {{InMemoryFileIndex.listLeafFiles}} makes use of DistributedFileSystem's one single {{listLocatedStatus}} to namenode. This ticket intends to improve the case where ViewFileSystem is used to manage multiple DistributedFileSystems. It has also overridden the {{listLocatedStatus}} method by delegating to the filesystem it resolves to, e.g. DistributedFileSystem. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
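The delegation being relied on here can be sketched abstractly: a view filesystem resolves each path to a backing filesystem by mount-point prefix and forwards the bulk listing call, so the single-call benefit from SPARK-27801 would carry over. A hypothetical Python sketch of that pattern (class and method names are illustrative, not Hadoop's actual API):

```python
# Hypothetical sketch of the delegation pattern described above; names
# are illustrative, not Hadoop's actual ViewFileSystem API.
class MockDfs:
    def __init__(self, files):
        self.files = files

    def list_located_status(self, path):
        # One bulk call returning all entries under `path`.
        return [f for f in self.files if f.startswith(path)]

class ViewFs:
    def __init__(self, mounts):
        self.mounts = mounts  # mount-point prefix -> backing filesystem

    def list_located_status(self, path):
        # Resolve by longest matching mount point, then delegate the
        # whole listing to the filesystem the path resolves to.
        prefix = max((m for m in self.mounts if path.startswith(m)), key=len)
        return self.mounts[prefix].list_located_status(path)

view = ViewFs({
    "/warehouse": MockDfs(["/warehouse/t1/part-0", "/warehouse/t1/part-1"]),
    "/logs": MockDfs(["/logs/2020/app.log"]),
})
view.list_located_status("/warehouse/t1")  # both part files via one delegated call
```

The point of the ticket is that file listing code which special-cases DistributedFileSystem can also take the fast path when the filesystem is a view over several of them, since the view forwards the same bulk call.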
[jira] [Comment Edited] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master
[ https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051664#comment-17051664 ] Thomas Graves edited comment on SPARK-31043 at 3/4/20, 10:18 PM: - rebuilt and still see the error. The full exception in the master log is: java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:757) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) at java.net.URLClassLoader.access$100(URLClassLoader.java:74) at java.net.URLClassLoader$1.run(URLClassLoader.java:369) at java.net.URLClassLoader$1.run(URLClassLoader.java:363) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:362) at java.lang.ClassLoader.loadClass(ClassLoader.java:419) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:352) at org.apache.xerces.parsers.AbstractDOMParser.startDocument(Unknown Source) at org.apache.xerces.xinclude.XIncludeHandler.startDocument(Unknown Source) at org.apache.xerces.impl.dtd.XMLDTDValidator.startDocument(Unknown Source) at org.apache.xerces.impl.XMLDocumentScannerImpl.startEntity(Unknown Source) at org.apache.xerces.impl.XMLVersionDetector.startDocumentParsing(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) at org.apache.xerces.parsers.DOMParser.parse(Unknown Source) at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source) at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150) at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2482) at 
org.apache.hadoop.conf.Configuration.parse(Configuration.java:2470) at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2541) at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2494) at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2407) at org.apache.hadoop.conf.Configuration.get(Configuration.java:981) at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1031) at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1432) at org.apache.hadoop.security.SecurityUtil.(SecurityUtil.java:72) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:274) at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:262) at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:807) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:777) at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:650) at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2412) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2412) at org.apache.spark.SecurityManager.(SecurityManager.scala:79) at org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1137) at org.apache.spark.deploy.master.Master$.main(Master.scala:1122) at org.apache.spark.deploy.master.Master.main(Master.scala) Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:419) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ... 44 more was (Author: tgraves): rebuilt and still see the error. 
[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master
[ https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051664#comment-17051664 ] Thomas Graves commented on SPARK-31043: --- rebuilt and still see the error. The full exception in the master log is: Exception in thread "main" java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:757) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) at java.net.URLClassLoader.access$100(URLClassLoader.java:74) at java.net.URLClassLoader$1.run(URLClassLoader.java:369) at java.net.URLClassLoader$1.run(URLClassLoader.java:363) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:362) at java.lang.ClassLoader.loadClass(ClassLoader.java:419) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:352) at org.apache.xerces.parsers.AbstractDOMParser.startDocument(Unknown Source) at org.apache.xerces.xinclude.XIncludeHandler.startDocument(Unknown Source) at org.apache.xerces.impl.dtd.XMLDTDValidator.startDocument(Unknown Source) at org.apache.xerces.impl.XMLDocumentScannerImpl.startEntity(Unknown Source) at org.apache.xerces.impl.XMLVersionDetector.startDocumentParsing(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) at org.apache.xerces.parsers.DOMParser.parse(Unknown Source) at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source) at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150) at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2482) at 
org.apache.hadoop.conf.Configuration.parse(Configuration.java:2470) at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2541) at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2494) at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2407) at org.apache.hadoop.conf.Configuration.get(Configuration.java:981) at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1031) at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1432) at org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:72) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:274) at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:262) at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:807) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:777) at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:650) at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2412) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2412) at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79) at org.apache.spark.deploy.history.HistoryServer$.createSecurityManager(HistoryServer.scala:327) at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:288) at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala) Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:419) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ...
44 more > Spark 3.0 built against hadoop2.7 can't start standalone master > --- > > Key: SPARK-31043 > URL: https://issues.apache.org/jira/browse/SPARK-31043 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Critical > > trying to start a standalone master when building spark branch 3.0 with > hadoop2.7 fails with: > > Exception in thread "main" java.lang.NoClassDefFoundError: > org/w3c/dom/ElementTraversal > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:757) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at > java.net.URLClassLoader.defineClass(URLClassLoader.java:468) > at > java.net
[jira] [Created] (SPARK-31046) Make more efficient and clean up AQE update UI code
Wei Xue created SPARK-31046: --- Summary: Make more efficient and clean up AQE update UI code Key: SPARK-31046 URL: https://issues.apache.org/jira/browse/SPARK-31046 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wei Xue -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master
[ https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051655#comment-17051655 ] Thomas Graves commented on SPARK-31043: --- A couple of my colleagues actually ran into this and reported it to me. I built and saw the same thing. I did a clean when building, but I'll run again just to verify. I was building with: build/mvn -Phadoop-2.7 -Pyarn -Pkinesis-asl -Pkubernetes -Pmesos -Phadoop-cloud -Pspark-ganglia-lgpl clean package -DskipTests 2>&1 | tee out I reverted the one xerces version change commit and rebuilt with the command above, and the error went away. One thing is that I don't have hadoop env variables set - not sure if you do, or have them in your path such that it might be picking up jars from there. Yeah, I actually started looking at other things because it was complaining about xml-apis, so I thought it was weird that the xerces change caused it, but I haven't investigated further. > Spark 3.0 built against hadoop2.7 can't start standalone master > --- > > Key: SPARK-31043 > URL: https://issues.apache.org/jira/browse/SPARK-31043 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Critical > > trying to start a standalone master when building spark branch 3.0 with > hadoop2.7 fails with: > > Exception in thread "main" java.lang.NoClassDefFoundError: > org/w3c/dom/ElementTraversal > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:757) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at > java.net.URLClassLoader.defineClass(URLClassLoader.java:468) > at > java.net.URLClassLoader.access$100(URLClassLoader.java:74) > at > java.net.URLClassLoader$1.run(URLClassLoader.java:369) > ...
> Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal > at > java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:419) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) > ... 42 more
[jira] [Created] (SPARK-31045) Add config for AQE logging level
Wei Xue created SPARK-31045: --- Summary: Add config for AQE logging level Key: SPARK-31045 URL: https://issues.apache.org/jira/browse/SPARK-31045 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wei Xue
[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master
[ https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051643#comment-17051643 ] Sean R. Owen commented on SPARK-31043: -- Hm, I'm not seeing failures for the 3.0 branch or master after this change, for Hadoop 2.7. It sure does look suspiciously related as that class is XML-related. However that class is also not in Xerces, but in xml-apis. You don't by chance have old and new Xerces in your deployment somehow? Anyway this makes me nervous enough relative to the gain, that unless you have a reason to think it's a fluke, I think I'm going to revert it. > Spark 3.0 built against hadoop2.7 can't start standalone master > --- > > Key: SPARK-31043 > URL: https://issues.apache.org/jira/browse/SPARK-31043 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Critical > > trying to start a standalone master when building spark branch 3.0 with > hadoop2.7 fails with: > > Exception in thread "main" java.lang.NoClassDefFoundError: > org/w3c/dom/ElementTraversal > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:757) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at > java.net.URLClassLoader.defineClass(URLClassLoader.java:468) > at > java.net.URLClassLoader.access$100(URLClassLoader.java:74) > at > java.net.URLClassLoader$1.run(URLClassLoader.java:369) > ... > Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal > at > java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:419) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) > ...
42 more
[jira] [Commented] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host
[ https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051630#comment-17051630 ] Thomas Graves commented on SPARK-27651: --- It looks like this only works when using the external shuffle service, is that correct? The way I read the description implies it works from both "from an executor (or the external shuffle service)" so perhaps we should clarify. In both this Jira and the config descriptions. Also was there any technical reasons we didn't support it for executor to executor shuffle? > Avoid the network when block manager fetches shuffle blocks from the same host > -- > > Key: SPARK-27651 > URL: https://issues.apache.org/jira/browse/SPARK-27651 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > > When a shuffle block (content) is fetched the network is always used even > when it is fetched from an executor (or the external shuffle service) running > on the same host. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31044) Support foldable input by `schema_of_json`
Maxim Gekk created SPARK-31044: -- Summary: Support foldable input by `schema_of_json` Key: SPARK-31044 URL: https://issues.apache.org/jira/browse/SPARK-31044 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, the `schema_of_json()` function allows only a string literal as input. This ticket aims to support any foldable string expression.
[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master
[ https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051557#comment-17051557 ] Sean R. Owen commented on SPARK-31043: -- Weird, but I think I have to revert it. It isn't essential enough as an update. I messed up the change in a way that didn't get it tested by the PR builder properly. > Spark 3.0 built against hadoop2.7 can't start standalone master > --- > > Key: SPARK-31043 > URL: https://issues.apache.org/jira/browse/SPARK-31043 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Critical > > trying to start a standalone master when building spark branch 3.0 with > hadoop2.7 fails with: > > Exception in thread "main" java.lang.NoClassDefFoundError: > org/w3c/dom/ElementTraversal > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:757) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at > java.net.URLClassLoader.defineClass(URLClassLoader.java:468) > at > java.net.URLClassLoader.access$100(URLClassLoader.java:74) > at > java.net.URLClassLoader$1.run(URLClassLoader.java:369) > ... > Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal > at > java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:419) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) > ... 42 more
[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master
[ https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051554#comment-17051554 ] Thomas Graves commented on SPARK-31043: --- I'm working on tracing down what broke this. [~srowen] it looks like [SPARK-30994][CORE] Update xerces to 2.12.0 broke this. When I revert that, it works again. > Spark 3.0 built against hadoop2.7 can't start standalone master > --- > > Key: SPARK-31043 > URL: https://issues.apache.org/jira/browse/SPARK-31043 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Critical > > trying to start a standalone master when building spark branch 3.0 with > hadoop2.7 fails with: > > Exception in thread "main" java.lang.NoClassDefFoundError: > org/w3c/dom/ElementTraversal > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:757) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at > java.net.URLClassLoader.defineClass(URLClassLoader.java:468) > at > java.net.URLClassLoader.access$100(URLClassLoader.java:74) > at > java.net.URLClassLoader$1.run(URLClassLoader.java:369) > ... > Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal > at > java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:419) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) > ... 42 more
[jira] [Resolved] (SPARK-30784) Hive 2.3 profile should still use orc-nohive
[ https://issues.apache.org/jira/browse/SPARK-30784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-30784. -- Resolution: Not A Bug Resolving it because with Hive 2.3, using regular orc is required. > Hive 2.3 profile should still use orc-nohive > > > Key: SPARK-30784 > URL: https://issues.apache.org/jira/browse/SPARK-30784 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yin Huai >Priority: Critical > > Originally reported at > [https://github.com/apache/spark/pull/26619#issuecomment-583802901] > > Right now, Hive 2.3 profile pulls in regular orc, which depends on > hive-storage-api. However, hive-storage-api and hive-common have the > following common class files > > org/apache/hadoop/hive/common/ValidReadTxnList.class > org/apache/hadoop/hive/common/ValidTxnList.class > org/apache/hadoop/hive/common/ValidTxnList$RangeResponse.class > For example, > [https://github.com/apache/hive/blob/rel/storage-release-2.6.0/storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java] > (pulled in by orc 1.5.8) and > [https://github.com/apache/hive/blob/rel/release-2.3.6/common/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java] > (from hive-common 2.3.6) both are in the classpath and they are different. > Having both versions in the classpath can cause unexpected behavior due to > classloading order. We should still use orc-nohive, which has > hive-storage-api shaded. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master
Thomas Graves created SPARK-31043: - Summary: Spark 3.0 built against hadoop2.7 can't start standalone master Key: SPARK-31043 URL: https://issues.apache.org/jira/browse/SPARK-31043 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.0 Reporter: Thomas Graves trying to start a standalone master when building spark branch 3.0 with hadoop2.7 fails with: Exception in thread "main" java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:757) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) at java.net.URLClassLoader.access$100(URLClassLoader.java:74) at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ... Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:419) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ... 42 more
[jira] [Updated] (SPARK-31027) Refactor `DataSourceStrategy.scala` to minimize the changes to support nested predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-31027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-31027: Fix Version/s: (was: 3.1.0) 3.0.0 > Refactor `DataSourceStrategy.scala` to minimize the changes to support nested > predicate pushdown > > > Key: SPARK-31027 > URL: https://issues.apache.org/jira/browse/SPARK-31027 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.5 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > >
[jira] [Updated] (SPARK-31029) Occasional class not found error in user's Future code using global ExecutionContext
[ https://issues.apache.org/jira/browse/SPARK-31029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shanyu zhao updated SPARK-31029: Description: *Problem:* When running tpc-ds test (https://github.com/databricks/spark-sql-perf), occasionally we see error related to class not found: 2020-02-04 20:00:26,673 ERROR yarn.ApplicationMaster: User class threw exception: scala.ScalaReflectionException: class com.databricks.spark.sql.perf.ExperimentRun in JavaMirror with sun.misc.Launcher$AppClassLoader@28ba21f3 of type class sun.misc.Launcher$AppClassLoader with classpath [...] and parent being sun.misc.Launcher$ExtClassLoader@3ff5d147 of type class sun.misc.Launcher$ExtClassLoader with classpath [...] and parent being primordial classloader with boot classpath [...] not found. *Root cause:* Spark driver starts ApplicationMaster in the main thread, which starts a user thread and set MutableURLClassLoader to that thread's ContextClassLoader. userClassThread = startUserApplication() The main thread then setup YarnSchedulerBackend RPC endpoints, which handles these calls using scala Future with the default global ExecutionContext: - doRequestTotalExecutors - doKillExecutors If main thread starts a future to handle doKillExecutors() before user thread does then the default thread pool thread's ContextClassLoader would be the default (AppClassLoader). If user thread starts a future first then the thread pool thread will have MutableURLClassLoader. So if user's code uses a future which references a user provided class (only MutableURLClassLoader can load), and before the future if there are executor lost, you will see errors related to class not found. 
*Proposed Solution:* We can potentially solve this problem in one of two ways: 1) Set the same class loader (userClassLoader) to both the main thread and user thread in ApplicationMaster.scala 2) Do not use "ExecutionContext.Implicits.global" in YarnSchedulerBackend was: *Problem:* When running tpc-ds test (https://github.com/databricks/spark-sql-perf), occasionally we see error related to class not found: 2020-02-04 20:00:26,673 ERROR yarn.ApplicationMaster: User class threw exception: scala.ScalaReflectionException: class com.databricks.spark.sql.perf.ExperimentRun in JavaMirror with sun.misc.Launcher$AppClassLoader@28ba21f3 of type class sun.misc.Launcher$AppClassLoader with classpath [...] and parent being sun.misc.Launcher$ExtClassLoader@3ff5d147 of type class sun.misc.Launcher$ExtClassLoader with classpath [...] and parent being primordial classloader with boot classpath [...] not found. *Root cause:* Spark driver starts ApplicationMaster in the main thread, which starts a user thread and set MutableURLClassLoader to that thread's ContextClassLoader. userClassThread = startUserApplication() The main thread then setup YarnSchedulerBackend RPC endpoints, which handles these calls using scala Future with the default global ExecutionContext: - doRequestTotalExecutors - doKillExecutors If main thread starts a future to handle doKillExecutors() before user thread does then the default thread pool thread's ContextClassLoader would be the default (AppClassLoader). If user thread starts a future first then the thread pool thread will have MutableURLClassLoader. So if user's code uses a future which references a user provided class (only MutableURLClassLoader can load), and before the future if there are executor lost, you will see errors related to class not found. 
*Proposed Solution:* Set the same class loader (userClassLoader) to both the main thread and user thread in ApplicationMaster.scala > Occasional class not found error in user's Future code using global > ExecutionContext > > > Key: SPARK-31029 > URL: https://issues.apache.org/jira/browse/SPARK-31029 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.4.5 >Reporter: shanyu zhao >Priority: Major > > *Problem:* > When running tpc-ds test (https://github.com/databricks/spark-sql-perf), > occasionally we see error related to class not found: > 2020-02-04 20:00:26,673 ERROR yarn.ApplicationMaster: User class threw > exception: scala.ScalaReflectionException: class > com.databricks.spark.sql.perf.ExperimentRun in JavaMirror with > sun.misc.Launcher$AppClassLoader@28ba21f3 of type class > sun.misc.Launcher$AppClassLoader with classpath [...] > and parent being sun.misc.Launcher$ExtClassLoader@3ff5d147 of type class > sun.misc.Launcher$ExtClassLoader with classpath [...] > and parent being primordial classloader with boot classpath [...] not found. > *Root cause:* > Spark driver starts ApplicationMaster in the main thread, which starts a user > thread and set
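The race described in SPARK-31029 rests on a general JVM behavior: a pool thread inherits the context classloader of whichever thread happens to create it, and keeps that loader for its lifetime, so whoever wins the race to spawn the global ExecutionContext's workers decides which classloader user Futures load classes with. A minimal, self-contained sketch of that behavior (the class and variable names here are illustrative, not Spark code):

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContextLoaderRace {
    // Ask the (single) worker thread which context classloader it sees.
    static ClassLoader workerLoader(ExecutorService pool) throws Exception {
        Callable<ClassLoader> read = () -> Thread.currentThread().getContextClassLoader();
        return pool.submit(read).get();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        ClassLoader original = Thread.currentThread().getContextClassLoader();

        // The first submission lazily creates the worker thread, which
        // inherits the submitter's *current* context classloader.
        ClassLoader first = workerLoader(pool);

        // Installing a different loader afterwards does not reach the
        // already-created worker -- mirroring how a pool thread spawned
        // before MutableURLClassLoader is installed keeps resolving classes
        // with the plain AppClassLoader.
        ClassLoader custom = new URLClassLoader(new URL[0], original);
        Thread.currentThread().setContextClassLoader(custom);
        ClassLoader second = workerLoader(pool);

        if (first != original || second != original || second == custom) {
            throw new AssertionError("worker picked up an unexpected classloader");
        }
        System.out.println("worker kept the loader it was created with");
        pool.shutdown();
    }
}
```

Both proposed fixes attack exactly this: either make sure every thread that might create pool workers already carries userClassLoader, or avoid the shared lazily-initialized global pool in YarnSchedulerBackend altogether.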
[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count
[ https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051524#comment-17051524 ] Suchintak Patnaik commented on SPARK-29058: --- [~hyukjin.kwon] Any update on this issue? > Reading csv file with DROPMALFORMED showing incorrect record count > -- > > Key: SPARK-29058 > URL: https://issues.apache.org/jira/browse/SPARK-29058 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Suchintak Patnaik >Priority: Minor > > The spark sql csv reader is dropping malformed records as expected, but the > record count is showing as incorrect. > Consider this file (fruit.csv) > {code} > apple,red,1,3 > banana,yellow,2,4.56 > orange,orange,3,5 > {code} > Defining schema as follows: > {code} > schema = "Fruit string,color string,price int,quantity int" > {code} > Notice that the "quantity" field is defined as integer type, but the 2nd row > in the file contains a floating point value, hence it is a corrupt record. > {code} > >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema) > >>> df.show() > +--+--+-++ > | Fruit| color|price|quantity| > +--+--+-++ > | apple| red|1| 3| > |orange|orange|3| 5| > +--+--+-++ > >>> df.count() > 3 > {code} > Malformed record is getting dropped as expected, but incorrect record count > is getting displayed. > Here the df.count() should give value as 2 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31042) Error in writing a pyspark streaming dataframe created from Kafka source to a csv file
Suchintak Patnaik created SPARK-31042: - Summary: Error in writing a pyspark streaming dataframe created from Kafka source to a csv file Key: SPARK-31042 URL: https://issues.apache.org/jira/browse/SPARK-31042 Project: Spark Issue Type: Bug Components: PySpark, Structured Streaming Affects Versions: 2.4.5 Reporter: Suchintak Patnaik Writing a streaming dataframe created from a Kafka source to a csv file gives the following error in PySpark. NOTE: The same streaming dataframe is getting displayed in the console.

sdf.writeStream.format("console").start().awaitTermination()  # working

sdf.writeStream \
    .format("csv") \
    .option("path", "C://output") \
    .option("checkpointLocation", "C://Checkpoint") \
    .outputMode("append") \
    .start().awaitTermination()  # not working

Error - *File "C:\Spark\python\pyspark\sql\utils.py", line 63, in deco return f(*a, **kw) File "C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o63.awaitTermination. : org.apache.spark.sql.streaming.StreamingQueryException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got {"logOffset":1} === Streaming Query === Identifier: [id = 6718625c-489e-44c8-b273-0da3429e97a8, runId = b64887ba-ca32-499e-9ab5-f839fd44ec26] Current Committed Offsets: {KafkaV2[Subscribe[test1]]: {"logOffset":1}} Current Available Offsets: {KafkaV2[Subscribe[test1]]: {"logOffset":1}} Current State: ACTIVE Thread State: RUNNABLE*
[jira] [Updated] (SPARK-31041) Make arguments to make-distribution.sh position-independent
[ https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31041: - Description: This works: {code:java} ./dev/make-distribution.sh \ --pip \ -Phadoop-2.7 -Phive -Phadoop-cloud {code} But this doesn't: {code:java} ./dev/make-distribution.sh \ -Phadoop-2.7 -Phive -Phadoop-cloud \ --pip{code} The latter invocation yields the following, confusing output: {code:java} + VERSION=' -X,--debug Produce execution debug output'{code} was: This works: ``` ./dev/make-distribution.sh \ --pip \ -Phadoop-2.7 -Phive -Phadoop-cloud ``` But this doesn't: ``` ./dev/make-distribution.sh \ -Phadoop-2.7 -Phive -Phadoop-cloud \ --pip ``` > Make arguments to make-distribution.sh position-independent > --- > > Key: SPARK-31041 > URL: https://issues.apache.org/jira/browse/SPARK-31041 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Trivial > > This works: > {code:java} > ./dev/make-distribution.sh \ > --pip \ > -Phadoop-2.7 -Phive -Phadoop-cloud {code} > > But this doesn't: > {code:java} > ./dev/make-distribution.sh \ > -Phadoop-2.7 -Phive -Phadoop-cloud \ > --pip{code} > > The latter invocation yields the following, confusing output: > {code:java} > + VERSION=' -X,--debug Produce execution debug output'{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31041) Make arguments to make-distribution.sh position-independent
[ https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31041: - Description: This works: ``` ./dev/make-distribution.sh \ --pip \ -Phadoop-2.7 -Phive -Phadoop-cloud ``` But this doesn't: ``` ./dev/make-distribution.sh \ -Phadoop-2.7 -Phive -Phadoop-cloud \ --pip ``` was: This works: ``` ./dev/make-distribution.sh \ --pip \ -Phadoop-2.7 -Phive -Phadoop-cloud ``` But this doesn't: ``` ./dev/make-distribution.sh \ -Phadoop-2.7 -Phive -Phadoop-cloud \ --pip ``` > Make arguments to make-distribution.sh position-independent > --- > > Key: SPARK-31041 > URL: https://issues.apache.org/jira/browse/SPARK-31041 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Trivial > > This works: > ``` > ./dev/make-distribution.sh \ > --pip \ > -Phadoop-2.7 -Phive -Phadoop-cloud > ``` > > But this doesn't: > ``` > ./dev/make-distribution.sh \ > -Phadoop-2.7 -Phive -Phadoop-cloud \ > --pip > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31041) Make arguments to make-distribution.sh position-independent
[ https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31041: - Description: This works: ``` ./dev/make-distribution.sh \ --pip \ -Phadoop-2.7 -Phive -Phadoop-cloud ``` But this doesn't: ``` ./dev/make-distribution.sh \ -Phadoop-2.7 -Phive -Phadoop-cloud \ --pip ``` was: This works: ``` ./dev/make-distribution.sh \ --pip \ -Phadoop-2.7 -Phive -Phadoop-cloud ``` But this doesn't: ``` ./dev/make-distribution.sh \ -Phadoop-2.7 -Phive -Phadoop-cloud \ --pip``` > Make arguments to make-distribution.sh position-independent > --- > > Key: SPARK-31041 > URL: https://issues.apache.org/jira/browse/SPARK-31041 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Trivial > > This works: > > ``` > ./dev/make-distribution.sh \ > --pip \ > -Phadoop-2.7 -Phive -Phadoop-cloud > ``` > > But this doesn't: > > ``` > ./dev/make-distribution.sh \ > -Phadoop-2.7 -Phive -Phadoop-cloud \ > --pip > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31041) Make arguments to make-distribution.sh position-independent
[ https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31041: - Summary: Make arguments to make-distribution.sh position-independent (was: Make argument to make-distribution position-independent) > Make arguments to make-distribution.sh position-independent > --- > > Key: SPARK-31041 > URL: https://issues.apache.org/jira/browse/SPARK-31041 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Trivial > > This works: > > ``` > ./dev/make-distribution.sh \ > --pip \ > -Phadoop-2.7 -Phive -Phadoop-cloud > ``` > > But this doesn't: > > ``` > ./dev/make-distribution.sh \ > -Phadoop-2.7 -Phive -Phadoop-cloud \ > --pip``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31041) Make argument to make-distribution position-independent
Nicholas Chammas created SPARK-31041: Summary: Make argument to make-distribution position-independent Key: SPARK-31041 URL: https://issues.apache.org/jira/browse/SPARK-31041 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.1.0 Reporter: Nicholas Chammas This works: ``` ./dev/make-distribution.sh \ --pip \ -Phadoop-2.7 -Phive -Phadoop-cloud ``` But this doesn't: ``` ./dev/make-distribution.sh \ -Phadoop-2.7 -Phive -Phadoop-cloud \ --pip```
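The behavior the ticket asks for can be illustrated with a small Python analogue (a sketch only — the actual fix belongs in the bash option loop of make-distribution.sh; the flag names simply mirror the invocation quoted in the report): flags should parse identically regardless of where they appear.

```python
import argparse

# Position-independent flag parsing, mirroring the reported invocation:
# "--pip" is a boolean switch, and "-P<profile>" may repeat.
parser = argparse.ArgumentParser()
parser.add_argument("--pip", action="store_true")
parser.add_argument("-P", dest="profiles", action="append", default=[])

# The two orderings from the report parse to the same result.
first = parser.parse_args(["--pip", "-Phadoop-2.7", "-Phive", "-Phadoop-cloud"])
second = parser.parse_args(["-Phadoop-2.7", "-Phive", "-Phadoop-cloud", "--pip"])
```

With a loop-based parser like this, `--pip` after the `-P` profiles no longer poisons later variable assignments the way the positional bash parsing does.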
[jira] [Updated] (SPARK-31009) Support json_object_keys function
[ https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31009: --- Affects Version/s: (was: 3.0.0) 3.1.0 > Support json_object_keys function > - > > Key: SPARK-31009 > URL: https://issues.apache.org/jira/browse/SPARK-31009 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > This function will return all the keys from the outer JSON object. > > PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html] > MySQL -> > [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html] > MariaDB -> [https://mariadb.com/kb/en/json-functions/]
[jira] [Updated] (SPARK-31008) Support json_array_length function
[ https://issues.apache.org/jira/browse/SPARK-31008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31008: --- Affects Version/s: (was: 3.0.0) 3.1.0 > Support json_array_length function > -- > > Key: SPARK-31008 > URL: https://issues.apache.org/jira/browse/SPARK-31008 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Priority: Major > > At the moment we don't support the json_array_length function in Spark. > This function is supported by > a.) PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html] > b.) Presto -> [https://prestodb.io/docs/current/functions/json.html] > c.) Redshift -> > [https://docs.aws.amazon.com/redshift/latest/dg/JSON_ARRAY_LENGTH.html] > > It lets users get the array length directly through a well-defined JSON > function.
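Since neither function exists in Spark at the time of these tickets (SPARK-31008 and SPARK-31009 above), here is a plain-Python sketch of the proposed semantics; the null-handling for non-object/non-array input is an assumption modeled on the PostgreSQL functions referenced in the descriptions:

```python
import json

def json_object_keys(s):
    """Keys of the outermost JSON object; None for invalid or non-object input."""
    try:
        value = json.loads(s)
    except (TypeError, ValueError):
        return None
    return list(value.keys()) if isinstance(value, dict) else None

def json_array_length(s):
    """Length of the outermost JSON array; None for invalid or non-array input."""
    try:
        value = json.loads(s)
    except (TypeError, ValueError):
        return None
    return len(value) if isinstance(value, list) else None
```

Note that only the *outer* structure is inspected: nested objects contribute a single key, and nested arrays count as one element.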
[jira] [Resolved] (SPARK-31035) Show assigned resource information for local mode
[ https://issues.apache.org/jira/browse/SPARK-31035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-31035. Resolution: Invalid Resource-aware scheduling doesn't support local mode for now, so I'll close this ticket. > Show assigned resource information for local mode > - > > Key: SPARK-31035 > URL: https://issues.apache.org/jira/browse/SPARK-31035 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > ExecutorsPage shows resource information like GPUs and FPGAs for each > Executor. > But for local mode, resource information is not shown. > It would be useful during application development to be able to confirm this > information from the Web UI.
[jira] [Updated] (SPARK-31040) Offsets are only logged for partitions which had data this causes next batch to read the partitions that were not included from the beginning when using kafka
[ https://issues.apache.org/jira/browse/SPARK-31040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Gilmore updated SPARK-31040: Description: Each batch should either log all offsets for each partition or should scan back across offset logs. [https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala] offset log 23615 {code:java} {"myTopic.myTopic.orders":{"2":27531503,"5":27562423,"4":27528794,"1":27514991,"3":27528899,"0":27504949}}% {code} offset log 23616 {code:java} {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}% {code} {code:java} /0/03/04 13:49:05 INFO MicroBatchExecution: Resuming at batch 26317 with committed offsets {KafkaV2[Subscribe[myTopic.myTopic.orders]]: {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}} and available offsets {KafkaV2[Subscribe[myTopic.myTopic.orders]]: {"myTopic.myTopic.orders":{"2":27531625,"5":27562568,"4":27528990,"1":27515131,"3":27529075,"0":27505141}}}commit log: {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%0/03/04 13:50:24 INFO KafkaMicroBatchReader: Partitions added: Map(myTopic.myTopic.orders-3 -> 26533520, myTopic.myTopic.orders-2 -> 26533730, myTopic.myTopic.orders-4 -> 26533608, myTopic.myTopic.orders-5 -> 26533486) 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-3 starts from 26533520 instead of 0. Some data may have been missed. 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-2 starts from 26533730 instead of 0. Some data may have been missed. 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-4 starts from 26533608 instead of 0. Some data may have been missed. 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-5 starts from 26533486 instead of 0. Some data may have been missed. 
{code} was: Each batch should either log all offsets for each partition or should scan back across commit logs. [https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala] offset log 23615 {code:java} {"myTopic.myTopic.orders":{"2":27531503,"5":27562423,"4":27528794,"1":27514991,"3":27528899,"0":27504949}}% {code} offset log 23616 {code:java} {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}% {code} {code:java} /0/03/04 13:49:05 INFO MicroBatchExecution: Resuming at batch 26317 with committed offsets {KafkaV2[Subscribe[myTopic.myTopic.orders]]: {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}} and available offsets {KafkaV2[Subscribe[myTopic.myTopic.orders]]: {"myTopic.myTopic.orders":{"2":27531625,"5":27562568,"4":27528990,"1":27515131,"3":27529075,"0":27505141}}}commit log: {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%0/03/04 13:50:24 INFO KafkaMicroBatchReader: Partitions added: Map(myTopic.myTopic.orders-3 -> 26533520, myTopic.myTopic.orders-2 -> 26533730, myTopic.myTopic.orders-4 -> 26533608, myTopic.myTopic.orders-5 -> 26533486) 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-3 starts from 26533520 instead of 0. Some data may have been missed. 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-2 starts from 26533730 instead of 0. Some data may have been missed. 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-4 starts from 26533608 instead of 0. Some data may have been missed. 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-5 starts from 26533486 instead of 0. Some data may have been missed. 
{code} > Offsets are only logged for partitions which had data this causes next batch > to read the partitions that were not included from the beginning when using > kafka > -- > > Key: SPARK-31040 > URL: https://issues.apache.org/jira/browse/SPARK-31040 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0, 2.4.4, 2.4.5 >Reporter: Richard Gilmore >Priority: Major > > Each batch should either log all offsets for each partition or should scan > back across offset logs. > [https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala] > offset log 23615 > > {code:java} > {"myTopic.myTopic.orders":{"2":27531503,"5":27562423,"4":27528794,"1":27514991,"3":27528899,"0":27504949}}% > {code} > > > offset log 23616 > > {code:java} > {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}% > {code}
[jira] [Updated] (SPARK-31040) Offsets are only logged for partitions which had data this causes next batch to read the partitions that were not included from the beginning when using kafka
[ https://issues.apache.org/jira/browse/SPARK-31040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Gilmore updated SPARK-31040: Description: Each batch should either log all offsets for each partition or should scan back across commit logs. [https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala] offset log 23615 {code:java} {"myTopic.myTopic.orders":{"2":27531503,"5":27562423,"4":27528794,"1":27514991,"3":27528899,"0":27504949}}% {code} offset log 23616 {code:java} {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}% {code} {code:java} /0/03/04 13:49:05 INFO MicroBatchExecution: Resuming at batch 26317 with committed offsets {KafkaV2[Subscribe[myTopic.myTopic.orders]]: {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}} and available offsets {KafkaV2[Subscribe[myTopic.myTopic.orders]]: {"myTopic.myTopic.orders":{"2":27531625,"5":27562568,"4":27528990,"1":27515131,"3":27529075,"0":27505141}}}commit log: {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%0/03/04 13:50:24 INFO KafkaMicroBatchReader: Partitions added: Map(myTopic.myTopic.orders-3 -> 26533520, myTopic.myTopic.orders-2 -> 26533730, myTopic.myTopic.orders-4 -> 26533608, myTopic.myTopic.orders-5 -> 26533486) 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-3 starts from 26533520 instead of 0. Some data may have been missed. 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-2 starts from 26533730 instead of 0. Some data may have been missed. 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-4 starts from 26533608 instead of 0. Some data may have been missed. 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-5 starts from 26533486 instead of 0. Some data may have been missed. 
{code} was: Each batch should either log all offsets for each partition or should scan back across commit logs. [https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala] offset log 23615 {code:java} {"myTopic.myTopic.orders":{"2":27531503,"5":27562423,"4":27528794,"1":27514991,"3":27528899,"0":27504949}}% {code} offset log 23616 {code:java} {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%Topic {code} {code:java} /0/03/04 13:49:05 INFO MicroBatchExecution: Resuming at batch 26317 with committed offsets {KafkaV2[Subscribe[myTopic.myTopic.orders]]: {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}} and available offsets {KafkaV2[Subscribe[myTopic.myTopic.orders]]: {"myTopic.myTopic.orders":{"2":27531625,"5":27562568,"4":27528990,"1":27515131,"3":27529075,"0":27505141}}}commit log: {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%0/03/04 13:50:24 INFO KafkaMicroBatchReader: Partitions added: Map(myTopic.myTopic.orders-3 -> 26533520, myTopic.myTopic.orders-2 -> 26533730, myTopic.myTopic.orders-4 -> 26533608, myTopic.myTopic.orders-5 -> 26533486) 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-3 starts from 26533520 instead of 0. Some data may have been missed. 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-2 starts from 26533730 instead of 0. Some data may have been missed. 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-4 starts from 26533608 instead of 0. Some data may have been missed. 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-5 starts from 26533486 instead of 0. Some data may have been missed. 
{code} > Offsets are only logged for partitions which had data this causes next batch > to read the partitions that were not included from the beginning when using > kafka > -- > > Key: SPARK-31040 > URL: https://issues.apache.org/jira/browse/SPARK-31040 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0, 2.4.4, 2.4.5 >Reporter: Richard Gilmore >Priority: Major > > Each batch should either log all offsets for each partition or should scan > back across commit logs. > [https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala] > offset log 23615 > > {code:java} > {"myTopic.myTopic.orders":{"2":27531503,"5":27562423,"4":27528794,"1":27514991,"3":27528899,"0":27504949}}% > {code} > > > offset log 23616 > > {code:java} > {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}% >
[jira] [Created] (SPARK-31040) Offsets are only logged for partitions which had data this causes next batch to read the partitions that were not included from the beginning when using kafka
Richard Gilmore created SPARK-31040: --- Summary: Offsets are only logged for partitions which had data this causes next batch to read the partitions that were not included from the beginning when using kafka Key: SPARK-31040 URL: https://issues.apache.org/jira/browse/SPARK-31040 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.5, 2.4.4, 2.4.0 Reporter: Richard Gilmore Each batch should either log all offsets for each partition or should scan back across commit logs. [https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala] offset log 23615 {code:java} {"myTopic.myTopic.orders":{"2":27531503,"5":27562423,"4":27528794,"1":27514991,"3":27528899,"0":27504949}}% {code} offset log 23616 {code:java} {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%Topic {code} {code:java} /0/03/04 13:49:05 INFO MicroBatchExecution: Resuming at batch 26317 with committed offsets {KafkaV2[Subscribe[myTopic.myTopic.orders]]: {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}} and available offsets {KafkaV2[Subscribe[myTopic.myTopic.orders]]: {"myTopic.myTopic.orders":{"2":27531625,"5":27562568,"4":27528990,"1":27515131,"3":27529075,"0":27505141}}}commit log: {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%0/03/04 13:50:24 INFO KafkaMicroBatchReader: Partitions added: Map(myTopic.myTopic.orders-3 -> 26533520, myTopic.myTopic.orders-2 -> 26533730, myTopic.myTopic.orders-4 -> 26533608, myTopic.myTopic.orders-5 -> 26533486) 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-3 starts from 26533520 instead of 0. Some data may have been missed. 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-2 starts from 26533730 instead of 0. Some data may have been missed. 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-4 starts from 26533608 instead of 0. 
Some data may have been missed. 20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition myTopic.myTopic.orders-5 starts from 26533486 instead of 0. Some data may have been missed. {code}
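The "scan back across offset logs" alternative proposed above amounts to merging snapshots newest-first until every partition has an offset. A hedged sketch (not Spark's actual code; the offsets come from the two log entries quoted in the report, where log 23616 only recorded partitions 0 and 1):

```python
def resolve_start_offsets(offset_logs):
    """Merge offset-log snapshots (oldest first in the input), keeping the
    most recent offset seen for each partition."""
    resolved = {}
    for snapshot in reversed(offset_logs):  # walk newest -> oldest
        for partition, offset in snapshot.items():
            resolved.setdefault(partition, offset)  # first (newest) hit wins
    return resolved

log_23615 = {"2": 27531503, "5": 27562423, "4": 27528794,
             "1": 27514991, "3": 27528899, "0": 27504949}
log_23616 = {"1": 27515130, "0": 27505140}
start = resolve_start_offsets([log_23615, log_23616])
```

With this resolution, partitions 2–5 resume from their last logged offsets instead of falling back to the "starts from ... instead of 0" path shown in the warnings.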
[jira] [Commented] (SPARK-31039) Unable to use vendor specific datatypes with JDBC
[ https://issues.apache.org/jira/browse/SPARK-31039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051327#comment-17051327 ] Frank Oosterhuis commented on SPARK-31039: -- As a workaround I have created the table manually and am using the option "truncate" with saveMode "overwrite". You can then just insert "13:17:00" strings :) > Unable to use vendor specific datatypes with JDBC > - > > Key: SPARK-31039 > URL: https://issues.apache.org/jira/browse/SPARK-31039 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Frank Oosterhuis >Priority: Major > > I'm trying to create a table in MSSQL with a time(7) type. > For this I'm using the createTableColumnTypes option like "CallStartTime > time(7)", with driver > "{color:#212121}com.microsoft.sqlserver.jdbc.SQLServerDriver"{color} > I'm getting an error: > {color:#212121}org.apache.spark.sql.catalyst.parser.ParseException: DataType > time(7) is not supported.(line 1, pos 43){color} > {color:#212121}What is then the point of using this option?{color} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31039) Unable to use vendor specific datatypes with JDBC
Frank Oosterhuis created SPARK-31039: Summary: Unable to use vendor specific datatypes with JDBC Key: SPARK-31039 URL: https://issues.apache.org/jira/browse/SPARK-31039 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5 Reporter: Frank Oosterhuis I'm trying to create a table in MSSQL with a time(7) type. For this I'm using the createTableColumnTypes option like "CallStartTime time(7)", with driver "{color:#212121}com.microsoft.sqlserver.jdbc.SQLServerDriver"{color} I'm getting an error: {color:#212121}org.apache.spark.sql.catalyst.parser.ParseException: DataType time(7) is not supported.(line 1, pos 43){color} {color:#212121}What is then the point of using this option?{color} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
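The workaround from the comment above — create the table by hand with the vendor-specific type, then let the overwrite truncate rather than drop and recreate — can be sketched with sqlite3 standing in for SQL Server (table and column names are invented for the example; sqlite tolerates arbitrary type names, so `time(7)` stands in for the MSSQL type Spark's parser rejects):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 1: create the table manually with the vendor-specific column type,
# bypassing Spark's createTableColumnTypes parsing entirely.
conn.execute("CREATE TABLE calls (call_start_time time(7))")

# Step 2: what .option("truncate", "true") with mode("overwrite") achieves --
# empty the table instead of dropping it, so the hand-made schema survives:
conn.execute("DELETE FROM calls")
conn.execute("INSERT INTO calls VALUES ('13:17:00')")  # plain strings, as noted
row = conn.execute("SELECT call_start_time FROM calls").fetchone()
```

In the real PySpark job the two writer options above replace steps here; the key point is that overwrite-with-truncate never issues a CREATE TABLE, so the unsupported type never reaches Spark's DDL parser.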
[jira] [Created] (SPARK-31038) Add checkValue for spark.sql.session.timeZone
Kent Yao created SPARK-31038: Summary: Add checkValue for spark.sql.session.timeZone Key: SPARK-31038 URL: https://issues.apache.org/jira/browse/SPARK-31038 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Kent Yao The `spark.sql.session.timeZone` config accepts any string value, including invalid time zone ids, which then causes later queries that rely on the time zone to fail. We should validate the value in the set phase and fail fast if the zone id is invalid.
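The proposed checkValue amounts to resolving the zone id eagerly when the config is set. A Python sketch using the stdlib (Spark's actual check would go through Java's ZoneId; the function name and error message here are invented for illustration):

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError  # Python 3.9+

def check_session_timezone(tz):
    """Fail fast at set time, rather than at query time, if the id is invalid."""
    try:
        ZoneInfo(tz)  # eager resolution; raises on unknown/malformed ids
    except (ZoneInfoNotFoundError, ValueError):
        raise ValueError(f"Cannot resolve the given timezone: {tz!r}") from None
    return tz
```

Setting the config through such a gate turns a confusing downstream query failure into an immediate, attributable error at `SET` time.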
[jira] [Created] (SPARK-31037) refine AQE config names
Wenchen Fan created SPARK-31037: --- Summary: refine AQE config names Key: SPARK-31037 URL: https://issues.apache.org/jira/browse/SPARK-31037 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan
[jira] [Updated] (SPARK-31009) Support json_object_keys function
[ https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Raushan updated SPARK-31009: --- Description: This function will return all the keys from outer json object. PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html] Mysql -> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html] MariaDB -> [https://mariadb.com/kb/en/json-functions/] was: This function will return all the keys from outer json object. PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html] Mysql -> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html] > Support json_object_keys function > - > > Key: SPARK-31009 > URL: https://issues.apache.org/jira/browse/SPARK-31009 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Priority: Major > > This function will return all the keys from outer json object. > > PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html] > Mysql -> > [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html] > MariaDB -> [https://mariadb.com/kb/en/json-functions/] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31027) Refactor `DataSourceStrategy.scala` to minimize the changes to support nested predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-31027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31027. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27778 [https://github.com/apache/spark/pull/27778] > Refactor `DataSourceStrategy.scala` to minimize the changes to support nested > predicate pushdown > > > Key: SPARK-31027 > URL: https://issues.apache.org/jira/browse/SPARK-31027 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.5 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31027) Refactor `DataSourceStrategy.scala` to minimize the changes to support nested predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-31027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31027: Assignee: DB Tsai > Refactor `DataSourceStrategy.scala` to minimize the changes to support nested > predicate pushdown > > > Key: SPARK-31027 > URL: https://issues.apache.org/jira/browse/SPARK-31027 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.5 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30992) Arrange scattered config of streaming module
[ https://issues.apache.org/jira/browse/SPARK-30992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-30992: --- Summary: Arrange scattered config of streaming module (was: Arrange scattered config for streaming) > Arrange scattered config of streaming module > > > Key: SPARK-30992 > URL: https://issues.apache.org/jira/browse/SPARK-30992 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > I found a lot of scattered configs in the Streaming module. > I think we should gather these configs in one unified place. >
[jira] [Updated] (SPARK-30992) Arrange scattered config for streaming
[ https://issues.apache.org/jira/browse/SPARK-30992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-30992: -- Component/s: (was: Structured Streaming) DStreams > Arrange scattered config for streaming > --- > > Key: SPARK-30992 > URL: https://issues.apache.org/jira/browse/SPARK-30992 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > I found a lot of scattered configs in the Streaming module. > I think we should gather these configs in one unified place. >
[jira] [Commented] (SPARK-31006) Mark Spark streaming as deprecated and add warnings.
[ https://issues.apache.org/jira/browse/SPARK-31006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051179#comment-17051179 ] Gabor Somogyi commented on SPARK-31006: --- I see many users are using DStreams at the moment, so I'm not sure that dropping support is a good message. In general I always suggest using Structured Streaming, but I think it would be good to wait for users to migrate... > Mark Spark streaming as deprecated and add warnings. > > > Key: SPARK-31006 > URL: https://issues.apache.org/jira/browse/SPARK-31006 > Project: Spark > Issue Type: Bug > Components: Documentation, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Prashant Sharma >Priority: Major > > Some users of Spark Streaming do not immediately > realise that it is a deprecated component, and it would be scary if they > ended up with it in production. Now that we are about to release > Spark 3.0.0, maybe we should discuss: should Spark Streaming carry an > explicit notice that it is not under active development?
[jira] [Resolved] (SPARK-31017) Test for shuffle requests packaging with different size and numBlocks limit
[ https://issues.apache.org/jira/browse/SPARK-31017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31017. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27767 [https://github.com/apache/spark/pull/27767] > Test for shuffle requests packaging with different size and numBlocks limit > --- > > Key: SPARK-31017 > URL: https://issues.apache.org/jira/browse/SPARK-31017 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > When packaging shuffle fetch requests in ShuffleBlockFetcherIterator, there > are two limitations: maxBytesInFlight and maxBlocksInFlightPerAddress. > However, we don’t have test cases to test them both, e.g. the size limitation > is hit before the numBlocks limitation. > We should add test cases in ShuffleBlockFetcherIteratorSuite to test: > # the size limitation is hit before the numBlocks limitation > # the numBlocks limitation is hit before the size limitation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31017) Test for shuffle requests packaging with different size and numBlocks limit
[ https://issues.apache.org/jira/browse/SPARK-31017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31017: --- Assignee: wuyi > Test for shuffle requests packaging with different size and numBlocks limit > --- > > Key: SPARK-31017 > URL: https://issues.apache.org/jira/browse/SPARK-31017 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > When packaging shuffle fetch requests in ShuffleBlockFetcherIterator, there > are two limitations: maxBytesInFlight and maxBlocksInFlightPerAddress. > However, we don’t have test cases to test them both, e.g. the size limitation > is hit before the numBlocks limitation. > We should add test cases in ShuffleBlockFetcherIteratorSuite to test: > # the size limitation is hit before the numBlocks limitation > # the numBlocks limitation is hit before the size limitation -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
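The two interleavings SPARK-31017 wants covered — size limit hit before the numBlocks limit, and vice versa — can be pinned down with a small model of the packaging loop (a deliberate simplification of ShuffleBlockFetcherIterator, not its actual code; parameter names echo maxBytesInFlight and maxBlocksInFlightPerAddress):

```python
def package_requests(block_sizes, max_bytes, max_blocks):
    """Group block sizes into fetch requests, cutting a new request when
    adding the next block would exceed the byte budget, or when the
    per-address block-count budget is already full."""
    requests, current, current_bytes = [], [], 0
    for size in block_sizes:
        if current and (current_bytes + size > max_bytes
                        or len(current) >= max_blocks):
            requests.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        requests.append(current)
    return requests
```

The two test cases the ticket asks for fall out directly: with large blocks and a generous block count, the byte budget cuts first; with tiny blocks, the count budget cuts first.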
[jira] [Commented] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051094#comment-17051094 ]

Wenchen Fan commented on SPARK-30951:
-------------------------------------

[~bersprockets] You are making a good point here. It is very hard to roll out the calendar switch smoothly, but we should at least give users a way to read their legacy data. The Hive approach looks good to me. [~maxgekk], can we implement something like that?

> Potential data loss for legacy applications after switch to proleptic Gregorian calendar
> ----------------------------------------------------------------------------------------
>
>          Key: SPARK-30951
>          URL: https://issues.apache.org/jira/browse/SPARK-30951
>      Project: Spark
>   Issue Type: Bug
>   Components: SQL
> Affects Versions: 3.0.0
>     Reporter: Bruce Robbins
>     Priority: Major
>
> tl;dr: We recently discovered some Spark 2.x sites that have lots of data containing dates before October 15, 1582. This could be an issue when such sites try to upgrade to Spark 3.0.
> From SPARK-26651:
> {quote}The changes might impact on the results for dates and timestamps before October 15, 1582 (Gregorian)
> {quote}
> We recently discovered that some large-scale Spark 2.x applications rely on dates before October 15, 1582. Two cases came up recently:
> * An application that uses a commercial third-party library to encode sensitive dates. On insert, the library encodes the actual date as some other date. On select, the library decodes the date back to the original date. The encoded value could be any date, including one before October 15, 1582 (e.g., "0602-04-04").
> * An application that uses a specific unlikely date (e.g., "1200-01-01") as a marker to indicate "unknown date" (in lieu of null).
> Both sites ran into problems after another component in their system was upgraded to use the proleptic Gregorian calendar. Spark applications that read files created by the upgraded component were interpreting encoded or marker dates incorrectly, and vice versa. Also, their data now had a mix of calendars (hybrid and proleptic Gregorian) with no metadata to indicate which file used which calendar.
> Both sites had enormous amounts of existing data, so re-encoding the dates using some other scheme was not a feasible solution.
> This is relevant to Spark 3:
> Any Spark 2 application that uses such date-encoding schemes may run into trouble when run on Spark 3. The application may not properly interpret the dates previously written by Spark 2. Also, once the Spark 3 version of the application writes data, the tables will have a mix of calendars (hybrid and proleptic Gregorian) with no metadata to indicate which file uses which calendar.
> Similarly, sites might run with mixed Spark versions, resulting in data written by one version that cannot be interpreted by the other. As above, the tables will then have a mix of calendars with no way to detect which file uses which calendar.
> As with the two real-life cases above, these applications may have enormous amounts of legacy data, so re-encoding the dates using some other scheme may not be feasible.
> We might want to consider a configuration setting that lets the user specify the calendar for storing and retrieving date and timestamp values (I am not sure how such a flag would affect other date- and timestamp-related functions). I realize the change is far bigger than just adding a configuration setting.
> Here's a quick example of where trouble may happen, using the real-life case of the marker date.
> In Spark 2.4:
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == '1200-01-01'").count
> res0: Long = 1
> scala>
> {noformat}
> In Spark 3.0 (reading from the same legacy file):
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == '1200-01-01'").count
> res0: Long = 0
> scala>
> {noformat}
> By the way, Hive had a similar problem. Hive switched from the hybrid calendar to the proleptic Gregorian calendar between 2.x and 3.x. After some upgrade headaches related to dates before 1582, the Hive community made the following changes:
> * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive checks a configuration setting to determine which calendar to use.
> * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive stores the calendar type in the metadata.
> * When reading date or timestamp data from ORC, Parquet, and Avro files, Hive checks the metadata for the calendar type.
> * When reading date or timestamp data from ORC, Parquet, and Avro files that lack calendar metadata, Hive's behavior is determined by a configuration setting. This allows Hive to read legacy data (note: if the data already
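The hybrid-vs-proleptic mismatch described in this issue can be reproduced outside Spark with plain JDK APIs, since Spark 2.x's date handling was built on the hybrid-calendar `java.util.Calendar` family while Spark 3.0 moved to the proleptic-Gregorian `java.time` classes. A minimal sketch (the class and method names below are illustrative, not Spark code):

```java
import java.time.LocalDate;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class CalendarDrift {
    // Epoch day for the label "1200-01-01" under the hybrid
    // Julian/Gregorian calendar (java.util.GregorianCalendar's default,
    // with the 1582-10-15 cutover) -- roughly what Spark 2.x stored.
    static long hybridEpochDay() {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear(); // zero out the "now" fields so we get midnight UTC
        cal.set(1200, Calendar.JANUARY, 1);
        return Math.floorDiv(cal.getTimeInMillis(), 86_400_000L);
    }

    // Epoch day for the same label under the proleptic Gregorian
    // calendar (java.time's ISO chronology) -- what Spark 3.0 uses.
    static long prolepticEpochDay() {
        return LocalDate.of(1200, 1, 1).toEpochDay();
    }

    public static void main(String[] args) {
        // Julian 1200-01-01 is proleptic Gregorian 1200-01-08, so the same
        // stored day number decodes to dates 7 days apart across the two
        // calendars -- which is why the marker-date filter finds 0 rows.
        System.out.println("drift in days: "
                + (hybridEpochDay() - prolepticEpochDay())); // prints 7
    }
}
```

The 7-day drift matches the historical Julian/Gregorian offset for the year 1200; files written with one interpretation and read with the other silently shift every pre-1582 date by that offset.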
[jira] [Commented] (SPARK-30563) Regressions in Join benchmarks
[ https://issues.apache.org/jira/browse/SPARK-30563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051076#comment-17051076 ]

Maxim Gekk commented on SPARK-30563:
------------------------------------

[~petertoth] If you think it is possible to avoid some overhead of the NoOp datasource, please open a PR.

> Regressions in Join benchmarks
> ------------------------------
>
>          Key: SPARK-30563
>          URL: https://issues.apache.org/jira/browse/SPARK-30563
>      Project: Spark
>   Issue Type: Bug
>   Components: SQL
> Affects Versions: 3.0.0
>     Reporter: Maxim Gekk
>     Priority: Minor
>
> The regenerated benchmark results in https://github.com/apache/spark/pull/27078 show many regressions in JoinBenchmark. The benchmarked queries slowed down by up to 3 times; see
> old results: https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dL10
> new results: https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dR10
> One of the differences is that the new queries use the `NoOp` datasource.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30563) Regressions in Join benchmarks
[ https://issues.apache.org/jira/browse/SPARK-30563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051073#comment-17051073 ]

Maxim Gekk commented on SPARK-30563:
------------------------------------

> we spend a lot of time in this loop even

The loop just forces materialization of the joined rows. With df.groupBy().count(), it seems you skip some steps of the join. I think in most cases users need the results of the join, not just a count on top of it.
[jira] [Resolved] (SPARK-30960) add back the legacy date/timestamp format support in CSV/JSON parser
[ https://issues.apache.org/jira/browse/SPARK-30960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-30960.
----------------------------------
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27710
[https://github.com/apache/spark/pull/27710]

> add back the legacy date/timestamp format support in CSV/JSON parser
> --------------------------------------------------------------------
>
>          Key: SPARK-30960
>          URL: https://issues.apache.org/jira/browse/SPARK-30960
>      Project: Spark
>   Issue Type: Bug
>   Components: SQL
> Affects Versions: 3.0.0
>     Reporter: Wenchen Fan
>     Assignee: Wenchen Fan
>     Priority: Major
>      Fix For: 3.0.0
[jira] [Created] (SPARK-31036) Use stringArgs in Expression.toString to respect hidden parameters
Hyukjin Kwon created SPARK-31036:

         Summary: Use stringArgs in Expression.toString to respect hidden parameters
             Key: SPARK-31036
             URL: https://issues.apache.org/jira/browse/SPARK-31036
         Project: Spark
      Issue Type: Improvement
      Components: SQL
Affects Versions: 3.0.0
        Reporter: Hyukjin Kwon

Currently, on top of https://github.com/apache/spark/pull/27657,

{code}
val identify = udf((input: Seq[Int]) => input)
spark.range(10).select(identify(array("id"))).show()
{code}

shows the hidden parameter `useStringTypeWhenEmpty`:

{code}
+---------------------+
|UDF(array(id, false))|
+---------------------+
|                  [0]|
|                  [1]|
...
{code}

This is a general problem and we should respect hidden parameters.
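The idea behind `stringArgs` can be sketched outside Spark: an expression-like node overrides the list of arguments used for display, so internal flags do not leak into `toString`. A toy illustration (class and field names below are made up for this sketch, not Spark's):

```java
import java.util.List;
import java.util.stream.Collectors;

// A toy expression node: 'children' are user-visible arguments, while the
// boolean flag is an internal implementation detail that should stay hidden.
class ArrayExpr {
    final List<String> children;
    final boolean useStringTypeWhenEmpty; // hidden parameter

    ArrayExpr(List<String> children, boolean useStringTypeWhenEmpty) {
        this.children = children;
        this.useStringTypeWhenEmpty = useStringTypeWhenEmpty;
    }

    // Analogous to Spark's Expression.stringArgs: return only the
    // arguments that are meant to appear in the rendered string.
    protected List<Object> stringArgs() {
        return List.copyOf(children);
    }

    @Override
    public String toString() {
        // Render from stringArgs() instead of all constructor arguments,
        // so the hidden flag never shows up in plans or error messages.
        return "array(" + stringArgs().stream()
                .map(Object::toString)
                .collect(Collectors.joining(", ")) + ")";
    }
}
```

With this pattern, `new ArrayExpr(List.of("id"), false).toString()` renders as `array(id)` rather than the `array(id, false)` seen in the bug report.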
[jira] [Commented] (SPARK-30563) Regressions in Join benchmarks
[ https://issues.apache.org/jira/browse/SPARK-30563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051014#comment-17051014 ]

Peter Toth commented on SPARK-30563:
------------------------------------

[~maxgekk], [~dongjoon], [~hyukjin.kwon] it looks like the change in the {{JoinBenchmark}} ([https://github.com/apache/spark/commit/f5118f81e395bde0cd8253dbef6a9e6455c3958a#diff-da1033f4d10b6046046202dd8f85e3f7L49-R49]) causes this regression. If we used {{df.groupBy().count().noop()}} and measured the same thing as previously, there would be no regression in this suite. Please see the results of running the fixed benchmark on my machine: [https://github.com/peter-toth/spark/commit/207d15d1801cfcf9c40635a481d4aa7192911548]

This is because lots of rows are returned and we spend a lot of time in [this loop|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L438-L442] even though {{NoopWriter}} does nothing.

Another very minor improvement to the `NoOp` datasource could be to turn off using the commit coordinator in {{NoopBatchWrite}}.

Shall I open a PR with these changes (excluding the non-official benchmark result)?