[jira] [Created] (SPARK-34326) "SPARK-31793: FileSourceScanExec metadata should contain limited file paths" fails in some edge-case
Jungtaek Lim created SPARK-34326: Summary: "SPARK-31793: FileSourceScanExec metadata should contain limited file paths" fails in some edge-case Key: SPARK-34326 URL: https://issues.apache.org/jira/browse/SPARK-34326 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Jungtaek Lim Our internal build failed with this test, and it looks like the calculation in the UT misses some points about the format of the location. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34326) "SPARK-31793: FileSourceScanExec metadata should contain limited file paths" fails in some edge-case
[ https://issues.apache.org/jira/browse/SPARK-34326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276918#comment-17276918 ] Jungtaek Lim commented on SPARK-34326: -- Will provide a PR shortly. > "SPARK-31793: FileSourceScanExec metadata should contain limited file paths" > fails in some edge-case > > > Key: SPARK-34326 > URL: https://issues.apache.org/jira/browse/SPARK-34326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Priority: Major > > Our internal build failed with this test, and looks like the calculation in > UT is missing some points about the format of location. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34293) kubernetes executor pod unable to access secure hdfs
[ https://issues.apache.org/jira/browse/SPARK-34293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276913#comment-17276913 ] Manohar Chamaraju commented on SPARK-34293: --- Update: # In client mode by adding fs.defaultFS in core-site.xml fixed the issue for me. # what do to work is usage of hadoop-conf configmap in client mode. > kubernetes executor pod unable to access secure hdfs > > > Key: SPARK-34293 > URL: https://issues.apache.org/jira/browse/SPARK-34293 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.1 >Reporter: Manohar Chamaraju >Priority: Major > Attachments: driver.log, executor.log, > image-2021-01-30-00-13-18-234.png, image-2021-01-30-00-14-14-329.png, > image-2021-01-30-00-14-45-335.png, image-2021-01-30-00-20-54-620.png, > image-2021-01-30-00-33-02-109.png, image-2021-01-30-00-34-05-946.png > > > Steps to reproduce > # Configure secure HDFS(kerberos) cluster running as containers in > kubernetes. > # Configure KDC on centos and create keytab for user principal hdfs, in > hdfsuser.keytab. > # Genearte spark image(v3.0.1), to spawn as container out of spark image. > # Inside spark container, run export HADOOP_CONF_DIR=/etc/hadoop/conf/ with > core-site.xml configuration as below > !image-2021-01-30-00-13-18-234.png! > # Create configmap kbr-conf > !image-2021-01-30-00-14-14-329.png! > # Run the command /opt/spark/bin/spark-submit \ > --deploy-mode client \ > --executor-memory 1g\ > --executor-memory 1g\ > --executor-cores 1\ > --class org.apache.spark.examples.HdfsTest \ > --conf spark.kubernetes.namespace=arcsight-installer-lh7fm\ > --master k8s://[https://172.17.17.1:443|https://172.17.17.1/] \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf spark.app.name=spark-hdfs \ > --conf spark.executer.instances=1 \ > --conf spark.kubernetes.node.selector.spark=yes\ > --conf spark.kubernetes.node.selector.Worker=label\ > --conf spark.kubernetes.container.image=manohar/spark:v3.0.1 \ > --conf spark.kubernetes.kerberos.enabled=true \ > --conf spark.kubernetes.kerberos.krb5.configMapName=krb5-conf \ > --conf spark.kerberos.keytab=/data/hdfsuser.keytab \ > --conf spark.kerberos.principal=h...@dom047600.lab \ > local:///opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar \ > hdfs://hdfs-namenode:30820/staging-directory. > # On running this command driver is able to connect hdfs with kerberos but > execurtor fails to connect to secure hdfs and below is the logs > !image-2021-01-30-00-34-05-946.png! > # Some of observation > ## In Client mode, --conf spark.kubernetes.hadoop.configMapName=hadoop-conf > as not effect only works after HADOOP_CONF_DIR is set. Below was the contents > of hadoop-conf configmap. > !image-2021-01-30-00-20-54-620.png! > ## Ran the command in cluster mode as well, in cluster mode also executor > could not connect to secure hdfs. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34199) Block `count(table.*)` to follow ANSI standard and other SQL engines
[ https://issues.apache.org/jira/browse/SPARK-34199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34199. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31286 [https://github.com/apache/spark/pull/31286] > Block `count(table.*)` to follow ANSI standard and other SQL engines > > > Key: SPARK-34199 > URL: https://issues.apache.org/jira/browse/SPARK-34199 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Linhong Liu >Assignee: Linhong Liu >Priority: Major > Fix For: 3.2.0 > > > In spark, the count(table.*) may cause very weird result, for example: > select count(*) from (select 1 as a, null as b) t; > output: 1 > select count(t.*) from (select 1 as a, null as b) t; > output: 0 > > After checking the ANSI standard, count(*) is always treated as count(1) > while count(t.*) is not allowed. What's more, this is also not allowed by > common databases, e.g. MySQL, oracle. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34199) Block `count(table.*)` to follow ANSI standard and other SQL engines
[ https://issues.apache.org/jira/browse/SPARK-34199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-34199: --- Assignee: Linhong Liu > Block `count(table.*)` to follow ANSI standard and other SQL engines > > > Key: SPARK-34199 > URL: https://issues.apache.org/jira/browse/SPARK-34199 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Linhong Liu >Assignee: Linhong Liu >Priority: Major > > In spark, the count(table.*) may cause very weird result, for example: > select count(*) from (select 1 as a, null as b) t; > output: 1 > select count(t.*) from (select 1 as a, null as b) t; > output: 0 > > After checking the ANSI standard, count(*) is always treated as count(1) > while count(t.*) is not allowed. What's more, this is also not allowed by > common databases, e.g. MySQL, oracle. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33591) NULL is recognized as the "null" string in partition specs
[ https://issues.apache.org/jira/browse/SPARK-33591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276908#comment-17276908 ] Apache Spark commented on SPARK-33591: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/31434 > NULL is recognized as the "null" string in partition specs > -- > > Key: SPARK-33591 > URL: https://issues.apache.org/jira/browse/SPARK-33591 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: correctness > Fix For: 3.0.2, 3.2.0, 3.1.1 > > > For example: > {code:sql} > spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED > BY (p1); > spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0; > spark-sql> SELECT isnull(p1) FROM tbl5; > false > {code} > The *p1 = null* is not recognized as a partition with NULL value. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33591) NULL is recognized as the "null" string in partition specs
[ https://issues.apache.org/jira/browse/SPARK-33591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276907#comment-17276907 ] Apache Spark commented on SPARK-33591: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/31434 > NULL is recognized as the "null" string in partition specs > -- > > Key: SPARK-33591 > URL: https://issues.apache.org/jira/browse/SPARK-33591 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: correctness > Fix For: 3.0.2, 3.2.0, 3.1.1 > > > For example: > {code:sql} > spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED > BY (p1); > spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0; > spark-sql> SELECT isnull(p1) FROM tbl5; > false > {code} > The *p1 = null* is not recognized as a partition with NULL value. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34319) Self-join after cogroup applyInPandas fails due to unresolved conflicting attributes
[ https://issues.apache.org/jira/browse/SPARK-34319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-34319: Assignee: wuyi > Self-join after cogroup applyInPandas fails due to unresolved conflicting > attributes > > > Key: SPARK-34319 > URL: https://issues.apache.org/jira/browse/SPARK-34319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > > {code:java} > df = spark.createDataFrame([(1, 1)], ("column", "value"))row = > df.groupby("ColUmn").cogroup( > df.groupby("COLUMN") > ).applyInPandas(lambda r, l: r + l, "column long, value long") > row.join(row).show() > {code} > {code:java} > Conflicting attributes: column#163321L,value#163322L > ;; > ’Join Inner > :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], > (column#163312L, value#163313L, column#163312L, value#163313L), > [column#163321L, value#163322L] > : :- Project [ColUmn#163312L, column#163312L, value#163313L] > : : +- LogicalRDD [column#163312L, value#163313L], false > : +- Project [COLUMN#163312L, column#163312L, value#163313L] > : +- LogicalRDD [column#163312L, value#163313L], false > +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], > (column#163312L, value#163313L, column#163312L, value#163313L), > [column#163321L, value#163322L] > :- Project [ColUmn#163312L, column#163312L, value#163313L] > : +- LogicalRDD [column#163312L, value#163313L], false > +- Project [COLUMN#163312L, column#163312L, value#163313L] > +- LogicalRDD [column#163312L, value#163313L], false > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34319) Self-join after cogroup applyInPandas fails due to unresolved conflicting attributes
[ https://issues.apache.org/jira/browse/SPARK-34319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34319. -- Fix Version/s: 3.1.2 3.0.2 Resolution: Fixed Issue resolved by pull request 31429 [https://github.com/apache/spark/pull/31429] > Self-join after cogroup applyInPandas fails due to unresolved conflicting > attributes > > > Key: SPARK-34319 > URL: https://issues.apache.org/jira/browse/SPARK-34319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.2, 3.1.2 > > > > {code:java} > df = spark.createDataFrame([(1, 1)], ("column", "value"))row = > df.groupby("ColUmn").cogroup( > df.groupby("COLUMN") > ).applyInPandas(lambda r, l: r + l, "column long, value long") > row.join(row).show() > {code} > {code:java} > Conflicting attributes: column#163321L,value#163322L > ;; > ’Join Inner > :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], > (column#163312L, value#163313L, column#163312L, value#163313L), > [column#163321L, value#163322L] > : :- Project [ColUmn#163312L, column#163312L, value#163313L] > : : +- LogicalRDD [column#163312L, value#163313L], false > : +- Project [COLUMN#163312L, column#163312L, value#163313L] > : +- LogicalRDD [column#163312L, value#163313L], false > +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], > (column#163312L, value#163313L, column#163312L, value#163313L), > [column#163321L, value#163322L] > :- Project [ColUmn#163312L, column#163312L, value#163313L] > : +- LogicalRDD [column#163312L, value#163313L], false > +- Project [COLUMN#163312L, column#163312L, value#163313L] > +- LogicalRDD [column#163312L, value#163313L], false > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276893#comment-17276893 ] L. C. Hsieh commented on SPARK-34198: - Thanks [~kabhwan] for your point. Besides the maintenance cost of the extra code, I remember one concern about adding it is the RocksDB dependency. I think that concern is valid, so it actually does make a difference whether we put it in the sql core module or in an external module. IIUC, that is why we have external modules. If raising a discussion on the dev mailing list helps, I think I will do it. The RocksDB StateStore we are working with is also based on the existing implementation, with our bug fix. So I think the review cost should be as low as possible even if we submit the changed code. Of course, if the original author can contribute the code, that would be great too. And sure, this depends on the consensus we reach eventually. > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation > HDFSBackedStateStore. Actually it uses in-memory map to store state rows. As > there are more and more streaming applications, some of them requires to use > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management. So it is proven to be good choice for large state usage. But > Spark SS still lacks of a built-in state store for the requirement. > We would like to explore the possibility to add RocksDB-based StateStore into > Spark SS. For the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
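For context, a non-default StateStore implementation is wired in through the existing {{spark.sql.streaming.stateStore.providerClass}} configuration (the default being the in-memory-map-backed HDFSBackedStateStoreProvider mentioned above). A minimal sketch, using a hypothetical provider class name rather than any actual Spark or third-party API:

{code:scala}
import org.apache.spark.sql.SparkSession

// Minimal sketch: opting in to a custom state store provider. The class name
// below is hypothetical; whichever module ships the RocksDB provider would
// document the real one.
val spark = SparkSession.builder()
  .appName("rocksdb-state-store-sketch")
  .config("spark.sql.streaming.stateStore.providerClass",
    "com.example.state.RocksDBStateStoreProvider")
  .getOrCreate()
{code}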
[jira] [Closed] (SPARK-29220) Flaky test: org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]
[ https://issues.apache.org/jira/browse/SPARK-29220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-29220. - > Flaky test: > org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large > number of containers and tasks (SPARK-18750) [hadoop-3.2][java11] > -- > > Key: SPARK-29220 > URL: https://issues.apache.org/jira/browse/SPARK-29220 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests, YARN >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Minor > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111229/testReport/] > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/] > {code:java} > Error Messageorg.scalatest.exceptions.TestFailedException: > java.lang.StackOverflowError did not equal > nullStacktracesbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError > did not equal null > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:48) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149) > at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56) > at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) > at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214) > at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56) > at > org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite.run(Suite.scala:1147) > at org.scalatest.Suite.run$(Suite.scala:1129) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233) > at 
org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:507) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.ja
[jira] [Comment Edited] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files
[ https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276874#comment-17276874 ] Attila Zsolt Piros edited comment on SPARK-34194 at 2/2/21, 6:59 AM: - Yes, that is the reason. [~nchammas] so based on this you should consider closing this issue. was (Author: attilapiros): Yes. > Queries that only touch partition columns shouldn't scan through all files > -- > > Key: SPARK-34194 > URL: https://issues.apache.org/jira/browse/SPARK-34194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > When querying only the partition columns of a partitioned table, it seems > that Spark nonetheless scans through all files in the table, even though it > doesn't need to. > Here's an example: > {code:python} > >>> data = spark.read.option('mergeSchema', > >>> 'false').parquet('s3a://some/dataset') > [Stage 0:==> (407 + 12) / > 1158] > {code} > Note the 1158 tasks. This matches the number of partitions in the table, > which is partitioned on a single field named {{file_date}}: > {code:sh} > $ aws s3 ls s3://some/dataset | head -n 3 >PRE file_date=2017-05-01/ >PRE file_date=2017-05-02/ >PRE file_date=2017-05-03/ > $ aws s3 ls s3://some/dataset | wc -l > 1158 > {code} > The table itself has over 138K files, though: > {code:sh} > $ aws s3 ls --recursive --human --summarize s3://some/dataset > ... > Total Objects: 138708 >Total Size: 3.7 TiB > {code} > Now let's try to query just the {{file_date}} field and see what Spark does. > {code:python} > >>> data.select('file_date').orderBy('file_date', > >>> ascending=False).limit(1).explain() > == Physical Plan == > TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], > output=[file_date#11]) > +- *(1) ColumnarToRow >+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: > Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: > [], PushedFilters: [], ReadSchema: struct<> > >>> data.select('file_date').orderBy('file_date', > >>> ascending=False).limit(1).show() > [Stage 2:> (179 + 12) / > 41011] > {code} > Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the > job progresses? I'm not sure. > What I do know is that this operation takes a long time (~20 min) running > from my laptop, whereas to list the top-level {{file_date}} partitions via > the AWS CLI take a second or two. > Spark appears to be going through all the files in the table, when it just > needs to list the partitions captured in the S3 "directory" structure. The > query is only touching {{file_date}}, after all. > The current workaround for this performance problem / optimizer wastefulness, > is to [query the catalog > directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot > of extra work compared to the elegant query against {{file_date}} that users > actually intend. > Spark should somehow know when it is only querying partition fields and skip > iterating through all the individual files in a table. > Tested on Spark 3.0.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
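The catalog-based workaround referenced in the description can be sketched as follows, assuming the S3 dataset has been registered in the metastore as a partitioned table named {{some_dataset}} (a hypothetical name); listing partitions avoids touching the individual data files:

{code:scala}
// Hedged sketch of the workaround: ask the catalog for partitions instead of
// scanning files. SHOW PARTITIONS returns rows like "file_date=2017-05-01".
val latestFileDate = spark
  .sql("SHOW PARTITIONS some_dataset")
  .collect()
  .map(_.getString(0).stripPrefix("file_date="))
  .max   // lexicographic max is fine for ISO-formatted dates

println(latestFileDate)
{code}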
[jira] [Resolved] (SPARK-29220) Flaky test: org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]
[ https://issues.apache.org/jira/browse/SPARK-29220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29220. --- Resolution: Duplicate > Flaky test: > org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large > number of containers and tasks (SPARK-18750) [hadoop-3.2][java11] > -- > > Key: SPARK-29220 > URL: https://issues.apache.org/jira/browse/SPARK-29220 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests, YARN >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Minor > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111229/testReport/] > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/] > {code:java} > Error Messageorg.scalatest.exceptions.TestFailedException: > java.lang.StackOverflowError did not equal > nullStacktracesbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError > did not equal null > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:48) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149) > at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56) > at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) > at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214) > at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56) > at > org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite.run(Suite.scala:1147) > at org.scalatest.Suite.run$(Suite.scala:1129) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233) > at 
org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:507) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.r
[jira] [Commented] (SPARK-29220) Flaky test: org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]
[ https://issues.apache.org/jira/browse/SPARK-29220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276891#comment-17276891 ] Dongjoon Hyun commented on SPARK-29220: --- I agree with you, [~attilapiros]. > Flaky test: > org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large > number of containers and tasks (SPARK-18750) [hadoop-3.2][java11] > -- > > Key: SPARK-29220 > URL: https://issues.apache.org/jira/browse/SPARK-29220 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests, YARN >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Minor > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111229/testReport/] > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/] > {code:java} > Error Messageorg.scalatest.exceptions.TestFailedException: > java.lang.StackOverflowError did not equal > nullStacktracesbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError > did not equal null > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:48) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149) > at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56) > at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) > at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214) > at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56) > at > org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite.run(Suite.scala:1147) > at org.scalatest.Suite.run$(Suite.scala:1129) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at 
org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233) > at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56) > at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:507) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[jira] [Commented] (SPARK-33734) Spark Core ::Spark core versions upto 3.0.1 using interdependency on Jackson-core-asl version 1.9.13, which is having security issues reported.
[ https://issues.apache.org/jira/browse/SPARK-33734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276890#comment-17276890 ] Aparna commented on SPARK-33734: Hi, please provide an update on this: spark-core 3.1.0 also uses [org.apache.avro|https://mvnrepository.com/artifact/org.apache.avro] version 1.8.2, which pulls in [jackson-core-asl|https://mvnrepository.com/artifact/org.codehaus.jackson/jackson-core-asl] version 1.9.13. Details of the security issues are shared in the previous comments. Please provide an update. > Spark Core ::Spark core versions upto 3.0.1 using interdependency on > Jackson-core-asl version 1.9.13, which is having security issues reported. > > > Key: SPARK-33734 > URL: https://issues.apache.org/jira/browse/SPARK-33734 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Aparna >Priority: Major > > spark-core version upto latest 3.0.1 is using dependency > [org.apache.avro|https://mvnrepository.com/artifact/org.apache.avro] version > 1.8.2 which is having > [jackson-core-asl|https://mvnrepository.com/artifact/org.codehaus.jackson/jackson-core-asl] > version 1.9.13 which has security issues. > Please fix and share the new version. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
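Until the upstream dependency is bumped, one user-side mitigation is to exclude the transitive artifact in the application build, assuming the application does not exercise the Avro 1.8.2 code paths that need it at runtime. A build.sbt sketch:

{code:scala}
// build.sbt sketch (sbt Scala DSL), a user-side mitigation only:
// drop the transitive jackson-core-asl that arrives via avro 1.8.2.
libraryDependencies += ("org.apache.spark" %% "spark-core" % "3.0.1")
  .exclude("org.codehaus.jackson", "jackson-core-asl")
{code}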
[jira] [Commented] (SPARK-34325) remove_shuffleBlockResolver_in_SortShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276887#comment-17276887 ] Apache Spark commented on SPARK-34325: -- User 'offthewall123' has created a pull request for this issue: https://github.com/apache/spark/pull/31433 > remove_shuffleBlockResolver_in_SortShuffleWriter > > > Key: SPARK-34325 > URL: https://issues.apache.org/jira/browse/SPARK-34325 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Xudingyu >Priority: Major > > shuffleBlockResolver in SortShuffleWriter is not used, can remove it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34325) remove_shuffleBlockResolver_in_SortShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34325: Assignee: Apache Spark > remove_shuffleBlockResolver_in_SortShuffleWriter > > > Key: SPARK-34325 > URL: https://issues.apache.org/jira/browse/SPARK-34325 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Xudingyu >Assignee: Apache Spark >Priority: Major > > shuffleBlockResolver in SortShuffleWriter is not used, can remove it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34325) remove_shuffleBlockResolver_in_SortShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276886#comment-17276886 ] Apache Spark commented on SPARK-34325: -- User 'offthewall123' has created a pull request for this issue: https://github.com/apache/spark/pull/31433 > remove_shuffleBlockResolver_in_SortShuffleWriter > > > Key: SPARK-34325 > URL: https://issues.apache.org/jira/browse/SPARK-34325 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Xudingyu >Priority: Major > > shuffleBlockResolver in SortShuffleWriter is not used, can remove it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34325) remove_shuffleBlockResolver_in_SortShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34325: Assignee: (was: Apache Spark) > remove_shuffleBlockResolver_in_SortShuffleWriter > > > Key: SPARK-34325 > URL: https://issues.apache.org/jira/browse/SPARK-34325 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Xudingyu >Priority: Major > > shuffleBlockResolver in SortShuffleWriter is not used, can remove it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34325) remove_shuffleBlockResolver_in_SortShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34325: Assignee: Apache Spark > remove_shuffleBlockResolver_in_SortShuffleWriter > > > Key: SPARK-34325 > URL: https://issues.apache.org/jira/browse/SPARK-34325 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Xudingyu >Assignee: Apache Spark >Priority: Major > > shuffleBlockResolver in SortShuffleWriter is not used, can remove it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34309) Use Caffeine instead of Guava Cache
[ https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276885#comment-17276885 ] Dongjoon Hyun commented on SPARK-34309: --- Oh my. :( > Use Caffeine instead of Guava Cache > --- > > Key: SPARK-34309 > URL: https://issues.apache.org/jira/browse/SPARK-34309 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > Caffeine is a high performance, near optimal caching library based on Java 8, > it is used in a similar way to guava cache, but with better performance. The > comparison results are as follow are on the [caffeine benchmarks > |https://github.com/ben-manes/caffeine/wiki/Benchmarks] > At the same time, caffeine has been used in some open source projects like > Cassandra, Hbase, Neo4j, Druid, Spring and so on. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
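The migration the ticket proposes is mostly mechanical because the two builder APIs mirror each other. A rough side-by-side sketch (not Spark's actual internal code):

{code:scala}
import java.util.concurrent.TimeUnit

import com.github.benmanes.caffeine.cache.Caffeine
import com.google.common.cache.CacheBuilder

// Guava cache, as used today:
val guavaCache = CacheBuilder.newBuilder()
  .maximumSize(1000)
  .expireAfterWrite(10, TimeUnit.MINUTES)
  .build[String, String]()

// Caffeine equivalent, near drop-in:
val caffeineCache = Caffeine.newBuilder()
  .maximumSize(1000)
  .expireAfterWrite(10, TimeUnit.MINUTES)
  .build[String, String]()

guavaCache.put("k", "v")
caffeineCache.put("k", "v")
assert(caffeineCache.getIfPresent("k") == "v")
{code}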
[jira] [Updated] (SPARK-34325) remove_shuffleBlockResolver_in_SortShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xudingyu updated SPARK-34325: - Description: shuffleBlockResolver in SortShuffleWriter is not used, can remove it. (was: shuffleBlockResolver in SortShuffleWriter is not used.) > remove_shuffleBlockResolver_in_SortShuffleWriter > > > Key: SPARK-34325 > URL: https://issues.apache.org/jira/browse/SPARK-34325 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Xudingyu >Priority: Major > > shuffleBlockResolver in SortShuffleWriter is not used, can remove it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34322) When refreshing a non-temporary view, also refresh its underlying tables
[ https://issues.apache.org/jira/browse/SPARK-34322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-34322: Description: For a view, there might be several underlying tables. In long-running Spark server use cases, such as Zeppelin, Kyuubi, and Livy, if a table is updated, we need to refresh it in the current long-running Spark session. But if the table is a view, we need to refresh its underlying tables one by one. > When refreshing a non-temporary view, also refresh its underlying tables > > > Key: SPARK-34322 > URL: https://issues.apache.org/jira/browse/SPARK-34322 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Priority: Major > > For a view, there might be several underlying tables. > In long-running Spark server use cases, such as Zeppelin, Kyuubi, and Livy, > if a table is updated, we need to refresh it in the current long-running Spark > session. > But if the table is a view, we need to refresh its underlying tables one by one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
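What the ticket asks Spark to do automatically can be done by hand today in a long-running session; a sketch with hypothetical table and view names:

{code:scala}
// Manual workaround sketch: refresh each underlying table, then the view.
// Table and view names are hypothetical.
spark.sql("REFRESH TABLE base_table_a")
spark.sql("REFRESH TABLE base_table_b")
spark.catalog.refreshTable("my_view")
{code}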
[jira] [Updated] (SPARK-34325) remove_shuffleBlockResolver_in_SortShuffleWriter
[ https://issues.apache.org/jira/browse/SPARK-34325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xudingyu updated SPARK-34325: - Description: shuffleBlockResolver in SortShuffleWriter is not used. > remove_shuffleBlockResolver_in_SortShuffleWriter > > > Key: SPARK-34325 > URL: https://issues.apache.org/jira/browse/SPARK-34325 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Xudingyu >Priority: Major > > shuffleBlockResolver in SortShuffleWriter is not used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34325) remove_shuffleBlockResolver_in_SortShuffleWriter
Xudingyu created SPARK-34325: Summary: remove_shuffleBlockResolver_in_SortShuffleWriter Key: SPARK-34325 URL: https://issues.apache.org/jira/browse/SPARK-34325 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.1 Reporter: Xudingyu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276883#comment-17276883 ] Jungtaek Lim commented on SPARK-34198: -- The external modules means modules in external directory. Personally I don't think there's huge difference between adding it in spark-sql core module vs adding it via external module. The major point of this is whether we want to add the functionality to Spark codebase or not. As we already confirmed there're concerns on adding this in Spark codebase, unless you raise the discussion in dev@ mailing list and gather consensus, the effort can be easily wasted. Please make sure we don't have such case. And once we decide to add this, I'd rather say I'd like to see either we persuade repo owner to contribute well-known existing implementation (https://github.com/chermenin/spark-states) to ASF, or new PR based on #24922. I wouldn't like to review multiple PRs again and again for the same functionality. > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation > HDFSBackedStateStore. Actually it uses in-memory map to store state rows. As > there are more and more streaming applications, some of them requires to use > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management. So it is proven to be good choice for large state usage. But > Spark SS still lacks of a built-in state store for the requirement. > We would like to explore the possibility to add RocksDB-based StateStore into > Spark SS. For the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276882#comment-17276882 ] L. C. Hsieh commented on SPARK-34198: - For external module here, I mean to put the related code under external/ along with other external modules like avro, kafka-0-10-sql, etc. > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation > HDFSBackedStateStore. Actually it uses in-memory map to store state rows. As > there are more and more streaming applications, some of them requires to use > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management. So it is proven to be good choice for large state usage. But > Spark SS still lacks of a built-in state store for the requirement. > We would like to explore the possibility to add RocksDB-based StateStore into > Spark SS. For the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34324) FileTable should not list TRUNCATE in capabilities by default
[ https://issues.apache.org/jira/browse/SPARK-34324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34324: Assignee: Apache Spark (was: L. C. Hsieh) > FileTable should not list TRUNCATE in capabilities by default > - > > Key: SPARK-34324 > URL: https://issues.apache.org/jira/browse/SPARK-34324 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > abstract class {{FileTable}} now lists {{TRUNCATE}} in its {{capabilities}}, > but {{FileTable}} does not know if an implementation really supports > truncation or not. Specifically, we can check existing {{FileTable}} > implementations including {{AvroTable}}, {{CSVTable}}, {{JsonTable}}, etc. No > one implementation really implements {{SupportsTruncate}} in its writer > builder. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34324) FileTable should not list TRUNCATE in capabilities by default
[ https://issues.apache.org/jira/browse/SPARK-34324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34324: Assignee: L. C. Hsieh (was: Apache Spark) > FileTable should not list TRUNCATE in capabilities by default > - > > Key: SPARK-34324 > URL: https://issues.apache.org/jira/browse/SPARK-34324 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > abstract class {{FileTable}} now lists {{TRUNCATE}} in its {{capabilities}}, > but {{FileTable}} does not know if an implementation really supports > truncation or not. Specifically, we can check existing {{FileTable}} > implementations including {{AvroTable}}, {{CSVTable}}, {{JsonTable}}, etc. No > one implementation really implements {{SupportsTruncate}} in its writer > builder. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34324) FileTable should not list TRUNCATE in capabilities by default
[ https://issues.apache.org/jira/browse/SPARK-34324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276880#comment-17276880 ] Apache Spark commented on SPARK-34324: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/31432 > FileTable should not list TRUNCATE in capabilities by default > - > > Key: SPARK-34324 > URL: https://issues.apache.org/jira/browse/SPARK-34324 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > abstract class {{FileTable}} now lists {{TRUNCATE}} in its {{capabilities}}, > but {{FileTable}} does not know if an implementation really supports > truncation or not. Specifically, we can check existing {{FileTable}} > implementations including {{AvroTable}}, {{CSVTable}}, {{JsonTable}}, etc. No > one implementation really implements {{SupportsTruncate}} in its writer > builder. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34324) FileTable should not list TRUNCATE in capabilities by default
L. C. Hsieh created SPARK-34324: --- Summary: FileTable should not list TRUNCATE in capabilities by default Key: SPARK-34324 URL: https://issues.apache.org/jira/browse/SPARK-34324 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh abstract class {{FileTable}} now lists {{TRUNCATE}} in its {{capabilities}}, but {{FileTable}} does not know if an implementation really supports truncation or not. Specifically, we can check existing {{FileTable}} implementations including {{AvroTable}}, {{CSVTable}}, {{JsonTable}}, etc. No one implementation really implements {{SupportsTruncate}} in its writer builder. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
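For reference, "really implements {{SupportsTruncate}} in its writer builder" would look roughly like the hypothetical sketch below (the class name and behavior are illustrative, not an existing Spark implementation):

{code:scala}
import org.apache.spark.sql.connector.write.{BatchWrite, SupportsTruncate, WriteBuilder}

// Hypothetical writer builder that genuinely supports truncation, i.e. what a
// FileTable implementation would need before advertising the TRUNCATE capability.
class TruncatingWriteBuilder extends WriteBuilder with SupportsTruncate {
  private var doTruncate = false

  // Spark calls this when the whole table is to be overwritten.
  override def truncate(): WriteBuilder = {
    doTruncate = true
    this
  }

  override def buildForBatch(): BatchWrite = {
    // A real implementation would return a BatchWrite whose commit first removes
    // the existing files when doTruncate is set; omitted in this sketch.
    throw new UnsupportedOperationException("sketch only")
  }
}
{code}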
[jira] [Commented] (SPARK-34322) When refreshing a non-temporary view, also refresh its underlying tables
[ https://issues.apache.org/jira/browse/SPARK-34322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276878#comment-17276878 ] Apache Spark commented on SPARK-34322: -- User 'turboFei' has created a pull request for this issue: https://github.com/apache/spark/pull/31431 > When refreshing a non-temporary view, also refresh its underlying tables > > > Key: SPARK-34322 > URL: https://issues.apache.org/jira/browse/SPARK-34322 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34322) When refreshing a non-temporary view, also refresh its underlying tables
[ https://issues.apache.org/jira/browse/SPARK-34322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34322: Assignee: (was: Apache Spark) > When refreshing a non-temporary view, also refresh its underlying tables > > > Key: SPARK-34322 > URL: https://issues.apache.org/jira/browse/SPARK-34322 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34322) When refreshing a non-temporary view, also refresh its underlying tables
[ https://issues.apache.org/jira/browse/SPARK-34322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276877#comment-17276877 ] Apache Spark commented on SPARK-34322: -- User 'turboFei' has created a pull request for this issue: https://github.com/apache/spark/pull/31431 > When refreshing a non-temporary view, also refresh its underlying tables > > > Key: SPARK-34322 > URL: https://issues.apache.org/jira/browse/SPARK-34322 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34322) When refreshing a non-temporary view, also refresh its underlying tables
[ https://issues.apache.org/jira/browse/SPARK-34322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34322: Assignee: Apache Spark > When refreshing a non-temporary view, also refresh its underlying tables > > > Key: SPARK-34322 > URL: https://issues.apache.org/jira/browse/SPARK-34322 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34316) Support spark.kubernetes.executor.disableConfigMap
[ https://issues.apache.org/jira/browse/SPARK-34316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34316: -- Summary: Support spark.kubernetes.executor.disableConfigMap (was: Optional Propagation of SPARK_CONF_DIR in K8s) > Support spark.kubernetes.executor.disableConfigMap > -- > > Key: SPARK-34316 > URL: https://issues.apache.org/jira/browse/SPARK-34316 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Zhou JIANG >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.2.0 > > > In shared Kubernetes clusters, Spark could be restricted from creating and > deleting config maps in job namespaces. > It would be helpful if the current mandatory config map creation could be > optional. User may still take responsibility of handing Spark conf files > separately. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34316) Optional Propagation of SPARK_CONF_DIR in K8s
[ https://issues.apache.org/jira/browse/SPARK-34316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34316: - Assignee: Dongjoon Hyun > Optional Propagation of SPARK_CONF_DIR in K8s > - > > Key: SPARK-34316 > URL: https://issues.apache.org/jira/browse/SPARK-34316 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Zhou JIANG >Assignee: Dongjoon Hyun >Priority: Major > > In shared Kubernetes clusters, Spark could be restricted from creating and > deleting config maps in job namespaces. > It would be helpful if the current mandatory config map creation could be > optional. User may still take responsibility of handing Spark conf files > separately. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34316) Optional Propagation of SPARK_CONF_DIR in K8s
[ https://issues.apache.org/jira/browse/SPARK-34316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34316. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31428 [https://github.com/apache/spark/pull/31428] > Optional Propagation of SPARK_CONF_DIR in K8s > - > > Key: SPARK-34316 > URL: https://issues.apache.org/jira/browse/SPARK-34316 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Zhou JIANG >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.2.0 > > > In shared Kubernetes clusters, Spark could be restricted from creating and > deleting config maps in job namespaces. > It would be helpful if the current mandatory config map creation could be > optional. User may still take responsibility of handing Spark conf files > separately. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34314) Wrong discovered partition value
[ https://issues.apache.org/jira/browse/SPARK-34314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-34314: --- Affects Version/s: 3.1.0 3.0.2 2.4.8 > Wrong discovered partition value > > > Key: SPARK-34314 > URL: https://issues.apache.org/jira/browse/SPARK-34314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > The example below portrays the issue: > {code:scala} > val df = Seq((0, "AA"), (1, "-0")).toDF("id", "part") > df.write > .partitionBy("part") > .format("parquet") > .save(path) > val readback = spark.read.parquet(path) > readback.printSchema() > readback.show(false) > {code} > It writes the partition values as strings: > {code} > /private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/spark-e09eae99-7ecf-4ab2-b99b-f63f8dea658d > ├── _SUCCESS > ├── part=-0 > │ └── part-1-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet > └── part=AA > └── part-0-02144398-2896-4d21-9628-a8743d098cb4.c000.snappy.parquet > {code} > *"-0"* and "AA". > But when Spark reads the data back, it transforms "-0" to "0": > {code} > root > |-- id: integer (nullable = true) > |-- part: string (nullable = true) > +---++ > |id |part| > +---++ > |0 |AA | > |1 |0 | > +---++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
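Side note on the report above (not from the ticket): the coercion of "-0" into "0" happens during partition value type inference at read time. A minimal sketch of a possible mitigation, assuming the standard SQL option name and an illustrative path, is to disable inference so discovered partition values stay raw strings:

{code:scala}
// Sketch only: keep discovered partition values as strings instead of inferring types,
// so "-0" is not parsed into the numeric value 0. Path and data are illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-34314-sketch").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

val path = "/tmp/spark-34314-example"
Seq((0, "AA"), (1, "-0")).toDF("id", "part")
  .write.partitionBy("part").format("parquet").mode("overwrite").save(path)

// With inference disabled, `part` is read back as a plain string column and "-0" survives.
spark.read.parquet(path).show(false)
{code}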
[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276875#comment-17276875 ] Cheng Su commented on SPARK-34198: -- Sorry, I think my question was not very clear. I am aware of [https://github.com/apache/spark/pull/24922] and the concern back then. What I am not sure about is what "external module" means here. Would you mind explaining that a bit more (or is there an existing example in the current codebase I can refer to)? For more context, we are also working on a RocksDB state store internally, based on the above PR, so I would like to check whether there is anything here we should watch out for when backporting, thanks. > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation > HDFSBackedStateStore. Actually it uses in-memory map to store state rows. As > there are more and more streaming applications, some of them requires to use > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management. So it is proven to be good choice for large state usage. But > Spark SS still lacks of a built-in state store for the requirement. > We would like to explore the possibility to add RocksDB-based StateStore into > Spark SS. For the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files
[ https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276874#comment-17276874 ] Attila Zsolt Piros commented on SPARK-34194: Yes. > Queries that only touch partition columns shouldn't scan through all files > -- > > Key: SPARK-34194 > URL: https://issues.apache.org/jira/browse/SPARK-34194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > When querying only the partition columns of a partitioned table, it seems > that Spark nonetheless scans through all files in the table, even though it > doesn't need to. > Here's an example: > {code:python} > >>> data = spark.read.option('mergeSchema', > >>> 'false').parquet('s3a://some/dataset') > [Stage 0:==> (407 + 12) / > 1158] > {code} > Note the 1158 tasks. This matches the number of partitions in the table, > which is partitioned on a single field named {{file_date}}: > {code:sh} > $ aws s3 ls s3://some/dataset | head -n 3 >PRE file_date=2017-05-01/ >PRE file_date=2017-05-02/ >PRE file_date=2017-05-03/ > $ aws s3 ls s3://some/dataset | wc -l > 1158 > {code} > The table itself has over 138K files, though: > {code:sh} > $ aws s3 ls --recursive --human --summarize s3://some/dataset > ... > Total Objects: 138708 >Total Size: 3.7 TiB > {code} > Now let's try to query just the {{file_date}} field and see what Spark does. > {code:python} > >>> data.select('file_date').orderBy('file_date', > >>> ascending=False).limit(1).explain() > == Physical Plan == > TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], > output=[file_date#11]) > +- *(1) ColumnarToRow >+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: > Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: > [], PushedFilters: [], ReadSchema: struct<> > >>> data.select('file_date').orderBy('file_date', > >>> ascending=False).limit(1).show() > [Stage 2:> (179 + 12) / > 41011] > {code} > Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the > job progresses? I'm not sure. > What I do know is that this operation takes a long time (~20 min) running > from my laptop, whereas to list the top-level {{file_date}} partitions via > the AWS CLI take a second or two. > Spark appears to be going through all the files in the table, when it just > needs to list the partitions captured in the S3 "directory" structure. The > query is only touching {{file_date}}, after all. > The current workaround for this performance problem / optimizer wastefulness, > is to [query the catalog > directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot > of extra work compared to the elegant query against {{file_date}} that users > actually intend. > Spark should somehow know when it is only querying partition fields and skip > iterating through all the individual files in a table. > Tested on Spark 3.0.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
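For the "query the catalog directly" workaround mentioned in the description, a rough sketch is below. It assumes the dataset is registered as a partitioned table in the metastore (the database and table names are illustrative); SHOW PARTITIONS reads only partition metadata, so no data files are listed or scanned:

{code:scala}
// Hedged sketch: find the latest file_date partition from catalog metadata alone.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{expr, max}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

val latest = spark.sql("SHOW PARTITIONS some_db.some_dataset")        // rows like "file_date=2017-05-01"
  .select(expr("substring_index(partition, '=', -1)").as("file_date")) // keep only the partition value
  .agg(max("file_date"))
  .first()
  .getString(0)

println(s"Latest file_date partition: $latest")
{code}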
[jira] [Assigned] (SPARK-34323) Upgrade zstd-jni to 1.4.8-3
[ https://issues.apache.org/jira/browse/SPARK-34323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34323: Assignee: (was: Apache Spark) > Upgrade zstd-jni to 1.4.8-3 > --- > > Key: SPARK-34323 > URL: https://issues.apache.org/jira/browse/SPARK-34323 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34323) Upgrade zstd-jni to 1.4.8-3
[ https://issues.apache.org/jira/browse/SPARK-34323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276873#comment-17276873 ] Apache Spark commented on SPARK-34323: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/31430 > Upgrade zstd-jni to 1.4.8-3 > --- > > Key: SPARK-34323 > URL: https://issues.apache.org/jira/browse/SPARK-34323 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34323) Upgrade zstd-jni to 1.4.8-3
[ https://issues.apache.org/jira/browse/SPARK-34323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34323: Assignee: Apache Spark > Upgrade zstd-jni to 1.4.8-3 > --- > > Key: SPARK-34323 > URL: https://issues.apache.org/jira/browse/SPARK-34323 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: William Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34323) Upgrade zstd-jni to 1.4.8-3
William Hyun created SPARK-34323: Summary: Upgrade zstd-jni to 1.4.8-3 Key: SPARK-34323 URL: https://issues.apache.org/jira/browse/SPARK-34323 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.2.0 Reporter: William Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34322) When refreshing a non-temporary view, also refresh its underlying tables
feiwang created SPARK-34322: --- Summary: When refreshing a non-temporary view, also refresh its underlying tables Key: SPARK-34322 URL: https://issues.apache.org/jira/browse/SPARK-34322 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.1 Reporter: feiwang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files
[ https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276869#comment-17276869 ] Nicholas Chammas edited comment on SPARK-34194 at 2/2/21, 5:56 AM: --- Interesting reference, [~attilapiros]. It looks like that config is internal to Spark and was [deprecated in Spark 3.0|https://github.com/apache/spark/blob/bec80d7eec91ee83fcbb0e022b33bd526c80f423/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L918-L929] due to the correctness issue mentioned in that warning and documented in SPARK-26709. was (Author: nchammas): Interesting reference, [~attilapiros]. It looks like that config was [deprecated in Spark 3.0|https://github.com/apache/spark/blob/bec80d7eec91ee83fcbb0e022b33bd526c80f423/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L918-L929] due to the correctness issue mentioned in that warning and documented in SPARK-26709. > Queries that only touch partition columns shouldn't scan through all files > -- > > Key: SPARK-34194 > URL: https://issues.apache.org/jira/browse/SPARK-34194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > When querying only the partition columns of a partitioned table, it seems > that Spark nonetheless scans through all files in the table, even though it > doesn't need to. > Here's an example: > {code:python} > >>> data = spark.read.option('mergeSchema', > >>> 'false').parquet('s3a://some/dataset') > [Stage 0:==> (407 + 12) / > 1158] > {code} > Note the 1158 tasks. This matches the number of partitions in the table, > which is partitioned on a single field named {{file_date}}: > {code:sh} > $ aws s3 ls s3://some/dataset | head -n 3 >PRE file_date=2017-05-01/ >PRE file_date=2017-05-02/ >PRE file_date=2017-05-03/ > $ aws s3 ls s3://some/dataset | wc -l > 1158 > {code} > The table itself has over 138K files, though: > {code:sh} > $ aws s3 ls --recursive --human --summarize s3://some/dataset > ... > Total Objects: 138708 >Total Size: 3.7 TiB > {code} > Now let's try to query just the {{file_date}} field and see what Spark does. > {code:python} > >>> data.select('file_date').orderBy('file_date', > >>> ascending=False).limit(1).explain() > == Physical Plan == > TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], > output=[file_date#11]) > +- *(1) ColumnarToRow >+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: > Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: > [], PushedFilters: [], ReadSchema: struct<> > >>> data.select('file_date').orderBy('file_date', > >>> ascending=False).limit(1).show() > [Stage 2:> (179 + 12) / > 41011] > {code} > Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the > job progresses? I'm not sure. > What I do know is that this operation takes a long time (~20 min) running > from my laptop, whereas to list the top-level {{file_date}} partitions via > the AWS CLI take a second or two. > Spark appears to be going through all the files in the table, when it just > needs to list the partitions captured in the S3 "directory" structure. The > query is only touching {{file_date}}, after all. > The current workaround for this performance problem / optimizer wastefulness, > is to [query the catalog > directly|https://stackoverflow.com/a/65724151/877069]. 
It works, but is a lot > of extra work compared to the elegant query against {{file_date}} that users > actually intend. > Spark should somehow know when it is only querying partition fields and skip > iterating through all the individual files in a table. > Tested on Spark 3.0.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34295) Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude
[ https://issues.apache.org/jira/browse/SPARK-34295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-34295: --- Assignee: L. C. Hsieh > Allow option similar to mapreduce.job.hdfs-servers.token-renewal.exclude > > > Key: SPARK-34295 > URL: https://issues.apache.org/jira/browse/SPARK-34295 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > MapReduce jobs can instruct YARN to skip renewal of tokens obtained from > certain hosts by specifying the hosts with configuration > mapreduce.job.hdfs-servers.token-renewal.exclude=,,..,. > But seems Spark lacks of similar option. So the job submission fails if YARN > fails to renew DelegationToken for any of the remote HDFS cluster. The > failure in DT renewal can happen due to many reason like Remote HDFS does not > trust Kerberos identity of YARN etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files
[ https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276869#comment-17276869 ] Nicholas Chammas commented on SPARK-34194: -- Interesting reference, [~attilapiros]. It looks like that config was [deprecated in Spark 3.0|https://github.com/apache/spark/blob/bec80d7eec91ee83fcbb0e022b33bd526c80f423/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L918-L929] due to the correctness issue mentioned in that warning and documented in SPARK-26709. > Queries that only touch partition columns shouldn't scan through all files > -- > > Key: SPARK-34194 > URL: https://issues.apache.org/jira/browse/SPARK-34194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > When querying only the partition columns of a partitioned table, it seems > that Spark nonetheless scans through all files in the table, even though it > doesn't need to. > Here's an example: > {code:python} > >>> data = spark.read.option('mergeSchema', > >>> 'false').parquet('s3a://some/dataset') > [Stage 0:==> (407 + 12) / > 1158] > {code} > Note the 1158 tasks. This matches the number of partitions in the table, > which is partitioned on a single field named {{file_date}}: > {code:sh} > $ aws s3 ls s3://some/dataset | head -n 3 >PRE file_date=2017-05-01/ >PRE file_date=2017-05-02/ >PRE file_date=2017-05-03/ > $ aws s3 ls s3://some/dataset | wc -l > 1158 > {code} > The table itself has over 138K files, though: > {code:sh} > $ aws s3 ls --recursive --human --summarize s3://some/dataset > ... > Total Objects: 138708 >Total Size: 3.7 TiB > {code} > Now let's try to query just the {{file_date}} field and see what Spark does. > {code:python} > >>> data.select('file_date').orderBy('file_date', > >>> ascending=False).limit(1).explain() > == Physical Plan == > TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], > output=[file_date#11]) > +- *(1) ColumnarToRow >+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: > Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: > [], PushedFilters: [], ReadSchema: struct<> > >>> data.select('file_date').orderBy('file_date', > >>> ascending=False).limit(1).show() > [Stage 2:> (179 + 12) / > 41011] > {code} > Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the > job progresses? I'm not sure. > What I do know is that this operation takes a long time (~20 min) running > from my laptop, whereas to list the top-level {{file_date}} partitions via > the AWS CLI take a second or two. > Spark appears to be going through all the files in the table, when it just > needs to list the partitions captured in the S3 "directory" structure. The > query is only touching {{file_date}}, after all. > The current workaround for this performance problem / optimizer wastefulness, > is to [query the catalog > directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot > of extra work compared to the elegant query against {{file_date}} that users > actually intend. > Spark should somehow know when it is only querying partition fields and skip > iterating through all the individual files in a table. > Tested on Spark 3.0.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34321) Fix the guarantee of foreachBatch
[ https://issues.apache.org/jira/browse/SPARK-34321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276865#comment-17276865 ] L. C. Hsieh commented on SPARK-34321: - Err...I made a mistake when reading the document and code. This is invalid. > Fix the guarantee of foreachBatch > - > > Key: SPARK-34321 > URL: https://issues.apache.org/jira/browse/SPARK-34321 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Similar to SPARK-28650, {{foreachBatch}} API document also documents the > guarantee: > The batchId can be used to deduplicate and transactionally write the output > (that is, the provided Dataset) to external systems. The output Dataset is > guaranteed to be exactly the same for the same batchId > But like the reason of fixing the document of {{ForeachWriter}} in > SPARK-28650, it is not hard to break the guarantee by changing the partition > number. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12497) thriftServer does not support semicolon in sql
[ https://issues.apache.org/jira/browse/SPARK-12497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276864#comment-17276864 ] xinzhang edited comment on SPARK-12497 at 2/2/21, 5:44 AM: --- [~kabhwan] Sorry for the mixed up Tests. Please recheck the new test. # It's good with Spark 3.0.0 . (BTW: semicolon is good in beeline # It's still a bug with Spark 2.4.7 . [root@actuatorx-dispatcher-172-25-48-173 spark]# env|grep spark SPARK_HOME=/opt/spark/spark-bin PATH=/root/perl5/bin:/opt/scala/scala-bin//bin:/opt/spark/spark-bin/bin:172.25.52.34:/opt/hive/hive-bin/bin/:172.31.10.86:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/usr/local/swosbf/bin:/usr/local/swosbf/bin/system:/usr/java/jdk/bin:/usr/bin:/usr/sbin:/bin:/sbin:/usr/X11R6/bin:/root/bin PWD=/opt/spark [root@actuatorx-dispatcher-172-25-48-173 spark]# ll total 4 -rw-r--r-- 1 root root 646 Feb 1 17:44 derby.log drwxr-xr-x 5 root root 133 Feb 1 17:44 metastore_db drwxr-xr-x 14 root root 255 Sep 22 13:57 spark-2.3.0-bin-hadoop2.6 drwxr-xr-x 14 1000 1000 240 Feb 2 13:32 spark-2.4.7-bin-hadoop2.6 drwxr-xr-x 14 root root 240 Feb 2 13:26 spark-3.0.0-bin-hadoop2.7 lrwxrwxrwx 1 root root 25 Feb 1 15:42 spark-bin -> spark-2.4.7-bin-hadoop2.6 [root@actuatorx-dispatcher-172-25-48-173 spark]# jps 3348544 RunJar 3354564 Jps 3354234 RunJar 984853 JarLauncher [root@actuatorx-dispatcher-172-25-48-173 spark]# sh spark-bin/sbin/start-thriftserver.sh starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /opt/spark/spark-bin/logs/spark-root-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-actuatorx-dispatcher-172-25-48-173.out [root@actuatorx-dispatcher-172-25-48-173 spark]# jps 3362650 Jps 984853 JarLauncher 3355197 SparkSubmit 3362444 RunJar [root@actuatorx-dispatcher-172-25-48-173 spark]# netstat -anp|grep 3355197 tcp 0 0 172.25.48.173:21120 0.0.0.0:* LISTEN 3355197/java tcp 0 0 0.0.0.0:4040 0.0.0.0:* LISTEN 3355197/java tcp 0 0 172.25.48.173:22219 0.0.0.0:* LISTEN 3355197/java tcp 0 0 0.0.0.0:50031 0.0.0.0:* LISTEN 3355197/java tcp 0 0 172.25.48.173:51797 172.25.48.231:6033 ESTABLISHED 3355197/java tcp 0 0 172.25.48.173:51795 172.25.48.231:6033 ESTABLISHED 3355197/java tcp 0 0 172.25.48.173:51787 172.25.48.231:6033 ESTABLISHED 3355197/java tcp 0 0 172.25.48.173:51789 172.25.48.231:6033 ESTABLISHED 3355197/java unix 3 [ ] STREAM CONNECTED 534110569 3355197/java unix 3 [ ] STREAM CONNECTED 534110568 3355197/java unix 2 [ ] STREAM CONNECTED 534050562 3355197/java unix 2 [ ] STREAM CONNECTED 534110572 3355197/java [root@actuatorx-dispatcher-172-25-48-173 spark]# /opt/spark/spark-bin/bin/beeline -u jdbc:hive2://172.25.48.173:50031/tools -n tools Connecting to jdbc:hive2://172.25.48.173:50031/tools 21/02/02 13:38:57 INFO jdbc.Utils: Supplied authorities: 172.25.48.173:50031 21/02/02 13:38:57 INFO jdbc.Utils: Resolved authority: 172.25.48.173:50031 21/02/02 13:38:57 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://172.25.48.173:50031/tools Connected to: Spark SQL (version 2.4.7) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 1.2.1.spark2 by Apache Hive 0: jdbc:hive2://172.25.48.173:50031/tools> select '\;'; Error: org.apache.spark.sql.catalyst.parser.ParseException: no viable alternative at input 'select ''(line 1, pos 7) == SQL == select '\ ---^^^ (state=,code=0) 0: jdbc:hive2://172.25.48.173:50031/tools> !exit Closing: 0: jdbc:hive2://172.25.48.173:50031/tools 
[root@actuatorx-dispatcher-172-25-48-173 spark]# was (Author: zhangxin0112zx): [~kabhwan] Sorry for the mixed up Tests. Please recheck the new test. # It's good with Spark 3.0.0 . (BTW: semicolon is good in beeline # It's still a bug with Spark 2.4.7 . [root@actuatorx-dispatcher-172-25-48-173 spark]# env|grep spark SPARK_HOME=/opt/spark/spark-bin PATH=/root/perl5/bin:/opt/scala/scala-bin//bin:/opt/spark/spark-bin/bin:172.25.52.34:/opt/hive/hive-bin/bin/:172.31.10.86:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/usr/local/swosbf/bin:/usr/local/swosbf/bin/system:/usr/java/jdk/bin:/usr/bin:/usr/sbin:/bin:/sbin:/usr/X11R6/bin:/root/bin PWD=/opt/spark [root@actuatorx-dispatcher-172-25-48-173 spark]# ll total 4 -rw-r--r-- 1 root root 646 Feb 1 17:44 derby.log drwxr-xr-x 5 root root 133 Feb 1 17:44 metastore_db drwxr-xr-x 14 root root 255 Sep 22 13:57 spark-2.3.0-bin-hadoop2.6 drwxr-xr-x 14 1000 1000 240 Feb 2 13:32 spark-2.4.7-bin-hadoop2.6 drwxr-xr-x 14 root root 240 Feb 2 13:26 spark-3.0.0-bin-hadoop2.7 lrwxrwxrwx 1 root root 25 Feb 1 15:42 spark-bin -> spark-2.4.7-bin-hadoop2.6 [root@actuatorx-dispatcher-172-25-48-173 spark]# jps 3348544 RunJar 3354564 Jps 3354234 RunJar 984853 JarLauncher [root@actuatorx-dispatcher-172-25-48-173 spark]# sh spark-bin/sbin/start-thrif
[jira] [Resolved] (SPARK-34321) Fix the guarantee of foreachBatch
[ https://issues.apache.org/jira/browse/SPARK-34321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-34321. - Resolution: Invalid > Fix the guarantee of foreachBatch > - > > Key: SPARK-34321 > URL: https://issues.apache.org/jira/browse/SPARK-34321 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Similar to SPARK-28650, {{foreachBatch}} API document also documents the > guarantee: > The batchId can be used to deduplicate and transactionally write the output > (that is, the provided Dataset) to external systems. The output Dataset is > guaranteed to be exactly the same for the same batchId > But like the reason of fixing the document of {{ForeachWriter}} in > SPARK-28650, it is not hard to break the guarantee by changing the partition > number. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34321) Fix the guarantee of foreachBatch
L. C. Hsieh created SPARK-34321: --- Summary: Fix the guarantee of foreachBatch Key: SPARK-34321 URL: https://issues.apache.org/jira/browse/SPARK-34321 Project: Spark Issue Type: Documentation Components: Structured Streaming Affects Versions: 3.2.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh Similar to SPARK-28650, {{foreachBatch}} API document also documents the guarantee: The batchId can be used to deduplicate and transactionally write the output (that is, the provided Dataset) to external systems. The output Dataset is guaranteed to be exactly the same for the same batchId But like the reason of fixing the document of {{ForeachWriter}} in SPARK-28650, it is not hard to break the guarantee by changing the partition number. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
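For reference, a minimal illustrative sketch of the batchId-based deduplication pattern the quoted guarantee refers to is shown below (not from the ticket; the toy source and output path are assumptions). The idea is that the sink keys its output on batchId, so a replayed micro-batch overwrites the same location instead of appending duplicates:

{code:scala}
// Minimal foreachBatch sketch (illustrative only).
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("foreach-batch-sketch").getOrCreate()

val events = spark.readStream.format("rate").load()   // toy source for illustration

def writeBatch(batchDf: DataFrame, batchId: Long): Unit = {
  // Idempotent write: re-executing the same batch overwrites the same batch-keyed directory.
  batchDf.write.mode("overwrite").parquet(s"/tmp/foreach-batch-output/batch_id=$batchId")
}

val query = events.writeStream
  .foreachBatch(writeBatch _)
  .start()

query.awaitTermination()
{code}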
[jira] [Commented] (SPARK-12497) thriftServer does not support semicolon in sql
[ https://issues.apache.org/jira/browse/SPARK-12497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276864#comment-17276864 ] xinzhang commented on SPARK-12497: -- [~kabhwan] Sorry for the mixed up Tests. Please recheck the new test. # It's good with Spark 3.0.0 . (BTW: semicolon is good in beeline # It's still a bug with Spark 2.4.7 . [root@actuatorx-dispatcher-172-25-48-173 spark]# env|grep spark SPARK_HOME=/opt/spark/spark-bin PATH=/root/perl5/bin:/opt/scala/scala-bin//bin:/opt/spark/spark-bin/bin:172.25.52.34:/opt/hive/hive-bin/bin/:172.31.10.86:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/usr/local/swosbf/bin:/usr/local/swosbf/bin/system:/usr/java/jdk/bin:/usr/bin:/usr/sbin:/bin:/sbin:/usr/X11R6/bin:/root/bin PWD=/opt/spark [root@actuatorx-dispatcher-172-25-48-173 spark]# ll total 4 -rw-r--r-- 1 root root 646 Feb 1 17:44 derby.log drwxr-xr-x 5 root root 133 Feb 1 17:44 metastore_db drwxr-xr-x 14 root root 255 Sep 22 13:57 spark-2.3.0-bin-hadoop2.6 drwxr-xr-x 14 1000 1000 240 Feb 2 13:32 spark-2.4.7-bin-hadoop2.6 drwxr-xr-x 14 root root 240 Feb 2 13:26 spark-3.0.0-bin-hadoop2.7 lrwxrwxrwx 1 root root 25 Feb 1 15:42 spark-bin -> spark-2.4.7-bin-hadoop2.6 [root@actuatorx-dispatcher-172-25-48-173 spark]# jps 3348544 RunJar 3354564 Jps 3354234 RunJar 984853 JarLauncher [root@actuatorx-dispatcher-172-25-48-173 spark]# sh spark-bin/sbin/start-thriftserver.sh starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /opt/spark/spark-bin/logs/spark-root-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-actuatorx-dispatcher-172-25-48-173.out [root@actuatorx-dispatcher-172-25-48-173 spark]# netstat -anp|grep 3355197 tcp 0 0 172.25.48.173:21120 0.0.0.0:* LISTEN 3355197/java tcp 0 0 0.0.0.0:4040 0.0.0.0:* LISTEN 3355197/java tcp 0 0 172.25.48.173:22219 0.0.0.0:* LISTEN 3355197/java tcp 0 0 0.0.0.0:50031 0.0.0.0:* LISTEN 3355197/java tcp 0 0 172.25.48.173:51797 172.25.48.231:6033 ESTABLISHED 3355197/java tcp 0 0 172.25.48.173:51795 172.25.48.231:6033 ESTABLISHED 3355197/java tcp 0 0 172.25.48.173:51787 172.25.48.231:6033 ESTABLISHED 3355197/java tcp 0 0 172.25.48.173:51789 172.25.48.231:6033 ESTABLISHED 3355197/java unix 3 [ ] STREAM CONNECTED 534110569 3355197/java unix 3 [ ] STREAM CONNECTED 534110568 3355197/java unix 2 [ ] STREAM CONNECTED 534050562 3355197/java unix 2 [ ] STREAM CONNECTED 534110572 3355197/java [root@actuatorx-dispatcher-172-25-48-173 spark]# /opt/spark/spark-bin/bin/beeline -u jdbc:hive2://172.25.48.173:50031/tools -n tools Connecting to jdbc:hive2://172.25.48.173:50031/tools 21/02/02 13:38:57 INFO jdbc.Utils: Supplied authorities: 172.25.48.173:50031 21/02/02 13:38:57 INFO jdbc.Utils: Resolved authority: 172.25.48.173:50031 21/02/02 13:38:57 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://172.25.48.173:50031/tools Connected to: Spark SQL (version 2.4.7) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 1.2.1.spark2 by Apache Hive 0: jdbc:hive2://172.25.48.173:50031/tools> select '\;'; Error: org.apache.spark.sql.catalyst.parser.ParseException: no viable alternative at input 'select ''(line 1, pos 7) == SQL == select '\ ---^^^ (state=,code=0) 0: jdbc:hive2://172.25.48.173:50031/tools> !exit Closing: 0: jdbc:hive2://172.25.48.173:50031/tools [root@actuatorx-dispatcher-172-25-48-173 spark]# > thriftServer does not support semicolon in sql > --- > > Key: SPARK-12497 > URL: 
https://issues.apache.org/jira/browse/SPARK-12497 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: nilonealex >Priority: Major > > 0: jdbc:hive2://192.168.128.130:14005> SELECT ';' from tx_1 limit 1 ; > Error: org.apache.spark.sql.AnalysisException: cannot recognize input near > '' '' '' in select clause; line 1 pos 8 (state=,code=0) > 0: jdbc:hive2://192.168.128.130:14005> > 0: jdbc:hive2://192.168.128.130:14005> select '\;' from tx_1 limit 1 ; > Error: org.apache.spark.sql.AnalysisException: cannot recognize input near > '' '' '' in select clause; line 1 pos 9 (state=,code=0) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-34198: --- Assignee: L. C. Hsieh > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation > HDFSBackedStateStore. Actually it uses in-memory map to store state rows. As > there are more and more streaming applications, some of them requires to use > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management. So it is proven to be good choice for large state usage. But > Spark SS still lacks of a built-in state store for the requirement. > We would like to explore the possibility to add RocksDB-based StateStore into > Spark SS. For the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276863#comment-17276863 ] L. C. Hsieh commented on SPARK-34198: - If you are asking why we would add it as an external module instead of directly into the streaming codebase: one previous concern was that it introduces an extra dependency on RocksDB, so adding it as an external module is meant to relieve that concern. We will add the RocksDB StateStore code as an external module, as the JIRA title describes. Spark SS can already choose which StateStore provider to use through a configurable provider class, so I think there won't be too many tasks involved. > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation > HDFSBackedStateStore. Actually it uses in-memory map to store state rows. As > there are more and more streaming applications, some of them requires to use > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management. So it is proven to be good choice for large state usage. But > Spark SS still lacks of a built-in state store for the requirement. > We would like to explore the possibility to add RocksDB-based StateStore into > Spark SS. For the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
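A rough sketch of the pluggable mechanism referred to above: the StateStore provider is picked by configuration, so an external RocksDB-backed provider could be selected the same way once it exists. The RocksDB class name below is purely hypothetical; only HDFSBackedStateStoreProvider ships with Spark at this point.

{code:scala}
// Sketch: selecting a StateStore provider by class name (the RocksDB class is hypothetical).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("custom-state-store-sketch")
  // Built-in default is
  //   org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider
  .config("spark.sql.streaming.stateStore.providerClass",
          "org.example.streaming.RocksDBStateStoreProvider")  // hypothetical external-module class
  .getOrCreate()
{code}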
[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module
[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276857#comment-17276857 ] Cheng Su commented on SPARK-34198: -- [~viirya] - could you help elaborate what's the benefit of adding as an external module? Also do you mind sharing a list of potential things/sub-tasks need to be done to make it work? Thanks. > Add RocksDB StateStore as external module > - > > Key: SPARK-34198 > URL: https://issues.apache.org/jira/browse/SPARK-34198 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Priority: Major > > Currently Spark SS only has one built-in StateStore implementation > HDFSBackedStateStore. Actually it uses in-memory map to store state rows. As > there are more and more streaming applications, some of them requires to use > large state in stateful operations such as streaming aggregation and join. > Several other major streaming frameworks already use RocksDB for state > management. So it is proven to be good choice for large state usage. But > Spark SS still lacks of a built-in state store for the requirement. > We would like to explore the possibility to add RocksDB-based StateStore into > Spark SS. For the concern about adding RocksDB as a direct dependency, our > plan is to add this StateStore as an external module first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34320) Migrate ALTER TABLE drop columns command to the new resolution framework
Terry Kim created SPARK-34320: - Summary: Migrate ALTER TABLE drop columns command to the new resolution framework Key: SPARK-34320 URL: https://issues.apache.org/jira/browse/SPARK-34320 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Terry Kim Migrate ALTER TABLE drop columns command to the new resolution framework -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34319) Self-join after cogroup applyInPandas fails due to unresolved conflicting attributes
[ https://issues.apache.org/jira/browse/SPARK-34319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276820#comment-17276820 ] Apache Spark commented on SPARK-34319: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/31429 > Self-join after cogroup applyInPandas fails due to unresolved conflicting > attributes > > > Key: SPARK-34319 > URL: https://issues.apache.org/jira/browse/SPARK-34319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: wuyi >Priority: Major > > > {code:java} > df = spark.createDataFrame([(1, 1)], ("column", "value"))row = > df.groupby("ColUmn").cogroup( > df.groupby("COLUMN") > ).applyInPandas(lambda r, l: r + l, "column long, value long") > row.join(row).show() > {code} > {code:java} > Conflicting attributes: column#163321L,value#163322L > ;; > ’Join Inner > :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], > (column#163312L, value#163313L, column#163312L, value#163313L), > [column#163321L, value#163322L] > : :- Project [ColUmn#163312L, column#163312L, value#163313L] > : : +- LogicalRDD [column#163312L, value#163313L], false > : +- Project [COLUMN#163312L, column#163312L, value#163313L] > : +- LogicalRDD [column#163312L, value#163313L], false > +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], > (column#163312L, value#163313L, column#163312L, value#163313L), > [column#163321L, value#163322L] > :- Project [ColUmn#163312L, column#163312L, value#163313L] > : +- LogicalRDD [column#163312L, value#163313L], false > +- Project [COLUMN#163312L, column#163312L, value#163313L] > +- LogicalRDD [column#163312L, value#163313L], false > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34319) Self-join after cogroup applyInPandas fails due to unresolved conflicting attributes
[ https://issues.apache.org/jira/browse/SPARK-34319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276818#comment-17276818 ] Apache Spark commented on SPARK-34319: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/31429 > Self-join after cogroup applyInPandas fails due to unresolved conflicting > attributes > > > Key: SPARK-34319 > URL: https://issues.apache.org/jira/browse/SPARK-34319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: wuyi >Priority: Major > > > {code:java} > df = spark.createDataFrame([(1, 1)], ("column", "value"))row = > df.groupby("ColUmn").cogroup( > df.groupby("COLUMN") > ).applyInPandas(lambda r, l: r + l, "column long, value long") > row.join(row).show() > {code} > {code:java} > Conflicting attributes: column#163321L,value#163322L > ;; > ’Join Inner > :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], > (column#163312L, value#163313L, column#163312L, value#163313L), > [column#163321L, value#163322L] > : :- Project [ColUmn#163312L, column#163312L, value#163313L] > : : +- LogicalRDD [column#163312L, value#163313L], false > : +- Project [COLUMN#163312L, column#163312L, value#163313L] > : +- LogicalRDD [column#163312L, value#163313L], false > +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], > (column#163312L, value#163313L, column#163312L, value#163313L), > [column#163321L, value#163322L] > :- Project [ColUmn#163312L, column#163312L, value#163313L] > : +- LogicalRDD [column#163312L, value#163313L], false > +- Project [COLUMN#163312L, column#163312L, value#163313L] > +- LogicalRDD [column#163312L, value#163313L], false > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34319) Self-join after cogroup applyInPandas fails due to unresolved conflicting attributes
[ https://issues.apache.org/jira/browse/SPARK-34319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34319: Assignee: Apache Spark > Self-join after cogroup applyInPandas fails due to unresolved conflicting > attributes > > > Key: SPARK-34319 > URL: https://issues.apache.org/jira/browse/SPARK-34319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > > {code:java} > df = spark.createDataFrame([(1, 1)], ("column", "value"))row = > df.groupby("ColUmn").cogroup( > df.groupby("COLUMN") > ).applyInPandas(lambda r, l: r + l, "column long, value long") > row.join(row).show() > {code} > {code:java} > Conflicting attributes: column#163321L,value#163322L > ;; > ’Join Inner > :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], > (column#163312L, value#163313L, column#163312L, value#163313L), > [column#163321L, value#163322L] > : :- Project [ColUmn#163312L, column#163312L, value#163313L] > : : +- LogicalRDD [column#163312L, value#163313L], false > : +- Project [COLUMN#163312L, column#163312L, value#163313L] > : +- LogicalRDD [column#163312L, value#163313L], false > +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], > (column#163312L, value#163313L, column#163312L, value#163313L), > [column#163321L, value#163322L] > :- Project [ColUmn#163312L, column#163312L, value#163313L] > : +- LogicalRDD [column#163312L, value#163313L], false > +- Project [COLUMN#163312L, column#163312L, value#163313L] > +- LogicalRDD [column#163312L, value#163313L], false > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34319) Self-join after cogroup applyInPandas fails due to unresolved conflicting attributes
[ https://issues.apache.org/jira/browse/SPARK-34319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34319: Assignee: (was: Apache Spark) > Self-join after cogroup applyInPandas fails due to unresolved conflicting > attributes > > > Key: SPARK-34319 > URL: https://issues.apache.org/jira/browse/SPARK-34319 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: wuyi >Priority: Major > > > {code:java} > df = spark.createDataFrame([(1, 1)], ("column", "value"))row = > df.groupby("ColUmn").cogroup( > df.groupby("COLUMN") > ).applyInPandas(lambda r, l: r + l, "column long, value long") > row.join(row).show() > {code} > {code:java} > Conflicting attributes: column#163321L,value#163322L > ;; > ’Join Inner > :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], > (column#163312L, value#163313L, column#163312L, value#163313L), > [column#163321L, value#163322L] > : :- Project [ColUmn#163312L, column#163312L, value#163313L] > : : +- LogicalRDD [column#163312L, value#163313L], false > : +- Project [COLUMN#163312L, column#163312L, value#163313L] > : +- LogicalRDD [column#163312L, value#163313L], false > +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], > (column#163312L, value#163313L, column#163312L, value#163313L), > [column#163321L, value#163322L] > :- Project [ColUmn#163312L, column#163312L, value#163313L] > : +- LogicalRDD [column#163312L, value#163313L], false > +- Project [COLUMN#163312L, column#163312L, value#163313L] > +- LogicalRDD [column#163312L, value#163313L], false > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34319) Self-join after cogroup applyInPandas fails due to unresolved conflicting attributes
wuyi created SPARK-34319: Summary: Self-join after cogroup applyInPandas fails due to unresolved conflicting attributes Key: SPARK-34319 URL: https://issues.apache.org/jira/browse/SPARK-34319 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 3.0.0, 3.1.0, 3.2.0 Reporter: wuyi {code:java} df = spark.createDataFrame([(1, 1)], ("column", "value"))row = df.groupby("ColUmn").cogroup( df.groupby("COLUMN") ).applyInPandas(lambda r, l: r + l, "column long, value long") row.join(row).show() {code} {code:java} Conflicting attributes: column#163321L,value#163322L ;; ’Join Inner :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], (column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L] : :- Project [ColUmn#163312L, column#163312L, value#163313L] : : +- LogicalRDD [column#163312L, value#163313L], false : +- Project [COLUMN#163312L, column#163312L, value#163313L] : +- LogicalRDD [column#163312L, value#163313L], false +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], (column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L] :- Project [ColUmn#163312L, column#163312L, value#163313L] : +- LogicalRDD [column#163312L, value#163313L], false +- Project [COLUMN#163312L, column#163312L, value#163313L] +- LogicalRDD [column#163312L, value#163313L], false {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34309) Use Caffeine instead of Guava Cache
[ https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276806#comment-17276806 ] Yang Jie commented on SPARK-34309: -- There is already a patch that replaces Guava Cache in all the relevant places. However, when using a RemovalListener, the timing behavior seems inconsistent: data was deleted 3~5 ms later than expected :( > Use Caffeine instead of Guava Cache > --- > > Key: SPARK-34309 > URL: https://issues.apache.org/jira/browse/SPARK-34309 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > Caffeine is a high performance, near optimal caching library based on Java 8, > it is used in a similar way to guava cache, but with better performance. The > comparison results are as follow are on the [caffeine benchmarks > |https://github.com/ben-manes/caffeine/wiki/Benchmarks] > At the same time, caffeine has been used in some open source projects like > Cassandra, Hbase, Neo4j, Druid, Spring and so on. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
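To make the RemovalListener timing point concrete, here is a small hedged sketch (not from the patch) of Caffeine usage with expiration and a removal listener, assuming the com.github.ben-manes.caffeine:caffeine dependency. Caffeine performs expiration maintenance lazily, piggybacking on other cache operations or an explicit cleanUp() call, which could account for entries being removed a few milliseconds after the nominal expiry.

{code:scala}
// Hedged sketch of Caffeine with a removal listener; timings and key/value types are illustrative.
import java.util.concurrent.TimeUnit
import com.github.benmanes.caffeine.cache.{Cache, Caffeine, RemovalCause, RemovalListener}

val cache: Cache[String, String] = Caffeine.newBuilder()
  .expireAfterWrite(100, TimeUnit.MILLISECONDS)
  .removalListener(new RemovalListener[String, String] {
    override def onRemoval(key: String, value: String, cause: RemovalCause): Unit =
      println(s"removed key=$key cause=$cause at ${System.nanoTime()} ns")
  })
  .build[String, String]()

cache.put("k", "v")
Thread.sleep(150)
// Without an explicit cleanUp(), expired entries are only reclaimed when the cache is
// next touched, so the listener can fire noticeably after the 100 ms expiry.
cache.cleanUp()
{code}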
[jira] [Assigned] (SPARK-34316) Optional Propagation of SPARK_CONF_DIR in K8s
[ https://issues.apache.org/jira/browse/SPARK-34316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34316: Assignee: (was: Apache Spark) > Optional Propagation of SPARK_CONF_DIR in K8s > - > > Key: SPARK-34316 > URL: https://issues.apache.org/jira/browse/SPARK-34316 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Zhou JIANG >Priority: Major > > In shared Kubernetes clusters, Spark could be restricted from creating and > deleting config maps in job namespaces. > It would be helpful if the current mandatory config map creation could be > optional. User may still take responsibility of handing Spark conf files > separately. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34316) Optional Propagation of SPARK_CONF_DIR in K8s
[ https://issues.apache.org/jira/browse/SPARK-34316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34316: Assignee: Apache Spark > Optional Propagation of SPARK_CONF_DIR in K8s > - > > Key: SPARK-34316 > URL: https://issues.apache.org/jira/browse/SPARK-34316 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Zhou JIANG >Assignee: Apache Spark >Priority: Major > > In shared Kubernetes clusters, Spark could be restricted from creating and > deleting config maps in job namespaces. > It would be helpful if the current mandatory config map creation could be > optional. User may still take responsibility of handing Spark conf files > separately. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34316) Optional Propagation of SPARK_CONF_DIR in K8s
[ https://issues.apache.org/jira/browse/SPARK-34316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276802#comment-17276802 ] Apache Spark commented on SPARK-34316: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/31428 > Optional Propagation of SPARK_CONF_DIR in K8s > - > > Key: SPARK-34316 > URL: https://issues.apache.org/jira/browse/SPARK-34316 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Zhou JIANG >Priority: Major > > In shared Kubernetes clusters, Spark could be restricted from creating and > deleting config maps in job namespaces. > It would be helpful if the current mandatory config map creation could be > optional. User may still take responsibility of handing Spark conf files > separately. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34316) Optional Propagation of SPARK_CONF_DIR in K8s
[ https://issues.apache.org/jira/browse/SPARK-34316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34316: -- Affects Version/s: (was: 3.0.1) 3.2.0 > Optional Propagation of SPARK_CONF_DIR in K8s > - > > Key: SPARK-34316 > URL: https://issues.apache.org/jira/browse/SPARK-34316 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Zhou JIANG >Priority: Major > > In shared Kubernetes clusters, Spark could be restricted from creating and > deleting config maps in job namespaces. > It would be helpful if the current mandatory config map creation could be > optional. User may still take responsibility of handing Spark conf files > separately. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34316) Optional Propagation of SPARK_CONF_DIR in K8s
[ https://issues.apache.org/jira/browse/SPARK-34316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276782#comment-17276782 ] Dongjoon Hyun commented on SPARK-34316: --- Thank you for filing a Jira issue, [~zhou_jiang]. > Optional Propagation of SPARK_CONF_DIR in K8s > - > > Key: SPARK-34316 > URL: https://issues.apache.org/jira/browse/SPARK-34316 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Zhou JIANG >Priority: Major > > In shared Kubernetes clusters, Spark could be restricted from creating and > deleting config maps in job namespaces. > It would be helpful if the current mandatory config map creation could be > optional. User may still take responsibility of handing Spark conf files > separately. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34309) Use Caffeine instead of Guava Cache
[ https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276766#comment-17276766 ] Dongjoon Hyun commented on SPARK-34309: --- Thank you for pinging me, [~LuciferYang]. The benchmark seems to be written in 2015. We have multiple places of Guava cache usage. Which part are you testing? > Use Caffeine instead of Guava Cache > --- > > Key: SPARK-34309 > URL: https://issues.apache.org/jira/browse/SPARK-34309 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > Caffeine is a high performance, near optimal caching library based on Java 8, > it is used in a similar way to guava cache, but with better performance. The > comparison results are as follow are on the [caffeine benchmarks > |https://github.com/ben-manes/caffeine/wiki/Benchmarks] > At the same time, caffeine has been used in some open source projects like > Cassandra, Hbase, Neo4j, Druid, Spring and so on. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34300) Fix of typos in documentation of pyspark.sql.functions and output of lint-python
[ https://issues.apache.org/jira/browse/SPARK-34300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-34300: Assignee: David Toneian > Fix of typos in documentation of pyspark.sql.functions and output of > lint-python > > > Key: SPARK-34300 > URL: https://issues.apache.org/jira/browse/SPARK-34300 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.1 >Reporter: David Toneian >Assignee: David Toneian >Priority: Trivial > > Minor documentation and standard output issues: > * {{dev/lint-python}} contains a typo when printing a warning regarding bad > Sphinx version ("lower then 3.1" rather than "lower than 3.1") > * The documentations of the functions {{lag}} and {{lead}} of > {{pyspark.sql.functions}} refer to a parameter {{defaultValue}}, which in > reality is named {{default}}. > * The documentation strings of functions in {{pyspark.sql.functions}} make > reference to the {{Column}} class, which is not resolved by Sphinx unless > fully qualified as {{pyspark.sql.Column}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34300) Fix of typos in documentation of pyspark.sql.functions and output of lint-python
[ https://issues.apache.org/jira/browse/SPARK-34300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34300. -- Fix Version/s: 3.1.2 Resolution: Fixed Issue resolved by pull request 31401 [https://github.com/apache/spark/pull/31401] > Fix of typos in documentation of pyspark.sql.functions and output of > lint-python > > > Key: SPARK-34300 > URL: https://issues.apache.org/jira/browse/SPARK-34300 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.1 >Reporter: David Toneian >Assignee: David Toneian >Priority: Trivial > Fix For: 3.1.2 > > > Minor documentation and standard output issues: > * {{dev/lint-python}} contains a typo when printing a warning regarding bad > Sphinx version ("lower then 3.1" rather than "lower than 3.1") > * The documentations of the functions {{lag}} and {{lead}} of > {{pyspark.sql.functions}} refer to a parameter {{defaultValue}}, which in > reality is named {{default}}. > * The documentation strings of functions in {{pyspark.sql.functions}} make > reference to the {{Column}} class, which is not resolved by Sphinx unless > fully qualified as {{pyspark.sql.Column}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34306) Use Snake naming rule across the function APIs
[ https://issues.apache.org/jira/browse/SPARK-34306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-34306: Assignee: Hyukjin Kwon > Use Snake naming rule across the function APIs > -- > > Key: SPARK-34306 > URL: https://issues.apache.org/jira/browse/SPARK-34306 > Project: Spark > Issue Type: Documentation > Components: PySpark, SparkR, SQL >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > There are some functions that were missed in SPARK-10621. > This JIRA targets renaming everything under the functions APIs to follow the Snake > naming rule. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34306) Use Snake naming rule across the function APIs
[ https://issues.apache.org/jira/browse/SPARK-34306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34306. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31408 [https://github.com/apache/spark/pull/31408] > Use Snake naming rule across the function APIs > -- > > Key: SPARK-34306 > URL: https://issues.apache.org/jira/browse/SPARK-34306 > Project: Spark > Issue Type: Documentation > Components: PySpark, SparkR, SQL >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.2.0 > > > There are some functions that were missed in SPARK-10621. > This JIRA targets renaming everything under the functions APIs to follow the Snake > naming rule. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34310) Replaces map and flatten with flatMap
[ https://issues.apache.org/jira/browse/SPARK-34310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34310: - Fix Version/s: 3.1.2 3.0.2 > Replaces map and flatten with flatMap > - > > Key: SPARK-34310 > URL: https://issues.apache.org/jira/browse/SPARK-34310 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Trivial > Fix For: 3.0.2, 3.2.0, 3.1.2 > > > Replaces collection.map(f1).flatten(f2) with collection.flatMap if possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34310) Replaces map and flatten with flatMap
[ https://issues.apache.org/jira/browse/SPARK-34310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34310: - Fix Version/s: 2.4.8 > Replaces map and flatten with flatMap > - > > Key: SPARK-34310 > URL: https://issues.apache.org/jira/browse/SPARK-34310 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Trivial > Fix For: 2.4.8, 3.0.2, 3.2.0, 3.1.2 > > > Replaces collection.map(f1).flatten(f2) with collection.flatMap if possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
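As a small illustration of the rewrite this ticket describes (the collection and the splitting function are invented for the example; the actual change touches assorted call sites in Spark Core and SQL), the two forms below are equivalent, but {{flatMap}} avoids building the intermediate nested collection:

{code:scala}
val lines = Seq("a b", "c d e")

// Before: map followed by flatten materializes an intermediate Seq[Seq[String]].
val viaMapFlatten = lines.map(_.split(" ").toSeq).flatten

// After: flatMap produces the same result in a single pass.
val viaFlatMap = lines.flatMap(_.split(" ").toSeq)

assert(viaMapFlatten == viaFlatMap) // both yield Seq("a", "b", "c", "d", "e")
{code}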
[jira] [Commented] (SPARK-34209) Allow multiple namespaces with session catalog
[ https://issues.apache.org/jira/browse/SPARK-34209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276728#comment-17276728 ] Apache Spark commented on SPARK-34209: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/31427 > Allow multiple namespaces with session catalog > -- > > Key: SPARK-34209 > URL: https://issues.apache.org/jira/browse/SPARK-34209 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: Holden Karau >Priority: Trivial > > SPARK-30885 removed the ability for tables in session catalogs being queried > with SQL to have multiple namespaces. This seems to have been added as a > follow up, not as part of the core change. We should explore if this > restriction can be relaxed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34209) Allow multiple namespaces with session catalog
[ https://issues.apache.org/jira/browse/SPARK-34209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276727#comment-17276727 ] Apache Spark commented on SPARK-34209: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/31427 > Allow multiple namespaces with session catalog > -- > > Key: SPARK-34209 > URL: https://issues.apache.org/jira/browse/SPARK-34209 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: Holden Karau >Priority: Trivial > > SPARK-30885 removed the ability for tables in session catalogs being queried > with SQL to have multiple namespaces. This seems to have been added as a > follow up, not as part of the core change. We should explore if this > restriction can be relaxed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34209) Allow multiple namespaces with session catalog
[ https://issues.apache.org/jira/browse/SPARK-34209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34209: Assignee: Apache Spark > Allow multiple namespaces with session catalog > -- > > Key: SPARK-34209 > URL: https://issues.apache.org/jira/browse/SPARK-34209 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Trivial > > SPARK-30885 removed the ability for tables in session catalogs being queried > with SQL to have multiple namespaces. This seems to have been added as a > follow up, not as part of the core change. We should explore if this > restriction can be relaxed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34209) Allow multiple namespaces with session catalog
[ https://issues.apache.org/jira/browse/SPARK-34209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34209: Assignee: (was: Apache Spark) > Allow multiple namespaces with session catalog > -- > > Key: SPARK-34209 > URL: https://issues.apache.org/jira/browse/SPARK-34209 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: Holden Karau >Priority: Trivial > > SPARK-30885 removed the ability for tables in session catalogs being queried > with SQL to have multiple namespaces. This seems to have been added as a > follow up, not as part of the core change. We should explore if this > restriction can be relaxed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26325) Interpret timestamp fields in Spark while reading json (timestampFormat)
[ https://issues.apache.org/jira/browse/SPARK-26325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276711#comment-17276711 ] Daniel Himmelstein edited comment on SPARK-26325 at 2/1/21, 10:53 PM: -- Here's the code from the original post, but using an RDD rather than JSON file and applying [~maxgekk]'s suggestion to "try Z instead of 'Z'": {code:python} line = '{"time_field" : "2017-09-30 04:53:39.412496Z"}' rdd = spark.sparkContext.parallelize([line]) ( spark.read .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSSZ") .json(path=rdd) ){code} The output I get with pyspark 3.0.1 is `DataFrame[time_field: string]`. So it looks like the issue remains. I'd be interested if there are any examples where Spark infers a date or timestamp from a JSON string or whether dateFormat and timestampFormat do not work at all? was (Author: dhimmel): Here's the code from the original post, but using an RDD rather than JSON file and applying [~maxgekk]'s suggestion to "try Z instead of 'Z'": {code:python} line = '{"time_field" : "2017-09-30 04:53:39.412496Z"}' rdd = spark.sparkContext.parallelize([line]) ( spark.read .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSSZ") .json(path=rdd) ){code} The output I get with pyspark 3.0.1 is `DataFrame[time_field: string]`. So it looks like the issue remains. I'd be interested if there are any examples where Spark infers a timestamp from a JSON string or whether timestampFormat does not work at all? > Interpret timestamp fields in Spark while reading json (timestampFormat) > > > Key: SPARK-26325 > URL: https://issues.apache.org/jira/browse/SPARK-26325 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Veenit Shah >Priority: Major > > I am trying to read a pretty printed json which has time fields in it. I want > to interpret the timestamps columns as timestamp fields while reading the > json itself. However, it's still reading them as string when I {{printSchema}} > E.g. Input json file - > {code:java} > [{ > "time_field" : "2017-09-30 04:53:39.412496Z" > }] > {code} > Code - > {code:java} > df = spark.read.option("multiLine", > "true").option("timestampFormat","yyyy-MM-dd > HH:mm:ss.SSSSSS'Z'").json('path_to_json_file') > {code} > Output of df.printSchema() - > {code:java} > root > |-- time_field: string (nullable = true) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26325) Interpret timestamp fields in Spark while reading json (timestampFormat)
[ https://issues.apache.org/jira/browse/SPARK-26325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276711#comment-17276711 ] Daniel Himmelstein commented on SPARK-26325: Here's the code from the original post, but using an RDD rather than JSON file and applying [~maxgekk]'s suggestion to "try Z instead of 'Z'": {code:python} line = '{"time_field" : "2017-09-30 04:53:39.412496Z"}' rdd = spark.sparkContext.parallelize([line]) ( spark.read .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSSZ") .json(path=rdd) ){code} The output I get with pyspark 3.0.1 is `DataFrame[time_field: string]`. So it looks like the issue remains. I'd be interested if there are any examples where Spark infers a timestamp from a JSON string or whether timestampFormat does not work at all? > Interpret timestamp fields in Spark while reading json (timestampFormat) > > > Key: SPARK-26325 > URL: https://issues.apache.org/jira/browse/SPARK-26325 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Veenit Shah >Priority: Major > > I am trying to read a pretty printed json which has time fields in it. I want > to interpret the timestamps columns as timestamp fields while reading the > json itself. However, it's still reading them as string when I {{printSchema}} > E.g. Input json file - > {code:java} > [{ > "time_field" : "2017-09-30 04:53:39.412496Z" > }] > {code} > Code - > {code:java} > df = spark.read.option("multiLine", > "true").option("timestampFormat","yyyy-MM-dd > HH:mm:ss.SSSSSS'Z'").json('path_to_json_file') > {code} > Output of df.printSchema() - > {code:java} > root > |-- time_field: string (nullable = true) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34318) Dataset.colRegex should work with column names and qualifiers which contain newlines
[ https://issues.apache.org/jira/browse/SPARK-34318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34318: Assignee: Kousuke Saruta (was: Apache Spark) > Dataset.colRegex should work with column names and qualifiers which contain > newlines > > > Key: SPARK-34318 > URL: https://issues.apache.org/jira/browse/SPARK-34318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > In the current master, Dataset.colRegex doesn't work with column names or > qualifiers which contain newlines. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34318) Dataset.colRegex should work with column names and qualifiers which contain newlines
[ https://issues.apache.org/jira/browse/SPARK-34318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276708#comment-17276708 ] Apache Spark commented on SPARK-34318: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/31426 > Dataset.colRegex should work with column names and qualifiers which contain > newlines > > > Key: SPARK-34318 > URL: https://issues.apache.org/jira/browse/SPARK-34318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > In the current master, Dataset.colRegex doesn't work with column names or > qualifiers which contain newlines. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34318) Dataset.colRegex should work with column names and qualifiers which contain newlines
[ https://issues.apache.org/jira/browse/SPARK-34318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34318: Assignee: Apache Spark (was: Kousuke Saruta) > Dataset.colRegex should work with column names and qualifiers which contain > newlines > > > Key: SPARK-34318 > URL: https://issues.apache.org/jira/browse/SPARK-34318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > In the current master, Dataset.colRegex doesn't work with column names or > qualifiers which contain newlines. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34318) Dataset.colRegex should work with column names and qualifiers which contain newlines
Kousuke Saruta created SPARK-34318: -- Summary: Dataset.colRegex should work with column names and qualifiers which contain newlines Key: SPARK-34318 URL: https://issues.apache.org/jira/browse/SPARK-34318 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta In the current master, Dataset.colRegex doesn't work with column names or qualifiers which contain newlines. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
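A hypothetical repro sketch of the reported behaviour, assuming an active SparkSession named {{spark}} (the column names are invented; per the ticket, resolving a backtick-quoted name that contains a newline through {{colRegex}} fails in the current master, while plain {{col}} handles it):

{code:scala}
import spark.implicits._

// A DataFrame with a column whose name contains a newline character.
val df = Seq((1, 2)).toDF("with\nnewline", "plain")

// Selecting the column by name works.
df.select(df.col("with\nnewline")).show()

// Selecting it through colRegex with a backtick-quoted name is the case this
// ticket covers: it should resolve the same column, newline included.
df.select(df.colRegex("`with\nnewline`")).show()
{code}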
[jira] [Commented] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos
[ https://issues.apache.org/jira/browse/SPARK-34315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276682#comment-17276682 ] Apache Spark commented on SPARK-34315: -- User 'timhughes' has created a pull request for this issue: https://github.com/apache/spark/pull/31425 > docker-image-tool.sh debconf trying to configure kerberos > - > > Key: SPARK-34315 > URL: https://issues.apache.org/jira/browse/SPARK-34315 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.1 >Reporter: Tim Hughes >Priority: Critical > Attachments: full-logs.txt > > > When building the docker containers using the docker-image-tool.sh there is > RUN `apt install -y bash tini libc6 libpam-modules krb5-user libnss3` Which > leads to `debconf` trying to configure kerberos. I have tried putting > nothing, EXAMPLE.COM, my corporate kerberos realm and none of them work. it > just hangs after enter is pressed > > {{Setting up krb5-config (2.6) ...}} > {{debconf: unable to initialize frontend: Dialog}} > {{debconf: (TERM is not set, so the dialog frontend is not usable.)}} > {{debconf: falling back to frontend: Readline}} > {{debconf: unable to initialize frontend: Readline}} > {{debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install > the Term::ReadLine module) (@INC contains: /etc/perl > /usr/local/lib/x86_64-linux-gnu/perl/5.28.1 /usr/local/share/perl/5.28.1 > /usr/lib/x86_64-linux-gnu/perl5/5.28 /usr/share/perl5 > /usr/lib/x86_64-linux-gnu/perl/5.28 /usr/share/perl/5.28 > /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at > /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)}} > {{debconf: falling back to frontend: Teletype}} > {{Configuring Kerberos Authentication}} > {{---}} > {{When users attempt to use Kerberos and specify a principal or user name > without}} > {{specifying what administrative Kerberos realm that principal belongs to, > the}} > {{system appends the default realm. The default realm may also be used as > the}} > {{realm of a Kerberos service running on the local machine. Often, the > default}} > {{realm is the uppercase version of the local DNS domain.}} > {{Default Kerberos version 5 realm: EXAMPLE.ORG}} > {{^CFailed to build Spark JVM Docker image, please refer to Docker build > output for details.}} > > > {{## Steps to reproduce}} > {{```}} > {{wget -qO- > https://www.mirrorservice.org/sites/ftp.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz > | tar -xzf -}} > {{cd spark-3.0.1-bin-hadoop3.2/}} > {{./bin/docker-image-tool.sh build}} > {{```}} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos
[ https://issues.apache.org/jira/browse/SPARK-34315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34315: Assignee: Apache Spark > docker-image-tool.sh debconf trying to configure kerberos > - > > Key: SPARK-34315 > URL: https://issues.apache.org/jira/browse/SPARK-34315 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.1 >Reporter: Tim Hughes >Assignee: Apache Spark >Priority: Critical > Attachments: full-logs.txt > > > When building the docker containers using the docker-image-tool.sh there is > RUN `apt install -y bash tini libc6 libpam-modules krb5-user libnss3` Which > leads to `debconf` trying to configure kerberos. I have tried putting > nothing, EXAMPLE.COM, my corporate kerberos realm and none of them work. it > just hangs after enter is pressed > > {{Setting up krb5-config (2.6) ...}} > {{debconf: unable to initialize frontend: Dialog}} > {{debconf: (TERM is not set, so the dialog frontend is not usable.)}} > {{debconf: falling back to frontend: Readline}} > {{debconf: unable to initialize frontend: Readline}} > {{debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install > the Term::ReadLine module) (@INC contains: /etc/perl > /usr/local/lib/x86_64-linux-gnu/perl/5.28.1 /usr/local/share/perl/5.28.1 > /usr/lib/x86_64-linux-gnu/perl5/5.28 /usr/share/perl5 > /usr/lib/x86_64-linux-gnu/perl/5.28 /usr/share/perl/5.28 > /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at > /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)}} > {{debconf: falling back to frontend: Teletype}} > {{Configuring Kerberos Authentication}} > {{---}} > {{When users attempt to use Kerberos and specify a principal or user name > without}} > {{specifying what administrative Kerberos realm that principal belongs to, > the}} > {{system appends the default realm. The default realm may also be used as > the}} > {{realm of a Kerberos service running on the local machine. Often, the > default}} > {{realm is the uppercase version of the local DNS domain.}} > {{Default Kerberos version 5 realm: EXAMPLE.ORG}} > {{^CFailed to build Spark JVM Docker image, please refer to Docker build > output for details.}} > > > {{## Steps to reproduce}} > {{```}} > {{wget -qO- > https://www.mirrorservice.org/sites/ftp.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz > | tar -xzf -}} > {{cd spark-3.0.1-bin-hadoop3.2/}} > {{./bin/docker-image-tool.sh build}} > {{```}} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos
[ https://issues.apache.org/jira/browse/SPARK-34315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34315: Assignee: (was: Apache Spark) > docker-image-tool.sh debconf trying to configure kerberos > - > > Key: SPARK-34315 > URL: https://issues.apache.org/jira/browse/SPARK-34315 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.1 >Reporter: Tim Hughes >Priority: Critical > Attachments: full-logs.txt > > > When building the docker containers using the docker-image-tool.sh there is > RUN `apt install -y bash tini libc6 libpam-modules krb5-user libnss3` Which > leads to `debconf` trying to configure kerberos. I have tried putting > nothing, EXAMPLE.COM, my corporate kerberos realm and none of them work. it > just hangs after enter is pressed > > {{Setting up krb5-config (2.6) ...}} > {{debconf: unable to initialize frontend: Dialog}} > {{debconf: (TERM is not set, so the dialog frontend is not usable.)}} > {{debconf: falling back to frontend: Readline}} > {{debconf: unable to initialize frontend: Readline}} > {{debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install > the Term::ReadLine module) (@INC contains: /etc/perl > /usr/local/lib/x86_64-linux-gnu/perl/5.28.1 /usr/local/share/perl/5.28.1 > /usr/lib/x86_64-linux-gnu/perl5/5.28 /usr/share/perl5 > /usr/lib/x86_64-linux-gnu/perl/5.28 /usr/share/perl/5.28 > /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at > /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)}} > {{debconf: falling back to frontend: Teletype}} > {{Configuring Kerberos Authentication}} > {{---}} > {{When users attempt to use Kerberos and specify a principal or user name > without}} > {{specifying what administrative Kerberos realm that principal belongs to, > the}} > {{system appends the default realm. The default realm may also be used as > the}} > {{realm of a Kerberos service running on the local machine. Often, the > default}} > {{realm is the uppercase version of the local DNS domain.}} > {{Default Kerberos version 5 realm: EXAMPLE.ORG}} > {{^CFailed to build Spark JVM Docker image, please refer to Docker build > output for details.}} > > > {{## Steps to reproduce}} > {{```}} > {{wget -qO- > https://www.mirrorservice.org/sites/ftp.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz > | tar -xzf -}} > {{cd spark-3.0.1-bin-hadoop3.2/}} > {{./bin/docker-image-tool.sh build}} > {{```}} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos
[ https://issues.apache.org/jira/browse/SPARK-34315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276681#comment-17276681 ] Tim Hughes commented on SPARK-34315: Created a pull request https://github.com/apache/spark/pull/31425 > docker-image-tool.sh debconf trying to configure kerberos > - > > Key: SPARK-34315 > URL: https://issues.apache.org/jira/browse/SPARK-34315 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.1 >Reporter: Tim Hughes >Priority: Critical > Attachments: full-logs.txt > > > When building the docker containers using the docker-image-tool.sh there is > RUN `apt install -y bash tini libc6 libpam-modules krb5-user libnss3` Which > leads to `debconf` trying to configure kerberos. I have tried putting > nothing, EXAMPLE.COM, my corporate kerberos realm and none of them work. it > just hangs after enter is pressed > > {{Setting up krb5-config (2.6) ...}} > {{debconf: unable to initialize frontend: Dialog}} > {{debconf: (TERM is not set, so the dialog frontend is not usable.)}} > {{debconf: falling back to frontend: Readline}} > {{debconf: unable to initialize frontend: Readline}} > {{debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install > the Term::ReadLine module) (@INC contains: /etc/perl > /usr/local/lib/x86_64-linux-gnu/perl/5.28.1 /usr/local/share/perl/5.28.1 > /usr/lib/x86_64-linux-gnu/perl5/5.28 /usr/share/perl5 > /usr/lib/x86_64-linux-gnu/perl/5.28 /usr/share/perl/5.28 > /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at > /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)}} > {{debconf: falling back to frontend: Teletype}} > {{Configuring Kerberos Authentication}} > {{---}} > {{When users attempt to use Kerberos and specify a principal or user name > without}} > {{specifying what administrative Kerberos realm that principal belongs to, > the}} > {{system appends the default realm. The default realm may also be used as > the}} > {{realm of a Kerberos service running on the local machine. Often, the > default}} > {{realm is the uppercase version of the local DNS domain.}} > {{Default Kerberos version 5 realm: EXAMPLE.ORG}} > {{^CFailed to build Spark JVM Docker image, please refer to Docker build > output for details.}} > > > {{## Steps to reproduce}} > {{```}} > {{wget -qO- > https://www.mirrorservice.org/sites/ftp.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz > | tar -xzf -}} > {{cd spark-3.0.1-bin-hadoop3.2/}} > {{./bin/docker-image-tool.sh build}} > {{```}} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos
[ https://issues.apache.org/jira/browse/SPARK-34315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276680#comment-17276680 ] Tim Hughes commented on SPARK-34315: [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile#L34] Prefixing the line that installs krb5-user with `DEBIAN_FRONTEND=noninteractive` allows the container to be built. > docker-image-tool.sh debconf trying to configure kerberos > - > > Key: SPARK-34315 > URL: https://issues.apache.org/jira/browse/SPARK-34315 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.1 >Reporter: Tim Hughes >Priority: Critical > Attachments: full-logs.txt > > > When building the docker containers using the docker-image-tool.sh there is > RUN `apt install -y bash tini libc6 libpam-modules krb5-user libnss3` Which > leads to `debconf` trying to configure kerberos. I have tried putting > nothing, EXAMPLE.COM, my corporate kerberos realm and none of them work. it > just hangs after enter is pressed > > {{Setting up krb5-config (2.6) ...}} > {{debconf: unable to initialize frontend: Dialog}} > {{debconf: (TERM is not set, so the dialog frontend is not usable.)}} > {{debconf: falling back to frontend: Readline}} > {{debconf: unable to initialize frontend: Readline}} > {{debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install > the Term::ReadLine module) (@INC contains: /etc/perl > /usr/local/lib/x86_64-linux-gnu/perl/5.28.1 /usr/local/share/perl/5.28.1 > /usr/lib/x86_64-linux-gnu/perl5/5.28 /usr/share/perl5 > /usr/lib/x86_64-linux-gnu/perl/5.28 /usr/share/perl/5.28 > /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at > /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)}} > {{debconf: falling back to frontend: Teletype}} > {{Configuring Kerberos Authentication}} > {{---}} > {{When users attempt to use Kerberos and specify a principal or user name > without}} > {{specifying what administrative Kerberos realm that principal belongs to, > the}} > {{system appends the default realm. The default realm may also be used as > the}} > {{realm of a Kerberos service running on the local machine. Often, the > default}} > {{realm is the uppercase version of the local DNS domain.}} > {{Default Kerberos version 5 realm: EXAMPLE.ORG}} > {{^CFailed to build Spark JVM Docker image, please refer to Docker build > output for details.}} > > > {{## Steps to reproduce}} > {{```}} > {{wget -qO- > https://www.mirrorservice.org/sites/ftp.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz > | tar -xzf -}} > {{cd spark-3.0.1-bin-hadoop3.2/}} > {{./bin/docker-image-tool.sh build}} > {{```}} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34317) Introduce relationTypeMismatchHint to UnresolvedTable for a better error message
[ https://issues.apache.org/jira/browse/SPARK-34317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276655#comment-17276655 ] Apache Spark commented on SPARK-34317: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/31424 > Introduce relationTypeMismatchHint to UnresolvedTable for a better error > message > > > Key: SPARK-34317 > URL: https://issues.apache.org/jira/browse/SPARK-34317 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Terry Kim >Priority: Major > > The relationTypeMismatchHint in UnresolvedTable can be used to give a hint if > the resolved relation is a view. For example, for "ALTER TABLE t ...", if "t" > is resolved a view, the error message will also contain a hint, "Please use > ALTER VIEW instead." -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34317) Introduce relationTypeMismatchHint to UnresolvedTable for a better error message
[ https://issues.apache.org/jira/browse/SPARK-34317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34317: Assignee: (was: Apache Spark) > Introduce relationTypeMismatchHint to UnresolvedTable for a better error > message > > > Key: SPARK-34317 > URL: https://issues.apache.org/jira/browse/SPARK-34317 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Terry Kim >Priority: Major > > The relationTypeMismatchHint in UnresolvedTable can be used to give a hint if > the resolved relation is a view. For example, for "ALTER TABLE t ...", if "t" > is resolved a view, the error message will also contain a hint, "Please use > ALTER VIEW instead." -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34317) Introduce relationTypeMismatchHint to UnresolvedTable for a better error message
[ https://issues.apache.org/jira/browse/SPARK-34317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34317: Assignee: Apache Spark > Introduce relationTypeMismatchHint to UnresolvedTable for a better error > message > > > Key: SPARK-34317 > URL: https://issues.apache.org/jira/browse/SPARK-34317 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Major > > The relationTypeMismatchHint in UnresolvedTable can be used to give a hint if > the resolved relation is a view. For example, for "ALTER TABLE t ...", if "t" > is resolved a view, the error message will also contain a hint, "Please use > ALTER VIEW instead." -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34317) Introduce relationTypeMismatchHint to UnresolvedTable for a better error message
[ https://issues.apache.org/jira/browse/SPARK-34317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276654#comment-17276654 ] Apache Spark commented on SPARK-34317: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/31424 > Introduce relationTypeMismatchHint to UnresolvedTable for a better error > message > > > Key: SPARK-34317 > URL: https://issues.apache.org/jira/browse/SPARK-34317 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Terry Kim >Priority: Major > > The relationTypeMismatchHint in UnresolvedTable can be used to give a hint if > the resolved relation is a view. For example, for "ALTER TABLE t ...", if "t" > is resolved a view, the error message will also contain a hint, "Please use > ALTER VIEW instead." -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos
[ https://issues.apache.org/jira/browse/SPARK-34315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Hughes updated SPARK-34315: --- Environment: (was: # ## Full logs {{}}{{}}) > docker-image-tool.sh debconf trying to configure kerberos > - > > Key: SPARK-34315 > URL: https://issues.apache.org/jira/browse/SPARK-34315 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.1 >Reporter: Tim Hughes >Priority: Critical > Attachments: full-logs.txt > > > When building the docker containers using the docker-image-tool.sh there is > RUN `apt install -y bash tini libc6 libpam-modules krb5-user libnss3` Which > leads to `debconf` trying to configure kerberos. I have tried putting > nothing, EXAMPLE.COM, my corporate kerberos realm and none of them work. it > just hangs after enter is pressed > > {{Setting up krb5-config (2.6) ...}} > {{debconf: unable to initialize frontend: Dialog}} > {{debconf: (TERM is not set, so the dialog frontend is not usable.)}} > {{debconf: falling back to frontend: Readline}} > {{debconf: unable to initialize frontend: Readline}} > {{debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install > the Term::ReadLine module) (@INC contains: /etc/perl > /usr/local/lib/x86_64-linux-gnu/perl/5.28.1 /usr/local/share/perl/5.28.1 > /usr/lib/x86_64-linux-gnu/perl5/5.28 /usr/share/perl5 > /usr/lib/x86_64-linux-gnu/perl/5.28 /usr/share/perl/5.28 > /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at > /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)}} > {{debconf: falling back to frontend: Teletype}} > {{Configuring Kerberos Authentication}} > {{---}} > {{When users attempt to use Kerberos and specify a principal or user name > without}} > {{specifying what administrative Kerberos realm that principal belongs to, > the}} > {{system appends the default realm. The default realm may also be used as > the}} > {{realm of a Kerberos service running on the local machine. Often, the > default}} > {{realm is the uppercase version of the local DNS domain.}} > {{Default Kerberos version 5 realm: EXAMPLE.ORG}} > {{^CFailed to build Spark JVM Docker image, please refer to Docker build > output for details.}} > > > {{## Steps to reproduce}} > {{```}} > {{wget -qO- > https://www.mirrorservice.org/sites/ftp.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz > | tar -xzf -}} > {{cd spark-3.0.1-bin-hadoop3.2/}} > {{./bin/docker-image-tool.sh build}} > {{```}} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34317) Introduce relationTypeMismatchHint to UnresolvedTable for a better error message
Terry Kim created SPARK-34317: - Summary: Introduce relationTypeMismatchHint to UnresolvedTable for a better error message Key: SPARK-34317 URL: https://issues.apache.org/jira/browse/SPARK-34317 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Terry Kim The relationTypeMismatchHint in UnresolvedTable can be used to give a hint if the resolved relation is a view. For example, for "ALTER TABLE t ...", if "t" is resolved to a view, the error message will also contain a hint, "Please use ALTER VIEW instead." -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34316) Optional Propagation of SPARK_CONF_DIR in K8s
Zhou JIANG created SPARK-34316: -- Summary: Optional Propagation of SPARK_CONF_DIR in K8s Key: SPARK-34316 URL: https://issues.apache.org/jira/browse/SPARK-34316 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 3.0.1 Reporter: Zhou JIANG In shared Kubernetes clusters, Spark could be restricted from creating and deleting config maps in job namespaces. It would be helpful if the current mandatory config map creation could be optional. Users may still take responsibility for handling Spark conf files separately. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos
[ https://issues.apache.org/jira/browse/SPARK-34315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Hughes updated SPARK-34315: --- Attachment: full-logs.txt > docker-image-tool.sh debconf trying to configure kerberos > - > > Key: SPARK-34315 > URL: https://issues.apache.org/jira/browse/SPARK-34315 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.1 > Environment: ## Full logs > > {{}}{{$ bin/docker-image-tool.sh build}} > {{Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet > msg.}} > {{STEP 1: FROM openjdk:8-jre-slim}} > {{STEP 2: ARG spark_uid=185}} > {{--> Using cache > d24913e4f80a167a2682380bc0565b0eefac2e7e5b94f1491b99712e1154dd3b}} > {{--> d24913e4f80}} > {{STEP 3: RUN set -ex && sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' > /etc/apt/sources.list && apt-get update && ln -s /lib /lib64 && apt install > -y bash tini libc6 libpam-modules krb5-user libnss3 && mkdir -p /opt/spark && > mkdir -p /opt/spark/examples && mkdir -p /opt/spark/work-dir && touch > /opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && echo "auth > required pam_wheel.so use_uid" >> /etc/pam.d/su && chgrp root /etc/passwd && > chmod ug+rw /etc/passwd && rm -rf /var/cache/apt/*}} > {{+ sed -i s/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g /etc/apt/sources.list}} > {{+ apt-get update}} > {{Get:1 http://security.debian.org/debian-security buster/updates InRelease > [65.4 kB]}} > {{Get:2 https://deb.debian.org/debian buster InRelease [121 kB] }} > {{Get:3 https://deb.debian.org/debian buster-updates InRelease [51.9 kB]}} > {{Get:4 http://security.debian.org/debian-security buster/updates/main amd64 > Packages [271 kB]}} > {{Get:5 https://deb.debian.org/debian buster/main amd64 Packages [7907 kB]}} > {{Get:6 https://deb.debian.org/debian buster-updates/main amd64 Packages > [7848 B]}} > {{Fetched 8426 kB in 4s (1995 kB/s) }} > {{Reading package lists... Done}} > {{+ ln -s /lib /lib64}} > {{+ apt install -y bash tini libc6 libpam-modules krb5-user libnss3}} > {{Reading package lists... Done}} > {{Building dependency tree }} > {{Reading state information... 
Done}} > {{bash is already the newest version (5.0-4).}} > {{bash set to manually installed.}} > {{libc6 is already the newest version (2.28-10).}} > {{libc6 set to manually installed.}} > {{libpam-modules is already the newest version (1.3.1-5).}} > {{libpam-modules set to manually installed.}} > {{The following package was automatically installed and is no longer > required:}} > {{ lsb-base}} > {{Use 'apt autoremove' to remove it.}} > {{The following additional packages will be installed:}} > {{ bind9-host geoip-database krb5-config krb5-locales libbind9-161 libcap2}} > {{ libdns1104 libfstrm0 libgeoip1 libgssapi-krb5-2 libgssrpc4 libicu63}} > {{ libisc1100 libisccc161 libisccfg163 libjson-c3 libk5crypto3}} > {{ libkadm5clnt-mit11 libkadm5srv-mit11 libkdb5-9 libkeyutils1 libkrb5-3}} > {{ libkrb5support0 liblmdb0 liblwres161 libnspr4 libprotobuf-c1 libsqlite3-0}} > {{ libxml2}} > {{Suggested packages:}} > {{ krb5-k5tls geoip-bin krb5-doc}} > {{The following NEW packages will be installed:}} > {{ bind9-host geoip-database krb5-config krb5-locales krb5-user libbind9-161}} > {{ libcap2 libdns1104 libfstrm0 libgeoip1 libgssapi-krb5-2 libgssrpc4 > libicu63}} > {{ libisc1100 libisccc161 libisccfg163 libjson-c3 libk5crypto3}} > {{ libkadm5clnt-mit11 libkadm5srv-mit11 libkdb5-9 libkeyutils1 libkrb5-3}} > {{ libkrb5support0 liblmdb0 liblwres161 libnspr4 libnss3 libprotobuf-c1}} > {{ libsqlite3-0 libxml2 tini}} > {{0 upgraded, 32 newly installed, 0 to remove and 2 not upgraded.}} > {{Need to get 18.1 MB of archives.}} > {{After this operation, 61.3 MB of additional disk space will be used.}} > {{Get:1 https://deb.debian.org/debian buster/main amd64 libcap2 amd64 > 1:2.25-2 [17.6 kB]}} > {{Get:2 https://deb.debian.org/debian buster/main amd64 libfstrm0 amd64 > 0.4.0-1 [20.8 kB]}} > {{Get:3 https://deb.debian.org/debian buster/main amd64 libgeoip1 amd64 > 1.6.12-1 [93.1 kB]}} > {{Get:4 https://deb.debian.org/debian buster/main amd64 libjson-c3 amd64 > 0.12.1+ds-2+deb10u1 [27.3 kB]}} > {{Get:5 https://deb.debian.org/debian buster/main amd64 liblmdb0 amd64 > 0.9.22-1 [45.0 kB]}} > {{Get:6 https://deb.debian.org/debian buster/main amd64 libprotobuf-c1 amd64 > 1.3.1-1+b1 [26.5 kB]}} > {{Get:7 https://deb.debian.org/debian buster/main amd64 libicu63 amd64 > 63.1-6+deb10u1 [8300 kB]}} > {{Get:8 https://deb.debian.org/debian buster/main amd64 libxml2 amd64 > 2.9.4+dfsg1-7+deb10u1 [689 kB]}} > {{Get:9 https://deb.debian.org/debian buster/main amd64 libisc1100 amd64 > 1:9.11.5.P4+dfsg-5.1+deb10u2 [458 kB]}} > {{Get:10 https://deb.debian.org/debian buster/main amd64 libkeyutils1 amd64 > 1.6-6 [15.0 kB]}} > {{Get:11 https://
[jira] [Updated] (SPARK-34315) docker-image-tool.sh debconf trying to configure kerberos
[ https://issues.apache.org/jira/browse/SPARK-34315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Hughes updated SPARK-34315: --- Environment: # ## Full logs {{}}{{}} was: ## Full logs {{}}{{$ bin/docker-image-tool.sh build}} {{Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.}} {{STEP 1: FROM openjdk:8-jre-slim}} {{STEP 2: ARG spark_uid=185}} {{--> Using cache d24913e4f80a167a2682380bc0565b0eefac2e7e5b94f1491b99712e1154dd3b}} {{--> d24913e4f80}} {{STEP 3: RUN set -ex && sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && apt-get update && ln -s /lib /lib64 && apt install -y bash tini libc6 libpam-modules krb5-user libnss3 && mkdir -p /opt/spark && mkdir -p /opt/spark/examples && mkdir -p /opt/spark/work-dir && touch /opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && chgrp root /etc/passwd && chmod ug+rw /etc/passwd && rm -rf /var/cache/apt/*}} {{+ sed -i s/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g /etc/apt/sources.list}} {{+ apt-get update}} {{Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]}} {{Get:2 https://deb.debian.org/debian buster InRelease [121 kB] }} {{Get:3 https://deb.debian.org/debian buster-updates InRelease [51.9 kB]}} {{Get:4 http://security.debian.org/debian-security buster/updates/main amd64 Packages [271 kB]}} {{Get:5 https://deb.debian.org/debian buster/main amd64 Packages [7907 kB]}} {{Get:6 https://deb.debian.org/debian buster-updates/main amd64 Packages [7848 B]}} {{Fetched 8426 kB in 4s (1995 kB/s) }} {{Reading package lists... Done}} {{+ ln -s /lib /lib64}} {{+ apt install -y bash tini libc6 libpam-modules krb5-user libnss3}} {{Reading package lists... Done}} {{Building dependency tree }} {{Reading state information... 
Done}} {{bash is already the newest version (5.0-4).}} {{bash set to manually installed.}} {{libc6 is already the newest version (2.28-10).}} {{libc6 set to manually installed.}} {{libpam-modules is already the newest version (1.3.1-5).}} {{libpam-modules set to manually installed.}} {{The following package was automatically installed and is no longer required:}} {{ lsb-base}} {{Use 'apt autoremove' to remove it.}} {{The following additional packages will be installed:}} {{ bind9-host geoip-database krb5-config krb5-locales libbind9-161 libcap2}} {{ libdns1104 libfstrm0 libgeoip1 libgssapi-krb5-2 libgssrpc4 libicu63}} {{ libisc1100 libisccc161 libisccfg163 libjson-c3 libk5crypto3}} {{ libkadm5clnt-mit11 libkadm5srv-mit11 libkdb5-9 libkeyutils1 libkrb5-3}} {{ libkrb5support0 liblmdb0 liblwres161 libnspr4 libprotobuf-c1 libsqlite3-0}} {{ libxml2}} {{Suggested packages:}} {{ krb5-k5tls geoip-bin krb5-doc}} {{The following NEW packages will be installed:}} {{ bind9-host geoip-database krb5-config krb5-locales krb5-user libbind9-161}} {{ libcap2 libdns1104 libfstrm0 libgeoip1 libgssapi-krb5-2 libgssrpc4 libicu63}} {{ libisc1100 libisccc161 libisccfg163 libjson-c3 libk5crypto3}} {{ libkadm5clnt-mit11 libkadm5srv-mit11 libkdb5-9 libkeyutils1 libkrb5-3}} {{ libkrb5support0 liblmdb0 liblwres161 libnspr4 libnss3 libprotobuf-c1}} {{ libsqlite3-0 libxml2 tini}} {{0 upgraded, 32 newly installed, 0 to remove and 2 not upgraded.}} {{Need to get 18.1 MB of archives.}} {{After this operation, 61.3 MB of additional disk space will be used.}} {{Get:1 https://deb.debian.org/debian buster/main amd64 libcap2 amd64 1:2.25-2 [17.6 kB]}} {{Get:2 https://deb.debian.org/debian buster/main amd64 libfstrm0 amd64 0.4.0-1 [20.8 kB]}} {{Get:3 https://deb.debian.org/debian buster/main amd64 libgeoip1 amd64 1.6.12-1 [93.1 kB]}} {{Get:4 https://deb.debian.org/debian buster/main amd64 libjson-c3 amd64 0.12.1+ds-2+deb10u1 [27.3 kB]}} {{Get:5 https://deb.debian.org/debian buster/main amd64 liblmdb0 amd64 0.9.22-1 [45.0 kB]}} {{Get:6 https://deb.debian.org/debian buster/main amd64 libprotobuf-c1 amd64 1.3.1-1+b1 [26.5 kB]}} {{Get:7 https://deb.debian.org/debian buster/main amd64 libicu63 amd64 63.1-6+deb10u1 [8300 kB]}} {{Get:8 https://deb.debian.org/debian buster/main amd64 libxml2 amd64 2.9.4+dfsg1-7+deb10u1 [689 kB]}} {{Get:9 https://deb.debian.org/debian buster/main amd64 libisc1100 amd64 1:9.11.5.P4+dfsg-5.1+deb10u2 [458 kB]}} {{Get:10 https://deb.debian.org/debian buster/main amd64 libkeyutils1 amd64 1.6-6 [15.0 kB]}} {{Get:11 https://deb.debian.org/debian buster/main amd64 libkrb5support0 amd64 1.17-3+deb10u1 [65.8 kB]}} {{Get:12 https://deb.debian.org/debian buster/main amd64 libk5crypto3 amd64 1.17-3+deb10u1 [122 kB]}} {{Get:13 https://deb.debian.org/debian buster/main amd64 libkrb5-3 amd64 1.17-3+deb10u1 [369 kB]}} {{Get:14 https://deb.debian.org/debian buster/main amd64 libgssapi-krb5-2 amd64 1.17-3+deb10u1 [158 kB]}} {{Get:15 https://deb.debian.org/debian buster/main amd64 libdns1104 amd64 1:9.11.5.P4+dfsg-5.1+deb10u2 [1223 kB]}} {