[jira] [Resolved] (SPARK-29298) Separate block manager heartbeat endpoint from driver endpoint
[ https://issues.apache.org/jira/browse/SPARK-29298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29298. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25971 [https://github.com/apache/spark/pull/25971] > Separate block manager heartbeat endpoint from driver endpoint > -- > > Key: SPARK-29298 > URL: https://issues.apache.org/jira/browse/SPARK-29298 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Major > Fix For: 3.0.0 > > > The executor's heartbeat is sent synchronously to BlockManagerMaster to let it > know that the block manager is still alive. In a heavily loaded cluster, it can > time out and cause the block manager to re-register unexpectedly. > This improvement separates the heartbeat endpoint from the driver endpoint. > In our production environment, this is really helpful to prevent executors > from flapping up and down. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29298) Separate block manager heartbeat endpoint from driver endpoint
[ https://issues.apache.org/jira/browse/SPARK-29298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29298: --- Assignee: Lantao Jin > Separate block manager heartbeat endpoint from driver endpoint > -- > > Key: SPARK-29298 > URL: https://issues.apache.org/jira/browse/SPARK-29298 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Major > > > The executor's heartbeat is sent synchronously to BlockManagerMaster to let it > know that the block manager is still alive. In a heavily loaded cluster, it can > time out and cause the block manager to re-register unexpectedly. > This improvement separates the heartbeat endpoint from the driver endpoint. > In our production environment, this is really helpful to prevent executors > from flapping up and down. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
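To make the idea of a dedicated heartbeat endpoint concrete, here is a minimal, hypothetical sketch in the spirit of the description above. It is not the actual change from pull request 25971: the endpoint name and message type are invented, and it relies on Spark's internal RpcEndpoint API (private[spark]), so it would only compile inside the Spark source tree.

{code:scala}
import org.apache.spark.rpc.{RpcCallContext, RpcEndpoint, RpcEnv}

// Hypothetical message and endpoint for illustration only: block manager heartbeats
// get their own endpoint, so they no longer queue behind other driver RPC traffic.
case class BlockManagerHeartbeat(blockManagerId: String)

class BlockManagerHeartbeatEndpoint(override val rpcEnv: RpcEnv) extends RpcEndpoint {
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case BlockManagerHeartbeat(id) =>
      // In real code this would refresh the block manager's liveness bookkeeping.
      context.reply(true)
  }
}
{code}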
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972184#comment-16972184 ] zhao bo commented on SPARK-29106: - Regarding the several comments I left above, we hope [~shaneknapp] could give us some advice. We plan to bring the new worker on board soon, but it still needs some testing, so we would like to keep the existing worker for a while and integrate the new VM in parallel. Here is a summary of the current VM: CPU: 8 cores, Memory: 16G, location: Beijing, China (this could affect network performance since the VM is in China, but we have improved the network performance by configuring an internal mirror). In our tests the same maven test run takes 5h33min and the build takes 43min, so I think it is good enough to serve as a test worker in Jenkins. ;) Thanks > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt, arm-python36.txt > > > Add arm test jobs to amplab jenkins for spark. > So far we have set up two periodic ARM test jobs for spark in OpenLab: one is > based on master with hadoop 2.7 (similar to the QA test on amplab jenkins), the > other is based on a new branch which we created on 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrating the arm test with amplab > jenkins. > About the k8s test on arm, we have tested it, see > [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it > later. > We also plan to test other stable branches, and we can integrate them into > amplab when they are ready. > We have provided an arm instance and sent the info to shane knapp, thanks > shane for adding the first arm job to amplab jenkins :) > The other important thing is about leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > which has no arm64 support. So we built an arm64-supporting > release of leveldbjni, see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't simply modify the spark pom.xml with something like a > 'property'/'profile' to choose the correct jar package on an arm or x86 platform, > because spark depends on some hadoop packages like hadoop-hdfs, and those packages > depend on leveldbjni-all-1.8 too, unless hadoop releases a new leveldbjni jar with arm > support. For now we download the leveldbjni-all-1.8 from > openlabtesting and 'mvn install' it when testing spark on arm. 
> PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972223#comment-16972223 ] Hyukjin Kwon commented on SPARK-29764: -- Priority doesn't matter for getting help; it is just related to release processes. Issues that are difficult to read or don't look easy to reproduce often don't get much attention. If I were you, I would try to make a minimised reproducer without unrelated information, rather than just copying and pasting the whole code. > Error on Serializing POJO with java datetime property to a Parquet file > --- > > Key: SPARK-29764 > URL: https://issues.apache.org/jira/browse/SPARK-29764 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Core, SQL >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > Attachments: SparkParquetSampleCode.docx > > > Hello, > I have been doing a proof of concept for a data lake structure and analytics > using Apache Spark. > When I add java.time.LocalDateTime/java.time.LocalDate properties to > my data model, serialization to Parquet starts failing. > *My Data Model:* > @Data > public class Employee > { private UUID id = UUID.randomUUID(); private String name; private int age; > private LocalDate dob; private LocalDateTime startDateTime; private String > phone; private Address address; } > > *Serialization Snippet* > public void serialize() > { List<Employee> inputDataToSerialize = > getInputDataToSerialize(); // this creates 100,000 employee objects > Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class); > Dataset<Employee> employeeDataset = sparkSession.createDataset( > inputDataToSerialize, employeeEncoder ); employeeDataset.write() > .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); > } > *Exception Stack Trace:* > > java.lang.IllegalStateException: Failed to execute > CommandLineRunner at > org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803) > at > org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784) > at > org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) > at com.felix.Application.main(Application.java:45)Caused by: > org.apache.spark.SparkException: Job aborted. 
at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) at > com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24) at > com.felix.Application.run(Application.java:63) at
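In the spirit of the minimised reproducer requested above, a sketch of what such a reduction could look like is below. It is an assumption-laden illustration, not code from the reporter: the class, paths, and session setup are invented, and on Spark 2.4.x the bean encoder is expected to fail on the java.time field, which is exactly the behaviour being reported.

{code:scala}
import java.time.LocalDate
import scala.beans.BeanProperty
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical minimal bean: only the field suspected of breaking the encoder.
class DobOnly {
  @BeanProperty var dob: LocalDate = _
}

object MinimalRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-29764-repro").getOrCreate()
    val bean = new DobOnly
    bean.setDob(LocalDate.now())
    // Expected to fail on 2.4.x because the bean encoder does not handle java.time types.
    val ds = spark.createDataset(java.util.Arrays.asList(bean))(Encoders.bean(classOf[DobOnly]))
    ds.write.mode("overwrite").parquet("/tmp/spark-29764-repro.parquet")
    spark.stop()
  }
}
{code}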
[jira] [Commented] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972224#comment-16972224 ] Enzo Bonnal commented on SPARK-29856: - Just a note: if I am not wrong_, findBestSplits_ may leverage the caching if _nodeIdCache.nonEmpty._ Have you took this into account ? > Conditional unnecessary persist on RDDs in ML algorithms > > > Key: SPARK-29856 > URL: https://issues.apache.org/jira/browse/SPARK-29856 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 3.0.0 >Reporter: Dong Wang >Priority: Major > > When I run example.ml.GradientBoostedTreeRegressorExample, I find that RDD > _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ is > persisted, but it only used once. So this persist operation is unnecessary. > {code:scala} > val baggedInput = BaggedPoint > .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, > withReplacement, > (tp: TreePoint) => tp.weight, seed = seed) > .persist(StorageLevel.MEMORY_AND_DISK) > ... >while (nodeStack.nonEmpty) { > ... > timer.start("findBestSplits") > RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, > nodesForGroup, > treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache) > timer.stop("findBestSplits") > } > baggedInput.unpersist() > {code} > However, the action on {color:#DE350B}_baggedInput_{color} is in a while > loop. > In GradientBoostedTreeRegressorExample, this loop only executes once, so only > one action uses {color:#DE350B}_baggedInput_{color}. > In most of ML applications, the loop will executes for many times, which > means {color:#DE350B}_baggedInput_{color} will be used in many actions. So > the persist is necessary now. > That's the point why the persist operation is "conditional" unnecessary. > Same situations exist in many other ML algorithms, e.g., RDD > {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit(), RDD > {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run(). > This issue is reported by our tool CacheCheck, which is used to dynamically > detecting persist()/unpersist() api misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972224#comment-16972224 ] Enzo Bonnal edited comment on SPARK-29856 at 11/12/19 9:39 AM: --- Just a note: if I am not wrong, _findBestSplits_ may leverage the caching if _nodeIdCache.nonEmpty._ Have you took this into account ? was (Author: enzobnl): Just a note: if I am not wrong_, findBestSplits_ may leverage the caching if _nodeIdCache.nonEmpty._ Have you took this into account ? > Conditional unnecessary persist on RDDs in ML algorithms > > > Key: SPARK-29856 > URL: https://issues.apache.org/jira/browse/SPARK-29856 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 3.0.0 >Reporter: Dong Wang >Priority: Major > > When I run example.ml.GradientBoostedTreeRegressorExample, I find that RDD > _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ is > persisted, but it only used once. So this persist operation is unnecessary. > {code:scala} > val baggedInput = BaggedPoint > .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, > withReplacement, > (tp: TreePoint) => tp.weight, seed = seed) > .persist(StorageLevel.MEMORY_AND_DISK) > ... >while (nodeStack.nonEmpty) { > ... > timer.start("findBestSplits") > RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, > nodesForGroup, > treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache) > timer.stop("findBestSplits") > } > baggedInput.unpersist() > {code} > However, the action on {color:#DE350B}_baggedInput_{color} is in a while > loop. > In GradientBoostedTreeRegressorExample, this loop only executes once, so only > one action uses {color:#DE350B}_baggedInput_{color}. > In most of ML applications, the loop will executes for many times, which > means {color:#DE350B}_baggedInput_{color} will be used in many actions. So > the persist is necessary now. > That's the point why the persist operation is "conditional" unnecessary. > Same situations exist in many other ML algorithms, e.g., RDD > {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit(), RDD > {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run(). > This issue is reported by our tool CacheCheck, which is used to dynamically > detecting persist()/unpersist() api misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972224#comment-16972224 ] EnzoBnl edited comment on SPARK-29856 at 11/12/19 9:42 AM: --- Just a note: If I am not wrong, _findBestSplits_ may leverage the caching if _nodeIdCache.nonEmpty._ Have you taken this into account ? was (Author: enzobnl): Just a note: if I am not wrong, _findBestSplits_ may leverage the caching if _nodeIdCache.nonEmpty._ Have you taken this into account ? > Conditional unnecessary persist on RDDs in ML algorithms > > > Key: SPARK-29856 > URL: https://issues.apache.org/jira/browse/SPARK-29856 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 3.0.0 >Reporter: Dong Wang >Priority: Major > > When I run example.ml.GradientBoostedTreeRegressorExample, I find that RDD > _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ is > persisted, but it only used once. So this persist operation is unnecessary. > {code:scala} > val baggedInput = BaggedPoint > .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, > withReplacement, > (tp: TreePoint) => tp.weight, seed = seed) > .persist(StorageLevel.MEMORY_AND_DISK) > ... >while (nodeStack.nonEmpty) { > ... > timer.start("findBestSplits") > RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, > nodesForGroup, > treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache) > timer.stop("findBestSplits") > } > baggedInput.unpersist() > {code} > However, the action on {color:#DE350B}_baggedInput_{color} is in a while > loop. > In GradientBoostedTreeRegressorExample, this loop only executes once, so only > one action uses {color:#DE350B}_baggedInput_{color}. > In most of ML applications, the loop will executes for many times, which > means {color:#DE350B}_baggedInput_{color} will be used in many actions. So > the persist is necessary now. > That's the point why the persist operation is "conditional" unnecessary. > Same situations exist in many other ML algorithms, e.g., RDD > {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit(), RDD > {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run(). > This issue is reported by our tool CacheCheck, which is used to dynamically > detecting persist()/unpersist() api misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972224#comment-16972224 ] EnzoBnl edited comment on SPARK-29856 at 11/12/19 9:41 AM: --- Just a note: if I am not wrong, _findBestSplits_ may leverage the caching if _nodeIdCache.nonEmpty._ Have you taken this into account ? was (Author: enzobnl): Just a note: if I am not wrong, _findBestSplits_ may leverage the caching if _nodeIdCache.nonEmpty._ Have you took this into account ? > Conditional unnecessary persist on RDDs in ML algorithms > > > Key: SPARK-29856 > URL: https://issues.apache.org/jira/browse/SPARK-29856 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 3.0.0 >Reporter: Dong Wang >Priority: Major > > When I run example.ml.GradientBoostedTreeRegressorExample, I find that RDD > _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ is > persisted, but it only used once. So this persist operation is unnecessary. > {code:scala} > val baggedInput = BaggedPoint > .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, > withReplacement, > (tp: TreePoint) => tp.weight, seed = seed) > .persist(StorageLevel.MEMORY_AND_DISK) > ... >while (nodeStack.nonEmpty) { > ... > timer.start("findBestSplits") > RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, > nodesForGroup, > treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache) > timer.stop("findBestSplits") > } > baggedInput.unpersist() > {code} > However, the action on {color:#DE350B}_baggedInput_{color} is in a while > loop. > In GradientBoostedTreeRegressorExample, this loop only executes once, so only > one action uses {color:#DE350B}_baggedInput_{color}. > In most of ML applications, the loop will executes for many times, which > means {color:#DE350B}_baggedInput_{color} will be used in many actions. So > the persist is necessary now. > That's the point why the persist operation is "conditional" unnecessary. > Same situations exist in many other ML algorithms, e.g., RDD > {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit(), RDD > {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run(). > This issue is reported by our tool CacheCheck, which is used to dynamically > detecting persist()/unpersist() api misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972224#comment-16972224 ] EnzoBnl edited comment on SPARK-29856 at 11/12/19 9:47 AM: --- If I am not wrong, _findBestSplits_ may leverage the caching if _nodeIdCache.nonEmpty._ Have you taken this into account ? was (Author: enzobnl): Just a note: If I am not wrong, _findBestSplits_ may leverage the caching if _nodeIdCache.nonEmpty._ Have you taken this into account ? > Conditional unnecessary persist on RDDs in ML algorithms > > > Key: SPARK-29856 > URL: https://issues.apache.org/jira/browse/SPARK-29856 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 3.0.0 >Reporter: Dong Wang >Priority: Major > > When I run example.ml.GradientBoostedTreeRegressorExample, I find that RDD > _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ is > persisted, but it only used once. So this persist operation is unnecessary. > {code:scala} > val baggedInput = BaggedPoint > .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, > withReplacement, > (tp: TreePoint) => tp.weight, seed = seed) > .persist(StorageLevel.MEMORY_AND_DISK) > ... >while (nodeStack.nonEmpty) { > ... > timer.start("findBestSplits") > RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, > nodesForGroup, > treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache) > timer.stop("findBestSplits") > } > baggedInput.unpersist() > {code} > However, the action on {color:#DE350B}_baggedInput_{color} is in a while > loop. > In GradientBoostedTreeRegressorExample, this loop only executes once, so only > one action uses {color:#DE350B}_baggedInput_{color}. > In most of ML applications, the loop will executes for many times, which > means {color:#DE350B}_baggedInput_{color} will be used in many actions. So > the persist is necessary now. > That's the point why the persist operation is "conditional" unnecessary. > Same situations exist in many other ML algorithms, e.g., RDD > {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit(), RDD > {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run(). > This issue is reported by our tool CacheCheck, which is used to dynamically > detecting persist()/unpersist() api misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29855) typed literals with negative sign with proper result or exception
[ https://issues.apache.org/jira/browse/SPARK-29855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972292#comment-16972292 ] Hyukjin Kwon commented on SPARK-29855: -- [~Qin Yao] Please link related JIRAs for the initial literal supports. > typed literals with negative sign with proper result or exception > - > > Key: SPARK-29855 > URL: https://issues.apache.org/jira/browse/SPARK-29855 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Major > > {code:java} > -- !query 83 > select -integer '7' > -- !query 83 schema > struct<7:int> > -- !query 83 output > 7 > -- !query 86 > select -date '1999-01-01' > -- !query 86 schema > struct > -- !query 86 output > 1999-01-01 > -- !query 87 > select -timestamp '1999-01-01' > -- !query 87 schema > struct > -- !query 87 output > 1999-01-01 00:00:00 > {code} > the integer should be -7 and the date and timestamp results are confusing > which should throw exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972294#comment-16972294 ] Dong Wang commented on SPARK-29856: --- But there is only one action _collectAsMap()_ using _baggedInput_. If the loop only execute once, _baggedInput_ is only used once. > Conditional unnecessary persist on RDDs in ML algorithms > > > Key: SPARK-29856 > URL: https://issues.apache.org/jira/browse/SPARK-29856 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 3.0.0 >Reporter: Dong Wang >Priority: Major > > When I run example.ml.GradientBoostedTreeRegressorExample, I find that RDD > _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ is > persisted, but it only used once. So this persist operation is unnecessary. > {code:scala} > val baggedInput = BaggedPoint > .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, > withReplacement, > (tp: TreePoint) => tp.weight, seed = seed) > .persist(StorageLevel.MEMORY_AND_DISK) > ... >while (nodeStack.nonEmpty) { > ... > timer.start("findBestSplits") > RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, > nodesForGroup, > treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache) > timer.stop("findBestSplits") > } > baggedInput.unpersist() > {code} > However, the action on {color:#DE350B}_baggedInput_{color} is in a while > loop. > In GradientBoostedTreeRegressorExample, this loop only executes once, so only > one action uses {color:#DE350B}_baggedInput_{color}. > In most of ML applications, the loop will executes for many times, which > means {color:#DE350B}_baggedInput_{color} will be used in many actions. So > the persist is necessary now. > That's the point why the persist operation is "conditional" unnecessary. > Same situations exist in many other ML algorithms, e.g., RDD > {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit(), RDD > {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run(). > This issue is reported by our tool CacheCheck, which is used to dynamically > detecting persist()/unpersist() api misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972294#comment-16972294 ] Dong Wang edited comment on SPARK-29856 at 11/12/19 11:32 AM: -- But there is only one action _collectAsMap()_ using _baggedInput_ in _findBestSplits_. If the loop only execute once, _baggedInput_ is only used once. was (Author: spark_cachecheck): But there is only one action _collectAsMap()_ using _baggedInput_. If the loop only execute once, _baggedInput_ is only used once. > Conditional unnecessary persist on RDDs in ML algorithms > > > Key: SPARK-29856 > URL: https://issues.apache.org/jira/browse/SPARK-29856 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 3.0.0 >Reporter: Dong Wang >Priority: Major > > When I run example.ml.GradientBoostedTreeRegressorExample, I find that RDD > _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ is > persisted, but it only used once. So this persist operation is unnecessary. > {code:scala} > val baggedInput = BaggedPoint > .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, > withReplacement, > (tp: TreePoint) => tp.weight, seed = seed) > .persist(StorageLevel.MEMORY_AND_DISK) > ... >while (nodeStack.nonEmpty) { > ... > timer.start("findBestSplits") > RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, > nodesForGroup, > treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache) > timer.stop("findBestSplits") > } > baggedInput.unpersist() > {code} > However, the action on {color:#DE350B}_baggedInput_{color} is in a while > loop. > In GradientBoostedTreeRegressorExample, this loop only executes once, so only > one action uses {color:#DE350B}_baggedInput_{color}. > In most of ML applications, the loop will executes for many times, which > means {color:#DE350B}_baggedInput_{color} will be used in many actions. So > the persist is necessary now. > That's the point why the persist operation is "conditional" unnecessary. > Same situations exist in many other ML algorithms, e.g., RDD > {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit(), RDD > {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run(). > This issue is reported by our tool CacheCheck, which is used to dynamically > detecting persist()/unpersist() api misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
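To make the "conditional" point in this thread concrete, one possible shape of a guard is sketched below. This is not a patch from the discussion, just an illustration under the assumption that the caller knows (or can estimate) whether more than one pass over the data will happen; the helper name and parameters are invented.

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: persist only when the RDD will actually be reused,
// and unpersist only if we were the ones who persisted it.
def withOptionalCache[T, R](rdd: RDD[T], expectedUses: Int)(body: RDD[T] => R): R = {
  val shouldCache = expectedUses > 1 && rdd.getStorageLevel == StorageLevel.NONE
  if (shouldCache) rdd.persist(StorageLevel.MEMORY_AND_DISK)
  try body(rdd) finally if (shouldCache) rdd.unpersist()
}
{code}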
[jira] [Commented] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value
[ https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972302#comment-16972302 ] Hyukjin Kwon commented on SPARK-29854: -- [~abhishek.akg] Can you please use \{code\} ... \{code\} block. > lpad and rpad built in function not throw Exception for invalid len value > - > > Key: SPARK-29854 > URL: https://issues.apache.org/jira/browse/SPARK-29854 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > Spark Returns Empty String) > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT > lpad('hihhh', 5000, ''); > ++ > |lpad(hihhh, CAST(5000 AS INT), > )| > ++ > ++ > Hive: > SELECT lpad('hihhh', 5000, > ''); > Error: Error while compiling statement: FAILED: SemanticException [Error > 10016]: Line 1:67 Argument type mismatch '''': lpad only takes > INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016) > PostgreSQL > function lpad(unknown, numeric, unknown) does not exist > > Expected output: > In Spark also it should throw Exception like Hive > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value
[ https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972303#comment-16972303 ] Hyukjin Kwon commented on SPARK-29854: -- Other DBMSes have rpad and lpad too. Can you check? > lpad and rpad built in function not throw Exception for invalid len value > - > > Key: SPARK-29854 > URL: https://issues.apache.org/jira/browse/SPARK-29854 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > Spark Returns Empty String) > {code} > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT > lpad('hihhh', 5000, ''); > ++ > |lpad(hihhh, CAST(5000 AS INT), > )| > ++ > ++ > {code} > Hive: > {code} > SELECT lpad('hihhh', 5000, > ''); > Error: Error while compiling statement: FAILED: SemanticException [Error > 10016]: Line 1:67 Argument type mismatch '''': lpad only takes > INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016) > {code} > PostgreSQL > {code} > function lpad(unknown, numeric, unknown) does not exist > {code} > > Expected output: > In Spark also it should throw Exception like Hive > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value
[ https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29854: - Description: Spark Returns Empty String) {code} 0: jdbc:hive2://10.18.19.208:23040/default> SELECT lpad('hihhh', 5000, ''); ++ |lpad(hihhh, CAST(5000 AS INT), )| ++ ++ {code} Hive: {code} SELECT lpad('hihhh', 5000, ''); Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 1:67 Argument type mismatch '''': lpad only takes INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016) {code} PostgreSQL {code} function lpad(unknown, numeric, unknown) does not exist {code} Expected output: In Spark also it should throw Exception like Hive was: Spark Returns Empty String) 0: jdbc:hive2://10.18.19.208:23040/default> SELECT lpad('hihhh', 5000, ''); ++ |lpad(hihhh, CAST(5000 AS INT), )| ++ ++ Hive: SELECT lpad('hihhh', 5000, ''); Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 1:67 Argument type mismatch '''': lpad only takes INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016) PostgreSQL function lpad(unknown, numeric, unknown) does not exist Expected output: In Spark also it should throw Exception like Hive > lpad and rpad built in function not throw Exception for invalid len value > - > > Key: SPARK-29854 > URL: https://issues.apache.org/jira/browse/SPARK-29854 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > Spark Returns Empty String) > {code} > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT > lpad('hihhh', 5000, ''); > ++ > |lpad(hihhh, CAST(5000 AS INT), > )| > ++ > ++ > {code} > Hive: > {code} > SELECT lpad('hihhh', 5000, > ''); > Error: Error while compiling statement: FAILED: SemanticException [Error > 10016]: Line 1:67 Argument type mismatch '''': lpad only takes > INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016) > {code} > PostgreSQL > {code} > function lpad(unknown, numeric, unknown) does not exist > {code} > > Expected output: > In Spark also it should throw Exception like Hive > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29771) Limit executor max failures before failing the application
[ https://issues.apache.org/jira/browse/SPARK-29771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jackey Lee updated SPARK-29771: --- Description: ExecutorPodsAllocator does not limit the number of executor errors or deletions, which may cause executor restart continuously without application failure. A simple example for this, add {{--conf spark.executor.extraJavaOptions=-Xmse}} after spark-submit, which can make executor restart thousands of times without application failure. was: At present, K8S scheduling does not limit the number of failures of the executors, which may cause executor retried continuously without failing. A simple example, we add a resource limit on default namespace. After the driver is started, if the quota is full, the executor will retry the creation continuously, resulting in a large amount of pod information accumulation. When many applications encounter such situations, they will affect the K8S cluster. > Limit executor max failures before failing the application > -- > > Key: SPARK-29771 > URL: https://issues.apache.org/jira/browse/SPARK-29771 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Jackey Lee >Priority: Major > > ExecutorPodsAllocator does not limit the number of executor errors or > deletions, which may cause executor restart continuously without application > failure. > A simple example for this, add {{--conf > spark.executor.extraJavaOptions=-Xmse}} after spark-submit, which can make > executor restart thousands of times without application failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
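As a rough illustration of the kind of limit being proposed (not the actual ExecutorPodsAllocator change; the class name, method names, and threshold are invented), a driver-side counter could look like this:

{code:scala}
// Hypothetical sketch: abort the application once too many consecutive executor
// failures are observed, instead of letting executors restart indefinitely.
class ExecutorFailureTracker(maxFailures: Int) {
  private var consecutiveFailures = 0

  def onExecutorFailed(reason: String): Unit = synchronized {
    consecutiveFailures += 1
    if (consecutiveFailures >= maxFailures) {
      throw new IllegalStateException(
        s"Giving up after $consecutiveFailures consecutive executor failures (limit $maxFailures): $reason")
    }
  }

  // Any executor that reaches the RUNNING state resets the streak.
  def onExecutorRunning(): Unit = synchronized { consecutiveFailures = 0 }
}
{code}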
[jira] [Issue Comment Deleted] (SPARK-29771) Limit executor max failures before failing the application
[ https://issues.apache.org/jira/browse/SPARK-29771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jackey Lee updated SPARK-29771: --- Comment: was deleted (was: This patch is mainly used in the scenario where the executor started failed. The executor runtime failure, which is caused by task errors is controlled by spark.executor.maxFailures. Another Example, add `--conf spark.executor.extraJavaOptions=-Xmse` after spark-submit, which can also appear executor crazy retry.) > Limit executor max failures before failing the application > -- > > Key: SPARK-29771 > URL: https://issues.apache.org/jira/browse/SPARK-29771 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Jackey Lee >Priority: Major > > ExecutorPodsAllocator does not limit the number of executor errors or > deletions, which may cause executor restart continuously without application > failure. > A simple example for this, add {{--conf > spark.executor.extraJavaOptions=-Xmse}} after spark-submit, which can make > executor restart thousands of times without application failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29349) Support FETCH_PRIOR in Thriftserver query results fetching
[ https://issues.apache.org/jira/browse/SPARK-29349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972345#comment-16972345 ] runzhiwang commented on SPARK-29349: [~juliuszsompolski] Hi, Could you explain how to use this feature ? Because hive-jdbc does not support this. > Support FETCH_PRIOR in Thriftserver query results fetching > -- > > Key: SPARK-29349 > URL: https://issues.apache.org/jira/browse/SPARK-29349 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski >Priority: Major > Fix For: 3.0.0 > > > Support FETCH_PRIOR fetching in Thriftserver -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29349) Support FETCH_PRIOR in Thriftserver query results fetching
[ https://issues.apache.org/jira/browse/SPARK-29349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972369#comment-16972369 ] Juliusz Sompolski commented on SPARK-29349: --- [~runzhiwang] it is used by Simba Spark ODBC driver to recover after a connection failure during fetching when AutoReconnect=1. It is available in Simba Spark ODBC driver v2.6.10 - I believe it is used only to recover the cursor position after reconnecting, not as a user facing feature to allow fetching backwards. > Support FETCH_PRIOR in Thriftserver query results fetching > -- > > Key: SPARK-29349 > URL: https://issues.apache.org/jira/browse/SPARK-29349 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski >Priority: Major > Fix For: 3.0.0 > > > Support FETCH_PRIOR fetching in Thriftserver -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29857) [WEB UI] Support defer render the spark history summary page.
feiwang created SPARK-29857: --- Summary: [WEB UI] Support defer render the spark history summary page. Key: SPARK-29857 URL: https://issues.apache.org/jira/browse/SPARK-29857 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.4.4 Reporter: feiwang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29823) Improper persist strategy in mllib.clustering.KMeans.run()
[ https://issues.apache.org/jira/browse/SPARK-29823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972473#comment-16972473 ] Aman Omer commented on SPARK-29823: --- I will check this one. > Improper persist strategy in mllib.clustering.KMeans.run() > -- > > Key: SPARK-29823 > URL: https://issues.apache.org/jira/browse/SPARK-29823 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.4.3 >Reporter: Dong Wang >Priority: Major > > In mllib.clustering.KMeans.run(), the rdd _norms_ is > persisted. But _norms_ only has a single child, i.e., > the rdd _zippedData_, which was not persisted. So all the actions that rely > on _norms_ also rely on _zippedData_. The rdd > _zippedData_ will be used multiple times in > _runAlgorithm()_. Therefore _zippedData_ should be > persisted, not _norms_. > {code:scala} > private[spark] def run( > data: RDD[Vector], > instr: Option[Instrumentation]): KMeansModel = { > if (data.getStorageLevel == StorageLevel.NONE) { > logWarning("The input data is not directly cached, which may hurt > performance if its" > + " parent RDDs are also uncached.") > } > // Compute squared norms and cache them. > val norms = data.map(Vectors.norm(_, 2.0)) > norms.persist() // Unnecessary persist. Only used to generate zippedData. > val zippedData = data.zip(norms).map { case (v, norm) => > new VectorWithNorm(v, norm) > } // needs to persist > val model = runAlgorithm(zippedData, instr) > norms.unpersist() // Change to zippedData.unpersist() > {code} > This issue is reported by our tool CacheCheck, which is used to dynamically > detect persist()/unpersist() API misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
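Reading the snippet above, the suggested change amounts to moving the cache one step down the lineage. A fragment-level sketch of that idea, in the same style as the quoted code and not the actual patch, could be:

{code:scala}
// Sketch: cache the RDD that is reused across iterations (zippedData),
// not its only-used-once parent (norms).
val norms = data.map(Vectors.norm(_, 2.0))
val zippedData = data.zip(norms).map { case (v, norm) => new VectorWithNorm(v, norm) }
zippedData.persist(StorageLevel.MEMORY_AND_DISK)
val model = runAlgorithm(zippedData, instr)
zippedData.unpersist()
{code}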
[jira] [Assigned] (SPARK-29855) typed literals with negative sign with proper result or exception
[ https://issues.apache.org/jira/browse/SPARK-29855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-29855: Assignee: Kent Yao > typed literals with negative sign with proper result or exception > - > > Key: SPARK-29855 > URL: https://issues.apache.org/jira/browse/SPARK-29855 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > {code:java} > -- !query 83 > select -integer '7' > -- !query 83 schema > struct<7:int> > -- !query 83 output > 7 > -- !query 86 > select -date '1999-01-01' > -- !query 86 schema > struct > -- !query 86 output > 1999-01-01 > -- !query 87 > select -timestamp '1999-01-01' > -- !query 87 schema > struct > -- !query 87 output > 1999-01-01 00:00:00 > {code} > the integer should be -7 and the date and timestamp results are confusing > which should throw exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29855) typed literals with negative sign with proper result or exception
[ https://issues.apache.org/jira/browse/SPARK-29855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29855. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26479 [https://github.com/apache/spark/pull/26479] > typed literals with negative sign with proper result or exception > - > > Key: SPARK-29855 > URL: https://issues.apache.org/jira/browse/SPARK-29855 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > {code:java} > -- !query 83 > select -integer '7' > -- !query 83 schema > struct<7:int> > -- !query 83 output > 7 > -- !query 86 > select -date '1999-01-01' > -- !query 86 schema > struct > -- !query 86 output > 1999-01-01 > -- !query 87 > select -timestamp '1999-01-01' > -- !query 87 schema > struct > -- !query 87 output > 1999-01-01 00:00:00 > {code} > the integer should be -7 and the date and timestamp results are confusing > which should throw exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value
[ https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK KUMAR GUPTA updated SPARK-29854: - Description: Spark Returns Empty String) {code} 0: jdbc:hive2://10.18.19.208:23040/default> SELECT lpad('hihhh', 5000, ''); ++ |lpad(hihhh, CAST(5000 AS INT), )| ++ ++ Hive: SELECT lpad('hihhh', 5000, ''); Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 1:67 Argument type mismatch '''': lpad only takes INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016) PostgreSQL function lpad(unknown, numeric, unknown) does not exist Expected output: In Spark also it should throw Exception like Hive {code} was: Spark Returns Empty String) {code} 0: jdbc:hive2://10.18.19.208:23040/default> SELECT lpad('hihhh', 5000, ''); ++ |lpad(hihhh, CAST(5000 AS INT), )| ++ ++ {code} Hive: {code} SELECT lpad('hihhh', 5000, ''); Error: Error while compiling statement: FAILED: SemanticException [Error 10016]: Line 1:67 Argument type mismatch '''': lpad only takes INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016) {code} PostgreSQL {code} function lpad(unknown, numeric, unknown) does not exist {code} Expected output: In Spark also it should throw Exception like Hive > lpad and rpad built in function not throw Exception for invalid len value > - > > Key: SPARK-29854 > URL: https://issues.apache.org/jira/browse/SPARK-29854 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > Spark Returns Empty String) > {code} > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT > lpad('hihhh', 5000, ''); > ++ > |lpad(hihhh, CAST(5000 AS INT), > )| > ++ > ++ > Hive: > SELECT lpad('hihhh', 5000, > ''); > Error: Error while compiling statement: FAILED: SemanticException [Error > 10016]: Line 1:67 Argument type mismatch '''': lpad only takes > INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016) > PostgreSQL > function lpad(unknown, numeric, unknown) does not exist > > Expected output: > In Spark also it should throw Exception like Hive > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29710) Seeing offsets not resetting even when reset policy is configured explicitly
[ https://issues.apache.org/jira/browse/SPARK-29710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972515#comment-16972515 ] Gabor Somogyi commented on SPARK-29710: --- [~BdLearner] bump, can you provide any useful information like logs? Otherwise I'm going to close this as incomplete. > Seeing offsets not resetting even when reset policy is configured explicitly > > > Key: SPARK-29710 > URL: https://issues.apache.org/jira/browse/SPARK-29710 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.1 > Environment: Window10 , eclipse neos >Reporter: Shyam >Priority: Major > > > even after setting *"auto.offset.reset" to "latest"* I am getting below > error > > org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of > range with no configured reset policy for partitions: > \{COMPANY_TRANSACTIONS_INBOUND-16=168}org.apache.kafka.clients.consumer.OffsetOutOfRangeException: > Offsets out of range with no configured reset policy for partitions: > \{COMPANY_TRANSACTIONS_INBOUND-16=168} at > org.apache.kafka.clients.consumer.internals.Fetcher.throwIfOffsetOutOfRange(Fetcher.java:348) > at > org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:396) > at > org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:999) > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:937) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer.fetchData(KafkaDataConsumer.scala:470) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchRecord(KafkaDataConsumer.scala:361) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:251) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:234) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:209) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:234) > > [https://stackoverflow.com/questions/58653885/even-after-setting-auto-offset-reset-to-latest-getting-error-offsetoutofrang] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
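For context, in Structured Streaming the Kafka source manages offsets itself and does not honour the consumer's auto.offset.reset; the documented knobs are startingOffsets and failOnDataLoss. A hedged example of how that is usually configured (the broker and topic values below are placeholders):

{code:scala}
// Placeholder names throughout; this only illustrates the documented source options.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "COMPANY_TRANSACTIONS_INBOUND")
  .option("startingOffsets", "latest")   // used instead of auto.offset.reset
  .option("failOnDataLoss", "false")     // tolerate offsets that have aged out of the topic
  .load()
{code}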
[jira] [Commented] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown
[ https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972524#comment-16972524 ] Julian Eberius commented on SPARK-20050: Sorry to comment on an old, closed issue, but: what is the solution for this issue? I have noticed the same issue: the offsets of the last batch of a Spark/Kafka streaming app are never committed to Kafka, as commitAll() is only called in compute(). In effect, offsets committed by the user via the CanCommitOffsets interface are actually only ever committed in the following batch. If there is no following batch, nothing is committed. Therefore, on the next streaming job instance, duplicate processing will occur. The Github PR therefore makes a lot of sense to me. Is there anything that I'm missing? What is the recommended way to commit offsets into Kafka on Spark Streaming application shutdown? > Kafka 0.10 DirectStream doesn't commit last processed batch's offset when > graceful shutdown > --- > > Key: SPARK-20050 > URL: https://issues.apache.org/jira/browse/SPARK-20050 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sasaki Toru >Priority: Major > > I use Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' and > call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as below > {code} > val kafkaStream = KafkaUtils.createDirectStream[String, String](...) > kafkaStream.map { input => > "key: " + input.key.toString + " value: " + input.value.toString + " > offset: " + input.offset.toString > }.foreachRDD { rdd => > rdd.foreach { input => > println(input) > } > } > kafkaStream.foreachRDD { rdd => > val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges > kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) > } > {code} > Some records which were processed in the last batch before graceful shutdown > are reprocessed in the first batch after Spark Streaming restarts, as below > * output of the first run of this application > {code} > key: null value: 1 offset: 101452472 > key: null value: 2 offset: 101452473 > key: null value: 3 offset: 101452474 > key: null value: 4 offset: 101452475 > key: null value: 5 offset: 101452476 > key: null value: 6 offset: 101452477 > key: null value: 7 offset: 101452478 > key: null value: 8 offset: 101452479 > key: null value: 9 offset: 101452480 // this is the last record before > Spark Streaming is shut down gracefully > {code} > * output of the re-run of this application > {code} > key: null value: 7 offset: 101452478 // duplication > key: null value: 8 offset: 101452479 // duplication > key: null value: 9 offset: 101452480 // duplication > key: null value: 10 offset: 101452481 > {code} > This may be because offsets specified in commitAsync are only committed at the head of the next > batch. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
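One way to at least observe whether that final commit ever happens is the callback variant of commitAsync. The sketch below is an illustration, not a recommendation from this thread; the processing and logging are placeholders, and it assumes the kafkaStream value from the quoted snippet.

{code:scala}
import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

kafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd ...
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges, new OffsetCommitCallback {
    override def onComplete(offsets: java.util.Map[TopicPartition, OffsetAndMetadata], e: Exception): Unit =
      if (e != null) println(s"Offset commit failed: $e") else println(s"Committed: $offsets")
  })
}
{code}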
[jira] [Created] (SPARK-29858) ALTER DATABASE (SET DBPROPERTIES) should look up catalog like v2 commands
Hu Fuwang created SPARK-29858: - Summary: ALTER DATABASE (SET DBPROPERTIES) should look up catalog like v2 commands Key: SPARK-29858 URL: https://issues.apache.org/jira/browse/SPARK-29858 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hu Fuwang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29858) ALTER DATABASE (SET DBPROPERTIES) should look up catalog like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972530#comment-16972530 ] Hu Fuwang commented on SPARK-29858: --- Working on this. > ALTER DATABASE (SET DBPROPERTIES) should look up catalog like v2 commands > - > > Key: SPARK-29858 > URL: https://issues.apache.org/jira/browse/SPARK-29858 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29859) ALTER DATABASE (SET LOCATION) should look up catalog like v2 commands
Hu Fuwang created SPARK-29859: - Summary: ALTER DATABASE (SET LOCATION) should look up catalog like v2 commands Key: SPARK-29859 URL: https://issues.apache.org/jira/browse/SPARK-29859 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Hu Fuwang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29859) ALTER DATABASE (SET LOCATION) should look up catalog like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972532#comment-16972532 ] Hu Fuwang commented on SPARK-29859: --- Working on this. > ALTER DATABASE (SET LOCATION) should look up catalog like v2 commands > - > > Key: SPARK-29859 > URL: https://issues.apache.org/jira/browse/SPARK-29859 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
feiwang created SPARK-29860: --- Summary: [SQL] Fix data type mismatch issue for inSubQuery Key: SPARK-29860 URL: https://issues.apache.org/jira/browse/SPARK-29860 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: feiwang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29861) Reduce leader election downtime in Spark standalone HA
Robin Wolters created SPARK-29861: - Summary: Reduce leader election downtime in Spark standalone HA Key: SPARK-29861 URL: https://issues.apache.org/jira/browse/SPARK-29861 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.1 Reporter: Robin Wolters As officially stated in the Spark [HA documentation|https://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper], the recovery process of the Spark (standalone) master in HA with ZooKeeper takes about 1-2 minutes. During this time no Spark master is active, which makes interaction with Spark essentially impossible. After looking for a way to reduce this downtime, it seems that it is mainly caused by the leader election, which waits for open ZooKeeper connections to be closed. This seems like unnecessary downtime, for example in the case of a planned VM update. I have fixed this in my setup by: # Closing open ZooKeeper connections during Spark shutdown # Bumping the Curator version and implementing a custom error policy that is tolerant of a ZooKeeper connection suspension. I am preparing a pull request for review / further discussion on this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
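The report does not include code, so as a rough illustration of point 1 (closing ZooKeeper connections on shutdown), the sketch below uses Curator's LeaderLatch the way a standalone master's election agent might, and closes the latch and client explicitly in a shutdown hook so the standby master does not have to wait for a session timeout. The connect string and latch path are assumptions, and this is not the actual patch:
{code:scala}
// Illustrative sketch only: close the Curator client on shutdown instead of
// letting the ZooKeeper session expire, so leader election can proceed quickly.
import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
import org.apache.curator.framework.recipes.leader.LeaderLatch
import org.apache.curator.retry.ExponentialBackoffRetry

object MasterLeaderElectionSketch {
  def main(args: Array[String]): Unit = {
    val client: CuratorFramework = CuratorFrameworkFactory.newClient(
      "zk1:2181,zk2:2181,zk3:2181",                    // assumed connect string
      new ExponentialBackoffRetry(1000, 3))
    client.start()

    val latch = new LeaderLatch(client, "/spark/leader_election")  // assumed latch path
    latch.start()

    sys.addShutdownHook {
      // Releasing the latch and closing the client lets another master take over
      // without waiting for the old ZooKeeper session to time out.
      latch.close()
      client.close()
    }
  }
}
{code}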
[jira] [Updated] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
[ https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29860: Description: The follow statement would throw an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").shown() {code} {code:java} // Some comments here public String getFoo() { return foo; } {code} was: The follow statement would throw an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").shown() {code} cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery Mismatched columns: [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] Left side: [decimal(18,0)]. Right side: [decimal(19,0)].;; 'Project [*] +- 'Filter id#219 IN (list#218 []) : +- Project [id#220] : +- SubqueryAlias `default`.`tb` :+- Relation[id#220] parquet +- SubqueryAlias `default`.`ta` +- Relation[id#219] parquet > [SQL] Fix data type mismatch issue for inSubQuery > - > > Key: SPARK-29860 > URL: https://issues.apache.org/jira/browse/SPARK-29860 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > > The follow statement would throw an exception. > {code:java} > sql("create table ta(id Decimal(18,0)) using parquet") > sql("create table tb(id Decimal(19,0)) using parquet") > sql("select * from ta where id in (select id from tb)").shown() > {code} > {code:java} > // Some comments here > public String getFoo() > { > return foo; > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
[ https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29860: Description: The follow statement would throw an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").shown() {code} {code:java} // Some comments here cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery Mismatched columns: [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] Left side: [decimal(18,0)]. Right side: [decimal(19,0)].;; 'Project [*] +- 'Filter id#219 IN (list#218 []) : +- Project [id#220] : +- SubqueryAlias `default`.`tb` :+- Relation[id#220] parquet +- SubqueryAlias `default`.`ta` +- Relation[id#219] parquet {code} was: The follow statement would throw an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").shown() {code} {code:java} // Some comments here public String getFoo() { return foo; } {code} > [SQL] Fix data type mismatch issue for inSubQuery > - > > Key: SPARK-29860 > URL: https://issues.apache.org/jira/browse/SPARK-29860 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > > The follow statement would throw an exception. > {code:java} > sql("create table ta(id Decimal(18,0)) using parquet") > sql("create table tb(id Decimal(19,0)) using parquet") > sql("select * from ta where id in (select id from tb)").shown() > {code} > {code:java} > // Some comments here > cannot resolve '(default.ta.`id` IN (listquery()))' due to data type > mismatch: > The data type of one or more elements in the left hand side of an IN subquery > is not compatible with the data type of the output of the subquery > Mismatched columns: > [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] > Left side: > [decimal(18,0)]. > Right side: > [decimal(19,0)].;; > 'Project [*] > +- 'Filter id#219 IN (list#218 []) >: +- Project [id#220] >: +- SubqueryAlias `default`.`tb` >:+- Relation[id#220] parquet >+- SubqueryAlias `default`.`ta` > +- Relation[id#219] parquet > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
[ https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29860: Description: The follow statement would throw an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").shown() {code} cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery Mismatched columns: [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] Left side: [decimal(18,0)]. Right side: [decimal(19,0)].;; 'Project [*] +- 'Filter id#219 IN (list#218 []) : +- Project [id#220] : +- SubqueryAlias `default`.`tb` :+- Relation[id#220] parquet +- SubqueryAlias `default`.`ta` +- Relation[id#219] parquet > [SQL] Fix data type mismatch issue for inSubQuery > - > > Key: SPARK-29860 > URL: https://issues.apache.org/jira/browse/SPARK-29860 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > > The follow statement would throw an exception. > {code:java} > sql("create table ta(id Decimal(18,0)) using parquet") > sql("create table tb(id Decimal(19,0)) using parquet") > sql("select * from ta where id in (select id from tb)").shown() > {code} > cannot resolve '(default.ta.`id` IN (listquery()))' due to data type > mismatch: > The data type of one or more elements in the left hand side of an IN subquery > is not compatible with the data type of the output of the subquery > Mismatched columns: > [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] > Left side: > [decimal(18,0)]. > Right side: > [decimal(19,0)].;; > 'Project [*] > +- 'Filter id#219 IN (list#218 []) >: +- Project [id#220] >: +- SubqueryAlias `default`.`tb` >:+- Relation[id#220] parquet >+- SubqueryAlias `default`.`ta` > +- Relation[id#219] parquet -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
[ https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29860: Description: The following statement throws an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").show() {code} {code:java} // Exception information cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery Mismatched columns: [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] Left side: [decimal(18,0)]. Right side: [decimal(19,0)].;; 'Project [*] +- 'Filter id#219 IN (list#218 []) : +- Project [id#220] : +- SubqueryAlias `default`.`tb` :+- Relation[id#220] parquet +- SubqueryAlias `default`.`ta` +- Relation[id#219] parquet {code} was: The following statement throws an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").show() {code} {code:java} // Some comments here cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery Mismatched columns: [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] Left side: [decimal(18,0)]. Right side: [decimal(19,0)].;; 'Project [*] +- 'Filter id#219 IN (list#218 []) : +- Project [id#220] : +- SubqueryAlias `default`.`tb` :+- Relation[id#220] parquet +- SubqueryAlias `default`.`ta` +- Relation[id#219] parquet {code} > [SQL] Fix data type mismatch issue for inSubQuery > - > > Key: SPARK-29860 > URL: https://issues.apache.org/jira/browse/SPARK-29860 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > > The following statement throws an exception. > {code:java} > sql("create table ta(id Decimal(18,0)) using parquet") > sql("create table tb(id Decimal(19,0)) using parquet") > sql("select * from ta where id in (select id from tb)").show() > {code} > {code:java} > // Exception information > cannot resolve '(default.ta.`id` IN (listquery()))' due to data type > mismatch: > The data type of one or more elements in the left hand side of an IN subquery > is not compatible with the data type of the output of the subquery > Mismatched columns: > [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] > Left side: > [decimal(18,0)]. > Right side: > [decimal(19,0)].;; > 'Project [*] > +- 'Filter id#219 IN (list#218 []) >: +- Project [id#220] >: +- SubqueryAlias `default`.`tb` >:+- Relation[id#220] parquet >+- SubqueryAlias `default`.`ta` > +- Relation[id#219] parquet > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
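Until the coercion is handled by the analyzer, one possible workaround (my suggestion, not something proposed in this ticket) is to cast the narrower decimal side explicitly so both sides of the IN subquery have the same type:
{code:scala}
// Hypothetical workaround: align the decimal precision on both sides explicitly.
// Table and column names match the reproducer above.
sql("create table ta(id Decimal(18,0)) using parquet")
sql("create table tb(id Decimal(19,0)) using parquet")

// Casting ta.id to decimal(19,0) avoids the data-type-mismatch analysis error.
sql("select * from ta where cast(id as decimal(19,0)) in (select id from tb)").show()
{code}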
[jira] [Commented] (SPARK-29862) CREATE (OR REPLACE) ... VIEW should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-29862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972560#comment-16972560 ] Huaxin Gao commented on SPARK-29862: I will work on this. > CREATE (OR REPLACE) ... VIEW should look up catalog/table like v2 commands > -- > > Key: SPARK-29862 > URL: https://issues.apache.org/jira/browse/SPARK-29862 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > > CREATE (OR REPLACE) ... VIEW should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29862) CREATE (OR REPLACE) ... VIEW should look up catalog/table like v2 commands
Huaxin Gao created SPARK-29862: -- Summary: CREATE (OR REPLACE) ... VIEW should look up catalog/table like v2 commands Key: SPARK-29862 URL: https://issues.apache.org/jira/browse/SPARK-29862 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Huaxin Gao CREATE (OR REPLACE) ... VIEW should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29857) [WEB UI] Support defer render the spark history summary page.
[ https://issues.apache.org/jira/browse/SPARK-29857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29857: Description: When there are many applications in the Spark history server, rendering the history summary page is heavy; we can enable deferRender to tune it. See details at https://datatables.net/examples/ajax/defer_render.html > [WEB UI] Support defer render the spark history summary page. > -- > > Key: SPARK-29857 > URL: https://issues.apache.org/jira/browse/SPARK-29857 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > > When there are many applications in the Spark history server, rendering the > history summary page is heavy; we can enable deferRender to tune it. > See details at https://datatables.net/examples/ajax/defer_render.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972605#comment-16972605 ] Felix Kizhakkel Jose commented on SPARK-29764: -- So it's still hard to follow the reproducer? Do I need to skim it further? > Error on Serializing POJO with java datetime property to a Parquet file > --- > > Key: SPARK-29764 > URL: https://issues.apache.org/jira/browse/SPARK-29764 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Core, SQL >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > Attachments: SparkParquetSampleCode.docx > > > Hello, > I have been doing a proof of concept for data lake structure and analytics > using Apache Spark. > When I add a java java.time.LocalDateTime/java.time.LocalDate properties in > my data model, the serialization to Parquet start failing. > *My Data Model:* > @Data > public class Employee > { private UUID id = UUID.randomUUID(); private String name; private int age; > private LocalDate dob; private LocalDateTime startDateTime; private String > phone; private Address address; } > > *Serialization Snippet* > {color:#0747a6}public void serialize(){color} > {color:#0747a6}{ List inputDataToSerialize = > getInputDataToSerialize(); // this creates 100,000 employee objects > Encoder employeeEncoder = Encoders.bean(Employee.class); > Dataset employeeDataset = sparkSession.createDataset( > inputDataToSerialize, employeeEncoder ); employeeDataset.write() > .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); > }{color} > +*Exception Stack Trace:* > + > *java.lang.IllegalStateException: Failed to execute > CommandLineRunnerjava.lang.IllegalStateException: Failed to execute > CommandLineRunner at > org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803) > at > org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784) > at > org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) > at com.felix.Application.main(Application.java:45)Caused by: > org.apache.spark.SparkException: Job aborted. 
at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) at > com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24) at > com.felix.Application.run(Application.java:63) at > org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800) > ... 6 moreCaused by: org.apache.spark.SparkException: Job aborted due to > stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
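A possible interim workaround for the report above (my suggestion, not something stated in the ticket) is to model the date fields with java.sql.Date/java.sql.Timestamp, which the Spark 2.4 encoders do support, and convert from the java.time values at the boundary. A minimal Scala sketch, with an assumed output path and assumed field values:
{code:scala}
// Sketch of the workaround: use java.sql types, for which 2.4 encoders exist.
import java.sql.{Date, Timestamp}
import java.time.{LocalDate, LocalDateTime}
import org.apache.spark.sql.SparkSession

case class EmployeeRow(
    id: String,
    name: String,
    age: Int,
    dob: Date,                 // instead of java.time.LocalDate
    startDateTime: Timestamp)  // instead of java.time.LocalDateTime

object ParquetDateWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-date-workaround").getOrCreate()
    import spark.implicits._

    val rows = Seq(
      EmployeeRow(java.util.UUID.randomUUID().toString, "Alice", 30,
        Date.valueOf(LocalDate.of(1989, 1, 1)),
        Timestamp.valueOf(LocalDateTime.of(2019, 11, 12, 9, 0))))

    // Writes cleanly because every field type has a supported encoder mapping.
    rows.toDS().write.mode("append").parquet("/tmp/employees.parquet")  // assumed output path
  }
}
{code}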
[jira] [Commented] (SPARK-29762) GPU Scheduling - default task resource amount to 1
[ https://issues.apache.org/jira/browse/SPARK-29762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972631#comment-16972631 ] Imran Rashid commented on SPARK-29762: -- I don't really understand the complication. I know there would be some special casing for GPUs in the config parsing code (e.g. in {{org.apache.spark.resource.ResourceUtils#parseResourceRequirements}}), but that doesn't seem too bad. I did think about this more, and realized it gets a bit confusing when you add in task-level resource constraints: you won't schedule optimally for tasks that don't need GPUs, and you won't have GPUs left over for the tasks that do need them. E.g., say you had each executor set up with 4 cores and 2 GPUs. If you had one task set come in which only needed CPU, you would only run 2 copies. And then if another task set came in which did need the GPUs, you wouldn't be able to schedule it. You can't end up in that situation until you have task-specific resource constraints. But does it get too messy to have sensible defaults in that situation? Maybe the user specifies GPUs as an executor resource up front, for the whole cluster, because they have them available and they know some significant fraction of the workloads need them. They might think that the regular tasks will just ignore the GPUs, and the tasks that do need GPUs would just specify them as task-level constraints. I guess this might have been a bad suggestion after all, sorry. > GPU Scheduling - default task resource amount to 1 > -- > > Key: SPARK-29762 > URL: https://issues.apache.org/jira/browse/SPARK-29762 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Default the task level resource configs (for gpu/fpga, etc) to 1. So if the > user specifies the executor resource then to make it more user friendly lets > have the task resource config default to 1. This is ok right now since we > require resources to have an address. It also matches what we do for the > spark.task.cpus configs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29850) sort-merge-join an empty table should not memory leak
[ https://issues.apache.org/jira/browse/SPARK-29850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29850. - Fix Version/s: 3.0.0 2.4.5 Resolution: Fixed Issue resolved by pull request 26471 [https://github.com/apache/spark/pull/26471] > sort-merge-join an empty table should not memory leak > - > > Key: SPARK-29850 > URL: https://issues.apache.org/jira/browse/SPARK-29850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.5, 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29863) rename EveryAgg/AnyAgg to BoolAnd/BoolOr
Wenchen Fan created SPARK-29863: --- Summary: rename EveryAgg/AnyAgg to BoolAnd/BoolOr Key: SPARK-29863 URL: https://issues.apache.org/jira/browse/SPARK-29863 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29762) GPU Scheduling - default task resource amount to 1
[ https://issues.apache.org/jira/browse/SPARK-29762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972661#comment-16972661 ] Thomas Graves commented on SPARK-29762: --- The code that gets the task requirements is generic and is reused by more than just tasks. Also, there is code that loops over the task requirements and assumes all task requirements are there, but you can't do that if you are relying on a default value, because they won't be present. At some point you either have to look at all executor requirements and then build up the task requirements that include the default ones, or you just have to change those loops to be over the executor requirements. I'm not saying it's not possible; my comment was just saying it isn't as straightforward as I originally thought. I don't think you have the issue you are talking about, because you have to have the task requirements be based on the executor resource requests (which is part of the complication). As long as you do that with the stage-level scheduling, I think it's ok. ResourceProfile ExecutorRequest = 1 CPU, 1 GPU, user-defined taskRequest = null, then translate into taskRequest = 1 CPU, 1 GPU. > GPU Scheduling - default task resource amount to 1 > -- > > Key: SPARK-29762 > URL: https://issues.apache.org/jira/browse/SPARK-29762 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Default the task level resource configs (for gpu/fpga, etc) to 1. So if the > user specifies the executor resource then to make it more user friendly lets > have the task resource config default to 1. This is ok right now since we > require resources to have an address. It also matches what we do for the > spark.task.cpus configs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29864) Strict parsing of day-time strings to intervals
Maxim Gekk created SPARK-29864: -- Summary: Strict parsing of day-time strings to intervals Key: SPARK-29864 URL: https://issues.apache.org/jira/browse/SPARK-29864 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, the IntervalUtils.fromDayTimeString() method does not take into account the left bound `from` and truncates the result using the right bound `to`. The method should respect the bounds specified by the user. Oracle and MySQL respect the user's bounds; see https://github.com/apache/spark/pull/26358#issuecomment-551942719 and https://github.com/apache/spark/pull/26358#issuecomment-549272475 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
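As a rough illustration of the bounds in question (my own example, assuming a spark-shell style `spark` session; the literal values and expected behaviour are assumptions, not taken from the ticket): a day-time interval literal declares both a `from` and a `to` unit, and strict parsing would reject input carrying fields outside those bounds instead of silently truncating it.
{code:scala}
// Hypothetical queries illustrating the from/to bounds of a day-time interval literal.
spark.sql("SELECT INTERVAL '10 09' DAY TO HOUR").show()
// 10 days 9 hours: the fields match the declared DAY TO HOUR bounds, so it should parse.

spark.sql("SELECT INTERVAL '10 09:30:45' DAY TO HOUR").show()
// Minutes and seconds fall outside DAY TO HOUR; strict parsing (as proposed here)
// would raise an error rather than quietly dropping ':30:45'.
{code}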
[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark
[ https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972697#comment-16972697 ] Shane Knapp commented on SPARK-29106: - i would definitely be happy to have a faster VM available for testing... 5h33m is a major improvement over 10h. :) at the same time, i'm working w/my lead sysadmin and we will most likely be purchasing an actual aarch64/ARM server for our build system. the ETA for this will be most likely in early 2020, so we can start testing on the new VM as soon as you're ready. > Add jenkins arm test for spark > -- > > Key: SPARK-29106 > URL: https://issues.apache.org/jira/browse/SPARK-29106 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > Attachments: R-ansible.yml, R-libs.txt, arm-python36.txt > > > Add arm test jobs to amplab jenkins for spark. > Till now we made two arm test periodic jobs for spark in OpenLab, one is > based on master with hadoop 2.7(similar with QA test of amplab jenkins), > other one is based on a new branch which we made on date 09-09, see > [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64] > and > [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64] > We only have to care about the first one when integrate arm test with amplab > jenkins. > About the k8s test on arm, we have took test it, see > [https://github.com/theopenlab/spark/pull/17], maybe we can integrate it > later. > And we plan test on other stable branches too, and we can integrate them to > amplab when they are ready. > We have offered an arm instance and sent the infos to shane knapp, thanks > shane to add the first arm job to amplab jenkins :) > The other important thing is about the leveldbjni > [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80] > spark depends on leveldbjni-all-1.8 > [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8], > we can see there is no arm64 supporting. So we build an arm64 supporting > release of leveldbjni see > [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8], > but we can't modified the spark pom.xml directly with something like > 'property'/'profile' to choose correct jar package on arm or x86 platform, > because spark depends on some hadoop packages like hadoop-hdfs, the packages > depend on leveldbjni-all-1.8 too, unless hadoop release with new arm > supporting leveldbjni jar. Now we download the leveldbjni-al-1.8 of > openlabtesting and 'mvn install' to use it when arm testing for spark. > PS: The issues found and fixed: > SPARK-28770 > [https://github.com/apache/spark/pull/25673] > > SPARK-28519 > [https://github.com/apache/spark/pull/25279] > > SPARK-28433 > [https://github.com/apache/spark/pull/25186] > > SPARK-28467 > [https://github.com/apache/spark/pull/25864] > > SPARK-29286 > [https://github.com/apache/spark/pull/26021] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29762) GPU Scheduling - default task resource amount to 1
[ https://issues.apache.org/jira/browse/SPARK-29762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972703#comment-16972703 ] Imran Rashid commented on SPARK-29762: -- The scenario I was thinking is: application is submitted with app-wide resource requirements for 4 cores & 2 gpus executors, but not for tasks. (I guess they specify "spark.executor.resource.gpu" but not "spark.task.resource.gpu".). One taskset is submitted with no task resource requirements, because it doesn't need any gpus. Another taskset is submitted which does require 1 cpu & 1 gpu per task. The user does it this way because they don't want all of their tasks to use gpus, but they know enough of them will need gpus that they'd rather just request gpus upfront, than scale up & down two different types of executors. > GPU Scheduling - default task resource amount to 1 > -- > > Key: SPARK-29762 > URL: https://issues.apache.org/jira/browse/SPARK-29762 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Default the task level resource configs (for gpu/fpga, etc) to 1. So if the > user specifies the executor resource then to make it more user friendly lets > have the task resource config default to 1. This is ok right now since we > require resources to have an address. It also matches what we do for the > spark.task.cpus configs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
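For context, the settings being debated look roughly like the following in Spark 3.0 (the resource amounts and the discovery-script path are illustrative assumptions; the actual property names carry an `.amount` suffix):
{code:scala}
// Illustrative sketch of the executor- vs task-level GPU configuration in question.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.executor.cores", "4")
  .set("spark.executor.resource.gpu.amount", "2")                        // executor-level request
  .set("spark.executor.resource.gpu.discoveryScript", "/opt/getGpus.sh") // assumed script path
  // The question in this ticket: if the line below is omitted, should it
  // default to 1 GPU per task, or should the user be required to set it?
  .set("spark.task.resource.gpu.amount", "1")

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}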
[jira] [Resolved] (SPARK-29487) Ability to run Spark Kubernetes other than from /opt/spark
[ https://issues.apache.org/jira/browse/SPARK-29487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29487. Resolution: Duplicate This is basically allowing people to customize their docker images, which is SPARK-24655. > Ability to run Spark Kubernetes other than from /opt/spark > -- > > Key: SPARK-29487 > URL: https://issues.apache.org/jira/browse/SPARK-29487 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit >Affects Versions: 2.4.4 >Reporter: Benjamin Miao CAI >Priority: Minor > > In the Spark Kubernetes Dockerfile, the Spark binaries are copied to */opt/spark*. > If we try to create our own Dockerfile without using */opt/spark*, the image will not run. > After looking at the source code, it seems that in various places the path is > hard-coded to */opt/spark* > *Example :* > Constants.scala : > // Spark app configs for containers > val SPARK_CONF_VOLUME = "spark-conf-volume" > val SPARK_CONF_DIR_INTERNAL = "/opt/spark/conf" > > Is it possible to make this configurable so we can put Spark somewhere other than > /opt/? > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29865) k8s executor pods all have different prefixes in client mode
Marcelo Masiero Vanzin created SPARK-29865: -- Summary: k8s executor pods all have different prefixes in client mode Key: SPARK-29865 URL: https://issues.apache.org/jira/browse/SPARK-29865 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.0.0 Reporter: Marcelo Masiero Vanzin This works in cluster mode since the features set things up so that all executor pods have the same name prefix. But in client mode features are not used; so each executor ends up with a different name prefix, which makes debugging a little bit annoying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29866) Upper case enum values
Maxim Gekk created SPARK-29866: -- Summary: Upper case enum values Key: SPARK-29866 URL: https://issues.apache.org/jira/browse/SPARK-29866 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: Maxim Gekk Unify naming of enum values and upper case their names. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29797) Read key-value metadata in Parquet files written by Apache Arrow
[ https://issues.apache.org/jira/browse/SPARK-29797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972725#comment-16972725 ] Isaac Myers commented on SPARK-29797: - To be clear, Arrow does write the metadata because it can be retrieved per the example code (attached). The issue lies with Spark's ability to read it. It does not need to be associated with the df, though 'df.file_metadata()' would be fine. Otherwise, maybe 'spark.read.parquet_metadata()'. > Read key-value metadata in Parquet files written by Apache Arrow > > > Key: SPARK-29797 > URL: https://issues.apache.org/jira/browse/SPARK-29797 > Project: Spark > Issue Type: New Feature > Components: Java API, PySpark >Affects Versions: 2.4.4 > Environment: Apache Arrow 0.14.1 built on Windows x86. > >Reporter: Isaac Myers >Priority: Major > Labels: features > Attachments: minimal_working_example.cpp > > > Key-value (user) metadata written to Parquet file from Apache Arrow c++ is > not readable in Spark (PySpark or Java API). I can only find field-level > metadata dictionaries in the schema and no other functions in the API that > indicate the presence of file-level key-value metadata. The attached code > demonstrates creation and retrieval of file-level metadata using the Apache > Arrow API. > {code:java} > #include #include #include #include #include > #include #include > #include #include > //#include > int main(int argc, char* argv[]){ /* Create > Parquet File **/ arrow::Status st; > arrow::MemoryPool* pool = arrow::default_memory_pool(); > // Create Schema and fields with metadata > std::vector> fields; > std::unordered_map a_keyval; a_keyval["unit"] = > "sec"; a_keyval["note"] = "not the standard millisecond unit"; > arrow::KeyValueMetadata a_md(a_keyval); std::shared_ptr a_field > = arrow::field("a", arrow::int16(), false, a_md.Copy()); > fields.push_back(a_field); > std::unordered_map b_keyval; b_keyval["unit"] = > "ft"; arrow::KeyValueMetadata b_md(b_keyval); std::shared_ptr > b_field = arrow::field("b", arrow::int16(), false, b_md.Copy()); > fields.push_back(b_field); > std::shared_ptr schema = arrow::schema(fields); > // Add metadata to schema. std::unordered_map > schema_keyval; schema_keyval["classification"] = "Type 0"; > arrow::KeyValueMetadata schema_md(schema_keyval); schema = > schema->AddMetadata(schema_md.Copy()); > // Build arrays of data and add to Table. 
const int64_t rowgroup_size = 100; > std::vector a_data(rowgroup_size, 0); std::vector > b_data(rowgroup_size, 0); > for (int16_t i = 0; i < rowgroup_size; i++) { a_data[i] = i; b_data[i] = > rowgroup_size - i; } arrow::Int16Builder a_bldr(pool); arrow::Int16Builder > b_bldr(pool); st = a_bldr.Resize(rowgroup_size); if (!st.ok()) return 1; st = > b_bldr.Resize(rowgroup_size); if (!st.ok()) return 1; > st = a_bldr.AppendValues(a_data); if (!st.ok()) return 1; > st = b_bldr.AppendValues(b_data); if (!st.ok()) return 1; > std::shared_ptr a_arr_ptr; std::shared_ptr > b_arr_ptr; > arrow::ArrayVector arr_vec; st = a_bldr.Finish(&a_arr_ptr); if (!st.ok()) > return 1; arr_vec.push_back(a_arr_ptr); st = b_bldr.Finish(&b_arr_ptr); if > (!st.ok()) return 1; arr_vec.push_back(b_arr_ptr); > std::shared_ptr table = arrow::Table::Make(schema, arr_vec); > // Test metadata printf("\nMetadata from original schema:\n"); > printf("%s\n", schema->metadata()->ToString().c_str()); printf("%s\n", > schema->field(0)->metadata()->ToString().c_str()); printf("%s\n", > schema->field(1)->metadata()->ToString().c_str()); > std::shared_ptr table_schema = table->schema(); > printf("\nMetadata from schema retrieved from table (should be the > same):\n"); printf("%s\n", table_schema->metadata()->ToString().c_str()); > printf("%s\n", table_schema->field(0)->metadata()->ToString().c_str()); > printf("%s\n", table_schema->field(1)->metadata()->ToString().c_str()); > // Open file and write table. std::string file_name = "test.parquet"; > std::shared_ptr ostream; st = > arrow::io::FileOutputStream::Open(file_name, &ostream); if (!st.ok()) return > 1; > std::unique_ptr writer; > std::shared_ptr props = > parquet::default_writer_properties(); st = > parquet::arrow::FileWriter::Open(*schema, pool, ostream, props, &writer); if > (!st.ok()) return 1; st = writer->WriteTable(*table, rowgroup_size); if > (!st.ok()) return 1; > // Close file and stream. st = writer->Close(); if (!st.ok()) return 1; st = > ostream->Close(); if (!st.ok()) return 1; > /* Read Parquet File > **/ > // Create new memory pool. Not sure if this is necessary. > //arrow::MemoryPool* pool
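Until Spark exposes file-level key-value metadata, one way to get at it from a Spark application (my suggestion, not an API promised in this ticket) is to read the Parquet footer directly with the parquet-mr classes that already ship with Spark. A rough Scala sketch, assuming a single-file path:
{code:scala}
// Rough sketch: read Parquet footer key-value metadata with parquet-mr (file path is assumed).
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

object ReadParquetKeyValueMetadata {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val reader = ParquetFileReader.open(HadoopInputFile.from(new Path("/tmp/test.parquet"), conf))
    try {
      // Key-value pairs written by Arrow (or any other writer) live in the file footer.
      val keyValueMetadata = reader.getFooter.getFileMetaData.getKeyValueMetaData.asScala
      keyValueMetadata.foreach { case (k, v) => println(s"$k -> $v") }
    } finally {
      reader.close()
    }
  }
}
{code}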
[jira] [Resolved] (SPARK-29789) should not parse the bucket column name again when creating v2 tables
[ https://issues.apache.org/jira/browse/SPARK-29789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-29789. --- Fix Version/s: 3.0.0 Resolution: Fixed > should not parse the bucket column name again when creating v2 tables > - > > Key: SPARK-29789 > URL: https://issues.apache.org/jira/browse/SPARK-29789 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29762) GPU Scheduling - default task resource amount to 1
[ https://issues.apache.org/jira/browse/SPARK-29762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972750#comment-16972750 ] Thomas Graves commented on SPARK-29762: --- Ok, in the scenario above you would need 2 different ResourceProfiles, and they would not use the same containers. One profile would need executors with only CPUs, the other profile would need executors with CPUs and GPUs. At least that is how it would be in this initial stage-level scheduling proposal. Anything more complex could come later. It does bring up a good point that my base PR doesn't have the checks to ensure that, though, but that is separate. > GPU Scheduling - default task resource amount to 1 > -- > > Key: SPARK-29762 > URL: https://issues.apache.org/jira/browse/SPARK-29762 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Default the task level resource configs (for gpu/fpga, etc) to 1. So if the > user specifies the executor resource then to make it more user friendly lets > have the task resource config default to 1. This is ok right now since we > require resources to have an address. It also matches what we do for the > spark.task.cpus configs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29762) GPU Scheduling - default task resource amount to 1
[ https://issues.apache.org/jira/browse/SPARK-29762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972755#comment-16972755 ] Thomas Graves commented on SPARK-29762: --- But like you mention, if we were to support something like that, then the default of 1 when task requirement wasn't specified would be an issue. > GPU Scheduling - default task resource amount to 1 > -- > > Key: SPARK-29762 > URL: https://issues.apache.org/jira/browse/SPARK-29762 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Default the task level resource configs (for gpu/fpga, etc) to 1. So if the > user specifies the executor resource then to make it more user friendly lets > have the task resource config default to 1. This is ok right now since we > require resources to have an address. It also matches what we do for the > spark.task.cpus configs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29867) add __repr__ in Python ML Models
Huaxin Gao created SPARK-29867: -- Summary: add __repr__ in Python ML Models Key: SPARK-29867 URL: https://issues.apache.org/jira/browse/SPARK-29867 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 3.0.0 Reporter: Huaxin Gao In Python ML Models, some of them have __repr__, others don't. In the doctest, when calling Model.setXXX, some of the Models print out the xxxModel... correctly, some of them can't because of lacking the __repr__ method. This Jira addresses this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29867) add __repr__ in Python ML Models
[ https://issues.apache.org/jira/browse/SPARK-29867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-29867: --- Description: In Python ML Models, some of them have __repr__, others don't. In the doctest, when calling Model.setXXX, some of the Models print out the xxxModel... correctly, some of them can't because of lacking the __repr__ method. This Jira addresses this issue. (was: In Python ML Models, some of them have __repr__, others don't. In the doctest, when calling Model.setXXX, some of the Models print out the xxxModel... correctly, some of them can't because of lacking the __repr__ method. This Jira addresses this issue. ) > add __repr__ in Python ML Models > > > Key: SPARK-29867 > URL: https://issues.apache.org/jira/browse/SPARK-29867 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Minor > > In Python ML Models, some of them have __repr__, others don't. In the > doctest, when calling Model.setXXX, some of the Models print out the > xxxModel... correctly, some of them can't because of lacking the __repr__ > method. This Jira addresses this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29493) Add MapType support for Arrow Java
[ https://issues.apache.org/jira/browse/SPARK-29493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972809#comment-16972809 ] Jalpan Randeri commented on SPARK-29493: Hi Bryan, Are you working on this issue? If not i want to raise a pull request for this. > Add MapType support for Arrow Java > -- > > Key: SPARK-29493 > URL: https://issues.apache.org/jira/browse/SPARK-29493 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > This will add MapType support for Arrow in Spark ArrowConverters. This can > happen after the Arrow 0.15.0 upgrade, but MapType is not available in > pyarrow yet, so pyspark and pandas_udf will come later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow
[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reopened SPARK-25351: -- reopening, this should be straightforward to add > Handle Pandas category type when converting from Python with Arrow > -- > > Key: SPARK-25351 > URL: https://issues.apache.org/jira/browse/SPARK-25351 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Bryan Cutler >Priority: Major > Labels: bulk-closed > > There needs to be some handling of category types done when calling > {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. > Without Arrow, Spark casts each element to the category. For example > {noformat} > In [1]: import pandas as pd > In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]}) > In [3]: pdf["B"] = pdf["A"].astype('category') > In [4]: pdf > Out[4]: >A B > 0 a a > 1 b b > 2 c c > 3 a a > In [5]: pdf.dtypes > Out[5]: > A object > Bcategory > dtype: object > In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False) > In [8]: df = spark.createDataFrame(pdf) > In [9]: df.show() > +---+---+ > | A| B| > +---+---+ > | a| a| > | b| b| > | c| c| > | a| a| > +---+---+ > In [10]: df.printSchema() > root > |-- A: string (nullable = true) > |-- B: string (nullable = true) > In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True) > In [19]: df = spark.createDataFrame(pdf) >1667 spark_type = ArrayType(from_arrow_type(at.value_type)) >1668 else: > -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " > + str(at)) >1670 return spark_type >1671 > TypeError: Unsupported type in conversion from Arrow: > dictionary > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29493) Add MapType support for Arrow Java
[ https://issues.apache.org/jira/browse/SPARK-29493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972833#comment-16972833 ] Bryan Cutler commented on SPARK-29493: -- [~jalpan.randeri] this depends on SPARK-29376 for a newer version of Arrow. I was planning on doing this after because it might be a little tricky. If you want to take a look at SPARK-25351, I think that would be more straightforward to add. > Add MapType support for Arrow Java > -- > > Key: SPARK-29493 > URL: https://issues.apache.org/jira/browse/SPARK-29493 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > This will add MapType support for Arrow in Spark ArrowConverters. This can > happen after the Arrow 0.15.0 upgrade, but MapType is not available in > pyarrow yet, so pyspark and pandas_udf will come later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23485) Kubernetes should support node blacklist
[ https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-23485: - Affects Version/s: (was: 2.3.0) 3.0.0 2.4.0 > Kubernetes should support node blacklist > > > Key: SPARK-23485 > URL: https://issues.apache.org/jira/browse/SPARK-23485 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Scheduler >Affects Versions: 2.4.0, 3.0.0 >Reporter: Imran Rashid >Priority: Major > Labels: bulk-closed > > Spark's BlacklistTracker maintains a list of "bad nodes" which it will not > use for running tasks (eg., because of bad hardware). When running in yarn, > this blacklist is used to avoid ever allocating resources on blacklisted > nodes: > https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128 > I'm just beginning to poke around the kubernetes code, so apologies if this > is incorrect -- but I didn't see any references to > {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}} so it > seems this is missing. Thought of this while looking at SPARK-19755, a > similar issue on mesos. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23485) Kubernetes should support node blacklist
[ https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-23485: - Labels: (was: bulk-closed) > Kubernetes should support node blacklist > > > Key: SPARK-23485 > URL: https://issues.apache.org/jira/browse/SPARK-23485 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Scheduler >Affects Versions: 2.4.0, 3.0.0 >Reporter: Imran Rashid >Priority: Major > > Spark's BlacklistTracker maintains a list of "bad nodes" which it will not > use for running tasks (eg., because of bad hardware). When running in yarn, > this blacklist is used to avoid ever allocating resources on blacklisted > nodes: > https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128 > I'm just beginning to poke around the kubernetes code, so apologies if this > is incorrect -- but I didn't see any references to > {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}} so it > seems this is missing. Thought of this while looking at SPARK-19755, a > similar issue on mesos. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-23485) Kubernetes should support node blacklist
[ https://issues.apache.org/jira/browse/SPARK-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reopened SPARK-23485: -- I think this issue is still valid for current versions, it was just opened against 2.3.0, so reopening. > Kubernetes should support node blacklist > > > Key: SPARK-23485 > URL: https://issues.apache.org/jira/browse/SPARK-23485 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Scheduler >Affects Versions: 2.4.0, 3.0.0 >Reporter: Imran Rashid >Priority: Major > Labels: bulk-closed > > Spark's BlacklistTracker maintains a list of "bad nodes" which it will not > use for running tasks (eg., because of bad hardware). When running in yarn, > this blacklist is used to avoid ever allocating resources on blacklisted > nodes: > https://github.com/apache/spark/blob/e836c27ce011ca9aef822bef6320b4a7059ec343/resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnSchedulerBackend.scala#L128 > I'm just beginning to poke around the kubernetes code, so apologies if this > is incorrect -- but I didn't see any references to > {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}} so it > seems this is missing. Thought of this while looking at SPARK-19755, a > similar issue on mesos. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29493) Add MapType support for Arrow Java
[ https://issues.apache.org/jira/browse/SPARK-29493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972898#comment-16972898 ] Jalpan Randeri commented on SPARK-29493: Sure. i will include python changes also in my pr. > Add MapType support for Arrow Java > -- > > Key: SPARK-29493 > URL: https://issues.apache.org/jira/browse/SPARK-29493 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > This will add MapType support for Arrow in Spark ArrowConverters. This can > happen after the Arrow 0.15.0 upgrade, but MapType is not available in > pyarrow yet, so pyspark and pandas_udf will come later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29844) Improper unpersist strategy in ml.recommendation.ASL.train
[ https://issues.apache.org/jira/browse/SPARK-29844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29844. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26469 [https://github.com/apache/spark/pull/26469] > Improper unpersist strategy in ml.recommendation.ASL.train > -- > > Key: SPARK-29844 > URL: https://issues.apache.org/jira/browse/SPARK-29844 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.3 >Reporter: Dong Wang >Assignee: Dong Wang >Priority: Minor > Fix For: 3.0.0 > > > In ml.recommendation.ASL.train(), there are many intermediate RDDs. At the > end of the method, these RDDs invoke unpersist(), but the timing of the > unpersist calls is not right, which causes recomputation and memory waste. > {code:scala} > val userIdAndFactors = userInBlocks > .mapValues(_.srcIds) > .join(userFactors) > .mapPartitions({ items => > items.flatMap { case (_, (ids, factors)) => > ids.view.zip(factors) > } > // Preserve the partitioning because IDs are consistent with the > partitioners in userInBlocks > // and userFactors. > }, preservesPartitioning = true) > .setName("userFactors") > .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix > val itemIdAndFactors = itemInBlocks > .mapValues(_.srcIds) > .join(itemFactors) > .mapPartitions({ items => > items.flatMap { case (_, (ids, factors)) => > ids.view.zip(factors) > } > }, preservesPartitioning = true) > .setName("itemFactors") > .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix > if (finalRDDStorageLevel != StorageLevel.NONE) { > userIdAndFactors.count() > itemFactors.unpersist() // Premature unpersist > itemIdAndFactors.count() > userInBlocks.unpersist() // Lagging unpersist > userOutBlocks.unpersist() // Lagging unpersist > itemInBlocks.unpersist() > itemOutBlocks.unpersist() // Lagging unpersist > blockRatings.unpersist() // Lagging unpersist > } > (userIdAndFactors, itemIdAndFactors) > } > {code} > 1. itemFactors is unpersisted too early: itemIdAndFactors.count() uses > itemFactors, so itemFactors will be recomputed. > 2. userInBlocks, userOutBlocks, itemOutBlocks, and blockRatings are unpersisted too > late: the final action, itemIdAndFactors.count(), does not use these RDDs, so > they can be unpersisted before it to save memory. > By the way, itemIdAndFactors is persisted here but will never be unpersisted > until the application ends. It may hurt performance, but I think it's > hard to fix. > This issue is reported by our tool CacheCheck, which dynamically > detects persist()/unpersist() API misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
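As a rough sketch of the ordering the report suggests (illustrative only; the variable names come from the reproducer above and the actual merged patch in the linked PR may differ), the final block would release the RDDs not needed by the remaining action early, and drop itemFactors only after itemIdAndFactors has been materialized:
{code:scala}
// Sketch of the suggested ordering, not the merged change.
if (finalRDDStorageLevel != StorageLevel.NONE) {
  userIdAndFactors.count()
  // These RDDs are not needed by the remaining action, so release them early to save memory.
  userInBlocks.unpersist()
  userOutBlocks.unpersist()
  itemOutBlocks.unpersist()
  blockRatings.unpersist()

  itemIdAndFactors.count()
  // itemFactors and itemInBlocks feed itemIdAndFactors, so drop them only after it is materialized.
  itemFactors.unpersist()
  itemInBlocks.unpersist()
}
{code}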
[jira] [Assigned] (SPARK-29844) Improper unpersist strategy in ml.recommendation.ALS.train
[ https://issues.apache.org/jira/browse/SPARK-29844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29844: Assignee: Dong Wang > Improper unpersist strategy in ml.recommendation.ALS.train > -- > > Key: SPARK-29844 > URL: https://issues.apache.org/jira/browse/SPARK-29844 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.3 >Reporter: Dong Wang >Assignee: Dong Wang >Priority: Minor > > In ml.recommendation.ALS.train(), there are many intermediate RDDs. At the > end of the method, unpersist() is invoked on these RDDs, but the timing of the > unpersist calls is not right, which causes recomputation and wastes memory. > {code:scala} > val userIdAndFactors = userInBlocks > .mapValues(_.srcIds) > .join(userFactors) > .mapPartitions({ items => > items.flatMap { case (_, (ids, factors)) => > ids.view.zip(factors) > } > // Preserve the partitioning because IDs are consistent with the > partitioners in userInBlocks > // and userFactors. > }, preservesPartitioning = true) > .setName("userFactors") > .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix > val itemIdAndFactors = itemInBlocks > .mapValues(_.srcIds) > .join(itemFactors) > .mapPartitions({ items => > items.flatMap { case (_, (ids, factors)) => > ids.view.zip(factors) > } > }, preservesPartitioning = true) > .setName("itemFactors") > .persist(finalRDDStorageLevel) // Missing unpersist, but hard to fix > if (finalRDDStorageLevel != StorageLevel.NONE) { > userIdAndFactors.count() > itemFactors.unpersist() // Premature unpersist > itemIdAndFactors.count() > userInBlocks.unpersist() // Lagging unpersist > userOutBlocks.unpersist() // Lagging unpersist > itemInBlocks.unpersist() > itemOutBlocks.unpersist() // Lagging unpersist > blockRatings.unpersist() // Lagging unpersist > } > (userIdAndFactors, itemIdAndFactors) > } > {code} > 1. itemFactors is unpersisted too early: itemIdAndFactors.count() still uses > itemFactors, so itemFactors will be recomputed. > 2. userInBlocks, userOutBlocks, itemOutBlocks, and blockRatings are unpersisted too > late: the final action, itemIdAndFactors.count(), does not use these RDDs, so > they can be unpersisted before it to save memory. > In addition, itemIdAndFactors is persisted here but is never unpersisted > until the application ends. This may hurt performance, but it is hard to fix. > This issue is reported by our tool CacheCheck, which dynamically > detects persist()/unpersist() API misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
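For reference, a minimal sketch of the reordering described in points 1 and 2 of the ticket, reusing the variable names from the quoted snippet. It only illustrates the intended unpersist order and is not necessarily the exact code merged in pull request 26469:

{code:scala}
if (finalRDDStorageLevel != StorageLevel.NONE) {
  userIdAndFactors.count()     // materializes userIdAndFactors (reads userInBlocks and userFactors)
  // These cached RDDs are not needed by the remaining action, so release them now.
  userInBlocks.unpersist()
  userOutBlocks.unpersist()
  itemOutBlocks.unpersist()
  blockRatings.unpersist()
  itemIdAndFactors.count()     // materializes itemIdAndFactors (still reads itemInBlocks and itemFactors)
  itemFactors.unpersist()      // only unpersist after the count that depends on it
  itemInBlocks.unpersist()
}
{code}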
[jira] [Assigned] (SPARK-29570) Improve tooltip for Executor Tab for Shuffle Write,Blacklisted,Logs,Threaddump columns
[ https://issues.apache.org/jira/browse/SPARK-29570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29570: Assignee: Ankit Raj Boudh > Improve tooltip for Executor Tab for Shuffle > Write,Blacklisted,Logs,Threaddump columns > -- > > Key: SPARK-29570 > URL: https://issues.apache.org/jira/browse/SPARK-29570 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Ankit Raj Boudh >Priority: Minor > > When User move mouse over the columns under Executors Shuffle > Write,Blacklisted,Logs,Threaddump columns, tooltip not display at center. > Check the other columns it display at center. > Please fix this issue in all Spark WEB UI page and History UI Page. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29570) Improve tooltip for Executor Tab for Shuffle Write,Blacklisted,Logs,Threaddump columns
[ https://issues.apache.org/jira/browse/SPARK-29570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29570. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26263 [https://github.com/apache/spark/pull/26263] > Improve tooltip for Executor Tab for Shuffle > Write,Blacklisted,Logs,Threaddump columns > -- > > Key: SPARK-29570 > URL: https://issues.apache.org/jira/browse/SPARK-29570 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Ankit Raj Boudh >Priority: Minor > Fix For: 3.0.0 > > > When User move mouse over the columns under Executors Shuffle > Write,Blacklisted,Logs,Threaddump columns, tooltip not display at center. > Check the other columns it display at center. > Please fix this issue in all Spark WEB UI page and History UI Page. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29399) Remove old ExecutorPlugin API (or wrap it using new API)
[ https://issues.apache.org/jira/browse/SPARK-29399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-29399: Assignee: Marcelo Masiero Vanzin > Remove old ExecutorPlugin API (or wrap it using new API) > > > Key: SPARK-29399 > URL: https://issues.apache.org/jira/browse/SPARK-29399 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Major > > The parent bug is proposing a new plugin API for Spark, so we could remove > the old one since it's a developer API. > If we can get the new API into Spark 3.0, then removal might be a better idea > than deprecation. That would be my preference since then we can remove the > new elements added as part of SPARK-28091 without having to deprecate them > first. > If it doesn't make it into 3.0, then we should deprecate the APIs from 3.0 > and manage old plugins through the new code being added. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29396) Extend Spark plugin interface to driver
[ https://issues.apache.org/jira/browse/SPARK-29396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29396. -- Fix Version/s: 3.0.0 Assignee: Marcelo Masiero Vanzin Resolution: Done > Extend Spark plugin interface to driver > --- > > Key: SPARK-29396 > URL: https://issues.apache.org/jira/browse/SPARK-29396 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Spark provides an extension API for people to implement executor plugins, > added in SPARK-24918 and later extended in SPARK-28091. > That API does not offer any functionality for doing similar things on the > driver side, though. As a consequence of that, there is not a good way for > the executor plugins to get information or communicate in any way with the > Spark driver. > I've been playing with such an improved API for developing some new > functionality. I'll file a few child bugs for the work to get the changes in. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
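As a rough illustration of the driver-side extension described above, the sketch below wires a driver and an executor component through org.apache.spark.api.plugin.SparkPlugin. The class name, config key, and log message are hypothetical; the actual work tracked by this ticket and its children may differ in detail:

{code:scala}
import java.util.{Collections, Map => JMap}

import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

// Hypothetical plugin, enabled with --conf spark.plugins=com.example.MyPlugin
class MyPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    override def init(sc: SparkContext, ctx: PluginContext): JMap[String, String] = {
      // Runs on the driver; the returned map is handed to the executor-side init().
      Collections.singletonMap("my.plugin.appId", sc.applicationId)
    }
  }

  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
    override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit = {
      // Runs once per executor, with the extra configuration produced by the driver side.
      println(s"executor plugin started for app ${extraConf.get("my.plugin.appId")}")
    }
  }
}
{code}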
[jira] [Resolved] (SPARK-29399) Remove old ExecutorPlugin API (or wrap it using new API)
[ https://issues.apache.org/jira/browse/SPARK-29399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29399. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26390 [https://github.com/apache/spark/pull/26390] > Remove old ExecutorPlugin API (or wrap it using new API) > > > Key: SPARK-29399 > URL: https://issues.apache.org/jira/browse/SPARK-29399 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Major > Fix For: 3.0.0 > > > The parent bug is proposing a new plugin API for Spark, so we could remove > the old one since it's a developer API. > If we can get the new API into Spark 3.0, then removal might be a better idea > than deprecation. That would be my preference since then we can remove the > new elements added as part of SPARK-28091 without having to deprecate them > first. > If it doesn't make it into 3.0, then we should deprecate the APIs from 3.0 > and manage old plugins through the new code being added. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29868) Add a benchmark for Adaptive Execution
Yuming Wang created SPARK-29868: --- Summary: Add a benchmark for Adaptive Execution Key: SPARK-29868 URL: https://issues.apache.org/jira/browse/SPARK-29868 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Yuming Wang Add a benchmark for Adaptive Execution to evaluate SortMergeJoin to BroadcastJoin performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29868) Add a benchmark for Adaptive Execution
[ https://issues.apache.org/jira/browse/SPARK-29868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29868: Attachment: BroadcastJoin.jpeg SortMergeJoin.jpeg > Add a benchmark for Adaptive Execution > -- > > Key: SPARK-29868 > URL: https://issues.apache.org/jira/browse/SPARK-29868 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: BroadcastJoin.jpeg, SortMergeJoin.jpeg > > > Add a benchmark for Adaptive Execution to evaluate SortMergeJoin to > BroadcastJoin performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29868) Add a benchmark for Adaptive Execution
[ https://issues.apache.org/jira/browse/SPARK-29868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29868: Description: Add a benchmark for Adaptive Execution to evaluate SortMergeJoin to BroadcastJoin performance. It seem SortMergeJoin faster than BroadcastJoin if one side is a bucketed table and can convert to BroadcastJoin. was: Add a benchmark for Adaptive Execution to evaluate SortMergeJoin to BroadcastJoin performance. > Add a benchmark for Adaptive Execution > -- > > Key: SPARK-29868 > URL: https://issues.apache.org/jira/browse/SPARK-29868 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: BroadcastJoin.jpeg, SortMergeJoin.jpeg > > > Add a benchmark for Adaptive Execution to evaluate SortMergeJoin to > BroadcastJoin performance. > It seem SortMergeJoin faster than BroadcastJoin if one side is a bucketed > table and can convert to BroadcastJoin. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29868) Add a benchmark for Adaptive Execution
[ https://issues.apache.org/jira/browse/SPARK-29868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972967#comment-16972967 ] Yuming Wang commented on SPARK-29868: - Examples: [https://github.com/apache/spark/tree/cdea520ff8954cf415fd98d034d9b674d6ca4f67/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark] > Add a benchmark for Adaptive Execution > -- > > Key: SPARK-29868 > URL: https://issues.apache.org/jira/browse/SPARK-29868 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: BroadcastJoin.jpeg, SortMergeJoin.jpeg > > > Add a benchmark for Adaptive Execution to evaluate SortMergeJoin to > BroadcastJoin performance. > It seem SortMergeJoin faster than BroadcastJoin if one side is a bucketed > table and can convert to BroadcastJoin. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
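A rough sketch of the kind of comparison such a benchmark could make, toggling spark.sql.autoBroadcastJoinThreshold so the same query runs once as a sort-merge join and once as a broadcast join. The table names and sizes are made up for illustration; this is not the benchmark added by this ticket:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SMJ vs BHJ sketch").getOrCreate()

// Hypothetical tables: a large bucketed fact table and a small dimension table.
spark.range(10000000L).selectExpr("id", "id % 1000 AS key")
  .write.bucketBy(200, "key").sortBy("key").saveAsTable("big_bucketed")
spark.range(1000L).selectExpr("id AS key", "id AS value")
  .write.saveAsTable("small_dim")

def time(label: String)(f: => Unit): Unit = {
  val start = System.nanoTime()
  f
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
}

// Disable broadcast joins: the query is planned as a sort-merge join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
time("SortMergeJoin") {
  spark.table("big_bucketed").join(spark.table("small_dim"), "key").count()
}

// Allow broadcast joins: the small side is broadcast instead.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")
time("BroadcastJoin") {
  spark.table("big_bucketed").join(spark.table("small_dim"), "key").count()
}
{code}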
[jira] [Created] (SPARK-29869) HiveMetastoreCatalog#convertToLogicalRelation throws AssertionError
Yuming Wang created SPARK-29869: --- Summary: HiveMetastoreCatalog#convertToLogicalRelation throws AssertionError Key: SPARK-29869 URL: https://issues.apache.org/jira/browse/SPARK-29869 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang {noformat} scala> spark.table("table").show java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:208) at org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:261) at org.apache.spark.sql.hive.HiveMetastoreCatalog.convert(HiveMetastoreCatalog.scala:137) at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:220) at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:207) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29) at org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:207) at org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:191) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:130) at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68) at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:49) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:127) at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:119) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:119) at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:168) at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:162) at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:122) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:98) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88) at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:98) at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:146) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201) at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:145) at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:66) at org.apache.spark.sql.ca
[jira] [Updated] (SPARK-29869) HiveMetastoreCatalog#convertToLogicalRelation throws AssertionError
[ https://issues.apache.org/jira/browse/SPARK-29869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29869: Attachment: SPARK-29869.jpg > HiveMetastoreCatalog#convertToLogicalRelation throws AssertionError > --- > > Key: SPARK-29869 > URL: https://issues.apache.org/jira/browse/SPARK-29869 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: SPARK-29869.jpg > > > {noformat} > scala> spark.table("table").show > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:208) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:261) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog.convert(HiveMetastoreCatalog.scala:137) > at > org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:220) > at > org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:207) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29) > at > org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:207) > at > org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:191) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:130) > at > 
scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60) > at > scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68) > at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:49) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:127) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:119) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:119) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:168) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:162) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:122) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:98) > at > org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAn
[jira] [Commented] (SPARK-29868) Add a benchmark for Adaptive Execution
[ https://issues.apache.org/jira/browse/SPARK-29868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972974#comment-16972974 ] leonard ding commented on SPARK-29868: -- I will pick it > Add a benchmark for Adaptive Execution > -- > > Key: SPARK-29868 > URL: https://issues.apache.org/jira/browse/SPARK-29868 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: BroadcastJoin.jpeg, SortMergeJoin.jpeg > > > Add a benchmark for Adaptive Execution to evaluate SortMergeJoin to > BroadcastJoin performance. > It seem SortMergeJoin faster than BroadcastJoin if one side is a bucketed > table and can convert to BroadcastJoin. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29869) HiveMetastoreCatalog#convertToLogicalRelation throws AssertionError
[ https://issues.apache.org/jira/browse/SPARK-29869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29869: Description: {noformat} scala> spark.table("hive_table").show java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:208) at org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:261) at org.apache.spark.sql.hive.HiveMetastoreCatalog.convert(HiveMetastoreCatalog.scala:137) at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:220) at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:207) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29) at org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:207) at org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:191) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:130) at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68) at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:49) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:127) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:119) at scala.collection.immutable.List.foreach(List.scala:392) at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:119) at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:168) at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:162) at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:122) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:98) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88) at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:98) at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:146) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201) at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:145) at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:66) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:63) at org.apache.spark
[jira] [Commented] (SPARK-29855) typed literals with negative sign with proper result or exception
[ https://issues.apache.org/jira/browse/SPARK-29855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972983#comment-16972983 ] Kent Yao commented on SPARK-29855: -- This is the related ticket that involves the negative sign before typed literals. > typed literals with negative sign with proper result or exception > - > > Key: SPARK-29855 > URL: https://issues.apache.org/jira/browse/SPARK-29855 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > {code:java} > -- !query 83 > select -integer '7' > -- !query 83 schema > struct<7:int> > -- !query 83 output > 7 > -- !query 86 > select -date '1999-01-01' > -- !query 86 schema > struct > -- !query 86 output > 1999-01-01 > -- !query 87 > select -timestamp '1999-01-01' > -- !query 87 schema > struct > -- !query 87 output > 1999-01-01 00:00:00 > {code} > the integer result should be -7, and the date and timestamp results are confusing; > those queries should throw an exception instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29870) Unify the logic of multi-units interval string to CalendarInterval
Kent Yao created SPARK-29870: Summary: Unify the logic of multi-units interval string to CalendarInterval Key: SPARK-29870 URL: https://issues.apache.org/jira/browse/SPARK-29870 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao We now have two different implementations for converting multi-unit interval strings to CalendarInterval type values. One is used to convert interval string literals to CalendarInterval. This approach re-delegates the interval string to the Spark parser, which handles the string as `singleInterval` -> `multiUnitsInterval` and eventually calls `IntervalUtils.fromUnitStrings`. The other is used in `Cast`, which eventually calls `IntervalUtils.stringToInterval`. This approach is ~10 times faster than the other. We should unify these two for better performance and simpler logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29870) Unify the logic of multi-units interval string to CalendarInterval
[ https://issues.apache.org/jira/browse/SPARK-29870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972992#comment-16972992 ] Kent Yao commented on SPARK-29870: -- working on this > Unify the logic of multi-units interval string to CalendarInterval > -- > > Key: SPARK-29870 > URL: https://issues.apache.org/jira/browse/SPARK-29870 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Major > > We now have two different implementation for multi-units interval strings to > CalendarInterval type values. > One is used to covert interval string literals to CalendarInterval. This > approach will re-delegate the interval string to spark parser which handles > the string as a `singleInterval` -> `multiUnitsInterval` -> eventually call > `IntervalUtils.fromUnitStrings` > The other is used in `Cast`, which eventually calls > `IntervalUtils.stringToInterval`. This approach is ~10 times faster than the > other. > We should unify these two for better performance and simple logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
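For illustration, the two code paths described in the ticket can be exercised from the SQL side as follows (spark-shell style); both are expected to produce the same CalendarInterval value, just through different parsing routes:

{code:scala}
// Parsed as an interval literal: goes through the SQL parser's multiUnitsInterval rule
// and eventually IntervalUtils.fromUnitStrings.
spark.sql("SELECT INTERVAL '1 year 2 months 3 days 4 hours'").show()

// Cast from a string: goes through Cast and IntervalUtils.stringToInterval.
spark.sql("SELECT CAST('1 year 2 months 3 days 4 hours' AS INTERVAL)").show()
{code}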
[jira] [Updated] (SPARK-29462) The data type of "array()" should be array
[ https://issues.apache.org/jira/browse/SPARK-29462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29462: - Fix Version/s: (was: 3.0.0) > The data type of "array()" should be array > > > Key: SPARK-29462 > URL: https://issues.apache.org/jira/browse/SPARK-29462 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Minor > > In the current implmentation: > > spark.sql("select array()") > res0: org.apache.spark.sql.DataFrame = [array(): array] > The output type should be array -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29462) The data type of "array()" should be array
[ https://issues.apache.org/jira/browse/SPARK-29462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973000#comment-16973000 ] Hyukjin Kwon commented on SPARK-29462: -- Reverted due to the reasons https://github.com/apache/spark/pull/26324#issuecomment-553230428 > The data type of "array()" should be array > > > Key: SPARK-29462 > URL: https://issues.apache.org/jira/browse/SPARK-29462 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Minor > > In the current implmentation: > > spark.sql("select array()") > res0: org.apache.spark.sql.DataFrame = [array(): array] > The output type should be array -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-29462) The data type of "array()" should be array
[ https://issues.apache.org/jira/browse/SPARK-29462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-29462: -- Assignee: (was: Aman Omer) > The data type of "array()" should be array > > > Key: SPARK-29462 > URL: https://issues.apache.org/jira/browse/SPARK-29462 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Minor > Fix For: 3.0.0 > > > In the current implmentation: > > spark.sql("select array()") > res0: org.apache.spark.sql.DataFrame = [array(): array] > The output type should be array -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value
[ https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973007#comment-16973007 ] ABHISHEK KUMAR GUPTA commented on SPARK-29854: -- [~hyukjin.kwon]: Hi, I checked: Hive and PostgreSQL give an exception, while MySQL gives NULL. mysql> SELECT lpad('hihhh', 5000, ''); +-+ | lpad('hihhh', 5000, '') | +-+ | NULL| +-+ 1 row in set, 3 warnings (0.00 sec) mysql> SELECT rpad('hihhh', 5000, ''); +-+ | rpad('hihhh', 5000, '') | +-+ | NULL| +-+ 1 row in set, 3 warnings (0.00 sec) > lpad and rpad built in function not throw Exception for invalid len value > - > > Key: SPARK-29854 > URL: https://issues.apache.org/jira/browse/SPARK-29854 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > Spark returns an empty string: > {code} > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT > lpad('hihhh', 5000, ''); > ++ > |lpad(hihhh, CAST(5000 AS INT), > )| > ++ > ++ > Hive: > SELECT lpad('hihhh', 5000, > ''); > Error: Error while compiling statement: FAILED: SemanticException [Error > 10016]: Line 1:67 Argument type mismatch '''': lpad only takes > INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016) > PostgreSQL > function lpad(unknown, numeric, unknown) does not exist > > Expected output: > In Spark it should also throw an Exception, like Hive. > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow
[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973023#comment-16973023 ] Jalpan Randeri commented on SPARK-25351: Yes, I will take up this changes as part of my PR > Handle Pandas category type when converting from Python with Arrow > -- > > Key: SPARK-25351 > URL: https://issues.apache.org/jira/browse/SPARK-25351 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Bryan Cutler >Priority: Major > Labels: bulk-closed > > There needs to be some handling of category types done when calling > {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. > Without Arrow, Spark casts each element to the category. For example > {noformat} > In [1]: import pandas as pd > In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]}) > In [3]: pdf["B"] = pdf["A"].astype('category') > In [4]: pdf > Out[4]: >A B > 0 a a > 1 b b > 2 c c > 3 a a > In [5]: pdf.dtypes > Out[5]: > A object > Bcategory > dtype: object > In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False) > In [8]: df = spark.createDataFrame(pdf) > In [9]: df.show() > +---+---+ > | A| B| > +---+---+ > | a| a| > | b| b| > | c| c| > | a| a| > +---+---+ > In [10]: df.printSchema() > root > |-- A: string (nullable = true) > |-- B: string (nullable = true) > In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True) > In [19]: df = spark.createDataFrame(pdf) >1667 spark_type = ArrayType(from_arrow_type(at.value_type)) >1668 else: > -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " > + str(at)) >1670 return spark_type >1671 > TypeError: Unsupported type in conversion from Arrow: > dictionary > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29390) Add the justify_days(), justify_hours() and justify_interval() functions
[ https://issues.apache.org/jira/browse/SPARK-29390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-29390. -- Fix Version/s: 3.0.0 Assignee: Kent Yao Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/26465|https://github.com/apache/spark/pull/26465#] > Add the justify_days(), justify_hours() and justify_interval() functions > --- > > Key: SPARK-29390 > URL: https://issues.apache.org/jira/browse/SPARK-29390 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > See *Table 9.31. Date/Time Functions* > ([https://www.postgresql.org/docs/12/functions-datetime.html)] > |{{justify_days(}}{{interval}}{{)}}|{{interval}}|Adjust interval so 30-day > time periods are represented as months|{{justify_days(interval '35 > days')}}|{{1 mon 5 days}}| > | {{justify_hours(}}{{interval}}{{)}}|{{interval}}|Adjust interval so 24-hour > time periods are represented as days|{{justify_hours(interval '27 > hours')}}|{{1 day 03:00:00}}| > | {{justify_interval(}}{{interval}}{{)}}|{{interval}}|Adjust interval using > {{justify_days}} and {{justify_hours}}, with additional sign > adjustments|{{justify_interval(interval '1 mon -1 hour')}}|{{29 days > 23:00:00}}| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29871) Flaky test: ImageFileFormatTest.test_read_images
wuyi created SPARK-29871: Summary: Flaky test: ImageFileFormatTest.test_read_images Key: SPARK-29871 URL: https://issues.apache.org/jira/browse/SPARK-29871 Project: Spark Issue Type: Test Components: ML Affects Versions: 3.0.0 Reporter: wuyi Running tests... -- test_read_images (pyspark.ml.tests.test_image.ImageFileFormatTest) ... ERROR (12.050s) == ERROR [12.050s]: test_read_images (pyspark.ml.tests.test_image.ImageFileFormatTest) -- Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests/test_image.py", line 35, in test_read_images self.assertEqual(df.count(), 4) File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/dataframe.py", line 507, in count return int(self._jdf.count()) File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py", line 98, in deco return f(*a, **kw) File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value format(target_id, ".", name), value) py4j.protocol.Py4JJavaError: An error occurred while calling o32.count. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, amp-jenkins-worker-05.amp, executor driver): javax.imageio.IIOException: Unsupported Image Type at com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(JPEGImageReader.java:1079) at com.sun.imageio.plugins.jpeg.JPEGImageReader.read(JPEGImageReader.java:1050) at javax.imageio.ImageIO.read(ImageIO.java:1448) at javax.imageio.ImageIO.read(ImageIO.java:1352) at org.apache.spark.ml.image.ImageSchema$.decode(ImageSchema.scala:134) at org.apache.spark.ml.source.image.ImageFileFormat.$anonfun$buildReader$2(ImageFileFormat.scala:84) at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:147) at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:132) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(generated.java:33) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:63) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at 
org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1979) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1967) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1966) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGS
[jira] [Created] (SPARK-29872) Improper cache strategy in examples
Dong Wang created SPARK-29872: - Summary: Improper cache strategy in examples Key: SPARK-29872 URL: https://issues.apache.org/jira/browse/SPARK-29872 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 3.0.0 Reporter: Dong Wang 1. Improper cache in examples.SparkTC The RDD edges should be cached because it is used multiple times in the while loop, and it should be unpersisted before the last action tc.count(), because by then tc has already been persisted. On the other hand, many tc objects are cached in the while loop but never uncached, which wastes memory. {code:scala} val edges = tc.map(x => (x._2, x._1)) // Edges should be cached // This join is iterated until a fixed point is reached. var oldCount = 0L var nextCount = tc.count() do { oldCount = nextCount // Perform the join, obtaining an RDD of (y, (z, x)) pairs, // then project the result to obtain the new (x, z) paths. tc = tc.union(tc.join(edges).map(x => (x._2._2, x._2._1))).distinct().cache() nextCount = tc.count() } while (nextCount != oldCount) println(s"TC has ${tc.count()} edges.") {code} 2. Cache needed in examples.ml.LogisticRegressionSummary The DataFrame fMeasure should be cached. {code:scala} // Set the model threshold to maximize F-Measure val fMeasure = trainingSummary.fMeasureByThreshold // fMeasure should be cached val maxFMeasure = fMeasure.select(max("F-Measure")).head().getDouble(0) val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure) .select("threshold").head().getDouble(0) lrModel.setThreshold(bestThreshold) {code} 3. Cache needed in examples.sql.SparkSQLExample {code:scala} val peopleDF = spark.sparkContext .textFile("examples/src/main/resources/people.txt") .map(_.split(",")) .map(attributes => Person(attributes(0), attributes(1).trim.toInt)) // This RDD should be cached .toDF() // Register the DataFrame as a temporary view peopleDF.createOrReplaceTempView("people") val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19") teenagersDF.map(teenager => "Name: " + teenager(0)).show() teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show() implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]] teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect() {code} This issue is reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
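A minimal sketch of the caching pattern suggested for examples.SparkTC in point 1 above: keep edges cached across iterations and release the previous tc once the new one has been materialized. It reuses the names from the quoted snippet and is not the exact patch:

{code:scala}
val edges = tc.map(x => (x._2, x._1)).cache()   // reused in every iteration of the loop

var oldCount = 0L
var nextCount = tc.count()
do {
  oldCount = nextCount
  val previous = tc
  tc = tc.union(tc.join(edges).map(x => (x._2._2, x._2._1))).distinct().cache()
  nextCount = tc.count()     // materializes the new tc
  previous.unpersist()       // the previous iteration's tc is no longer needed
} while (nextCount != oldCount)

edges.unpersist()            // the final count below only needs the cached tc
println(s"TC has ${tc.count()} edges.")
{code}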
[jira] [Updated] (SPARK-29861) Reduce downtime in Spark standalone HA master switch
[ https://issues.apache.org/jira/browse/SPARK-29861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Wolters updated SPARK-29861: -- Summary: Reduce downtime in Spark standalone HA master switch (was: Reduce leader election downtime in Spark standalone HA) > Reduce downtime in Spark standalone HA master switch > > > Key: SPARK-29861 > URL: https://issues.apache.org/jira/browse/SPARK-29861 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Robin Wolters >Priority: Minor > > As officially stated in the spark [HA > documention|https://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper], > the recovery process of Spark (standalone) master in HA with zookeeper takes > about 1-2 minutes. During this time no spark master is active, which makes > interaction with spark essentially impossible. > After looking for a way to reduce this downtime, it seems that this is mainly > caused by the leader election, which waits for open zookeeper connections to > be closed. This seems like an unnecessary downtime for example in case of a > planned VM update. > I have fixed this in my setup by: > # Closing open zookeeper connections during spark shutdown > # Bumping the curator version and implementing a custom error policy that is > tolerant to a zookeeper connection suspension. > I am preparing a pull request for review / further discussion on this issue. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
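As an illustration of point 2 in the description, the sketch below shows what an error policy tolerant to connection suspension could look like; it requires a Curator version that exposes the connectionStateErrorPolicy builder hook, and the class name and connection string are placeholders, not the actual patch under review. Newer Curator releases also ship a built-in SessionConnectionStateErrorPolicy with essentially this behavior.

{code:scala}
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.state.{ConnectionState, ConnectionStateErrorPolicy}
import org.apache.curator.retry.ExponentialBackoffRetry

// Treat only a lost session as an error; a transient SUSPENDED state does not
// drop leadership, which avoids an unnecessary master re-election.
class SuspensionTolerantErrorPolicy extends ConnectionStateErrorPolicy {
  override def isErrorState(state: ConnectionState): Boolean =
    state == ConnectionState.LOST
}

val client = CuratorFrameworkFactory.builder()
  .connectString("zk1:2181,zk2:2181,zk3:2181")   // placeholder ZooKeeper quorum
  .retryPolicy(new ExponentialBackoffRetry(1000, 3))
  .connectionStateErrorPolicy(new SuspensionTolerantErrorPolicy)
  .build()
client.start()
{code}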