[jira] [Commented] (SPARK-30196) Bump lz4-java version to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-30196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010438#comment-17010438 ] Hyukjin Kwon commented on SPARK-30196: -- (y) > Bump lz4-java version to 1.7.0 > -- > > Key: SPARK-30196 > URL: https://issues.apache.org/jira/browse/SPARK-30196 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0
[jira] [Resolved] (SPARK-30425) FileScan of Data Source V2 doesn't implement Partition Pruning
[ https://issues.apache.org/jira/browse/SPARK-30425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30425. -- Resolution: Duplicate > FileScan of Data Source V2 doesn't implement Partition Pruning > -- > > Key: SPARK-30425 > URL: https://issues.apache.org/jira/browse/SPARK-30425 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Haifeng Chen >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > I was trying to understand how Data Source V2 handles partition pruning, but I didn't find the code anywhere that filters out the unnecessary files in the current Data Source V2 implementation. For a file data source, the base class FileScan of Data Source V2 should probably handle this in the "partitions" method. But the current implementation is like the following: > protected def partitions: Seq[FilePartition] = { > val selectedPartitions = fileIndex.listFiles(Seq.empty, Seq.empty) > > listFiles is passed empty sequences, so no files will be filtered by the partition filter.
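For context, here is a minimal sketch of how the quoted partitions method could thread pushed-down predicates through to the file index instead of Seq.empty. The names FileIndexLike, partitionFilters and dataFilters are illustrative assumptions, not Spark's actual API for the fix:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.Expression

// Sketch only: keep the pushed-down predicates on the scan and hand them to
// the file index, so whole partition directories can be skipped up front.
trait FileIndexLike {
  def listFiles(partitionFilters: Seq[Expression],
                dataFilters: Seq[Expression]): Seq[String]
}

trait PrunedFileScan {
  def fileIndex: FileIndexLike
  def partitionFilters: Seq[Expression] // pushed-down partition predicates
  def dataFilters: Seq[Expression]      // pushed-down data predicates

  // Unlike the quoted implementation, the real filters reach listFiles here.
  protected def partitions: Seq[String] =
    fileIndex.listFiles(partitionFilters, dataFilters)
}
{code}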
[jira] [Commented] (SPARK-30455) Select All should unselect after un-selecting any selected item from list.
[ https://issues.apache.org/jira/browse/SPARK-30455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010424#comment-17010424 ] Ankit Raj Boudh commented on SPARK-30455: - i will raise for this. > Select All should unselect after un-selecting any selected item from list. > -- > > Key: SPARK-30455 > URL: https://issues.apache.org/jira/browse/SPARK-30455 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4 >Reporter: Ankit Raj Boudh >Priority: Minor
[jira] [Comment Edited] (SPARK-30455) Select All should unselect after un-selecting any selected item from list.
[ https://issues.apache.org/jira/browse/SPARK-30455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010424#comment-17010424 ] Ankit Raj Boudh edited comment on SPARK-30455 at 1/8/20 7:06 AM: - i will raise pr for this. was (Author: ankitraj): i will raise for this. > Select All should unselect after un-selecting any selected item from list. > -- > > Key: SPARK-30455 > URL: https://issues.apache.org/jira/browse/SPARK-30455 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.4 >Reporter: Ankit Raj Boudh >Priority: Minor
[jira] [Created] (SPARK-30455) Select All should unselect after un-selecting any selected item from list.
Ankit Raj Boudh created SPARK-30455: --- Summary: Select All should unselect after un-selecting any selected item from list. Key: SPARK-30455 URL: https://issues.apache.org/jira/browse/SPARK-30455 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.4.4 Reporter: Ankit Raj Boudh
[jira] [Commented] (SPARK-30425) FileScan of Data Source V2 doesn't implement Partition Pruning
[ https://issues.apache.org/jira/browse/SPARK-30425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010414#comment-17010414 ] Gengliang Wang commented on SPARK-30425: [~sandeep.katta2007] Yes [~jerrychenhf] Thanks for reporting the issue! Do you mind if I close this one? > FileScan of Data Source V2 doesn't implement Partition Pruning > -- > > Key: SPARK-30425 > URL: https://issues.apache.org/jira/browse/SPARK-30425 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Haifeng Chen >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > I was trying to understand how Data Source V2 handles partition pruning, but I didn't find the code anywhere that filters out the unnecessary files in the current Data Source V2 implementation. For a file data source, the base class FileScan of Data Source V2 should probably handle this in the "partitions" method. But the current implementation is like the following: > protected def partitions: Seq[FilePartition] = { > val selectedPartitions = fileIndex.listFiles(Seq.empty, Seq.empty) > > listFiles is passed empty sequences, so no files will be filtered by the partition filter.
[jira] [Commented] (SPARK-30454) Null Dereference in HiveSQLException
[ https://issues.apache.org/jira/browse/SPARK-30454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010412#comment-17010412 ] pavithra ramachandran commented on SPARK-30454: --- I shall raise the PR > Null Dereference in HiveSQLException > > > Key: SPARK-30454 > URL: https://issues.apache.org/jira/browse/SPARK-30454 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: pavithra ramachandran >Priority: Major > > Null pointer dereferencing found in Spark's HiveSQLException code.
[jira] [Created] (SPARK-30454) Null Dereference in HiveSQLException
pavithra ramachandran created SPARK-30454: - Summary: Null Dereference in HiveSQLException Key: SPARK-30454 URL: https://issues.apache.org/jira/browse/SPARK-30454 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4, 2.3.4, 3.0.0 Reporter: pavithra ramachandran Null pointer dereferencing found in Spark's HiveSQLException code.
[jira] [Commented] (SPARK-30425) FileScan of Data Source V2 doesn't implement Partition Pruning
[ https://issues.apache.org/jira/browse/SPARK-30425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010408#comment-17010408 ] Sandeep Katta commented on SPARK-30425: --- [~gengliang] > FileScan of Data Source V2 doesn't implement Partition Pruning > -- > > Key: SPARK-30425 > URL: https://issues.apache.org/jira/browse/SPARK-30425 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Haifeng Chen >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > I was trying to understand how Data Source V2 handles partition pruning, but I didn't find the code anywhere that filters out the unnecessary files in the current Data Source V2 implementation. For a file data source, the base class FileScan of Data Source V2 should probably handle this in the "partitions" method. But the current implementation is like the following: > protected def partitions: Seq[FilePartition] = { > val selectedPartitions = fileIndex.listFiles(Seq.empty, Seq.empty) > > listFiles is passed empty sequences, so no files will be filtered by the partition filter.
[jira] [Commented] (SPARK-30425) FileScan of Data Source V2 doesn't implement Partition Pruning
[ https://issues.apache.org/jira/browse/SPARK-30425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010407#comment-17010407 ] Sandeep Katta commented on SPARK-30425: --- Is this a duplicate of [SPARK-30428|https://issues.apache.org/jira/browse/SPARK-30428]? > FileScan of Data Source V2 doesn't implement Partition Pruning > -- > > Key: SPARK-30425 > URL: https://issues.apache.org/jira/browse/SPARK-30425 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Haifeng Chen >Priority: Major > Original Estimate: 168h > Remaining Estimate: 168h > > I was trying to understand how Data Source V2 handles partition pruning, but I didn't find the code anywhere that filters out the unnecessary files in the current Data Source V2 implementation. For a file data source, the base class FileScan of Data Source V2 should probably handle this in the "partitions" method. But the current implementation is like the following: > protected def partitions: Seq[FilePartition] = { > val selectedPartitions = fileIndex.listFiles(Seq.empty, Seq.empty) > > listFiles is passed empty sequences, so no files will be filtered by the partition filter.
[jira] [Commented] (SPARK-28478) Optimizer rule to remove unnecessary explicit null checks for null-intolerant expressions (e.g. if(x is null, x, f(x)))
[ https://issues.apache.org/jira/browse/SPARK-28478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010387#comment-17010387 ] David Vrba commented on SPARK-28478: [~cloud_fan] what do you think about this? Is it worth implementing? If yes, I would like to do it. If not, I won't bother. > Optimizer rule to remove unnecessary explicit null checks for null-intolerant > expressions (e.g. if(x is null, x, f(x))) > --- > > Key: SPARK-28478 > URL: https://issues.apache.org/jira/browse/SPARK-28478 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Major > > I ran across a family of expressions like > {code:java} > if(x is null, x, substring(x, 0, 1024)){code} > or > {code:java} > when($"x".isNull, $"x", substring($"x", 0, 1024)){code} > that were written this way because the query author was unsure about whether > {{substring}} would return {{null}} when its input string argument is null. > This explicit null-handling is unnecessary and adds bloat to the generated > code, especially if it's done via a {{CASE}} statement (which compiles down > to a {{do-while}} loop). > In another case I saw a query compiler which automatically generated this > type of code. > It would be cool if Spark could automatically optimize such queries to remove > these redundant null checks. Here's a sketch of what such a rule might look > like (assuming that SPARK-28477 has been implemented so we only need to worry > about the {{IF}} case): > * In the pattern match, check the following three conditions in the > following order (to benefit from short-circuiting) > ** The {{IF}} condition is an explicit null-check of a column {{c}} > ** The {{true}} expression returns either {{c}} or {{null}} > ** The {{false}} expression is a _null-intolerant_ expression with {{c}} as > a _direct_ child. > * If this condition matches, replace the entire {{If}} with the {{false}} > branch's expression.
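A minimal sketch of what such a rule could look like as a Catalyst optimizer rule, following the three conditions above. The rule name and the simplified null-intolerance check are my own, not Spark's implementation:

{code:scala}
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch: rewrite  if(c is null, c, f(c))  to  f(c)  when f is
// null-intolerant (i.e. f(c) already returns null whenever c is null).
object RemoveRedundantNullChecks extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
    case If(IsNull(c), trueValue, falseValue)
        if (trueValue.semanticEquals(c) ||
            trueValue == Literal(null, falseValue.dataType)) &&
           isNullIntolerantOn(falseValue, c) =>
      falseValue
  }

  // Simplified stand-in for a real null-intolerance test: only accept
  // expressions marked NullIntolerant that reference c as a direct child.
  private def isNullIntolerantOn(e: Expression, c: Expression): Boolean =
    e.isInstanceOf[NullIntolerant] && e.children.exists(_.semanticEquals(c))
}
{code}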
[jira] [Updated] (SPARK-30408) orderBy in sortBy clause is removed by EliminateSorts
[ https://issues.apache.org/jira/browse/SPARK-30408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] APeng Zhang updated SPARK-30408: Description: OrderBy in sortBy clause will be removed by EliminateSorts. code to reproduce: {code:java} val dataset = Seq( ("a", 1, 4), ("b", 2, 5), ("b2", 2, 2), ("c", 3, 6) ).toDF("a", "b", "c") val groupData = dataset.orderBy("b") val sortData = groupData.sortWithinPartitions("c") {code} The content of groupData is: {code:java} partition 0: [a,1,4] partition 1: [b,2,5] [b2,2,2] partition 2: [c,3,6]{code} The content of sortData is: {code:java} partition 0: [a,1,4] [b,2,5] partition 1: [b2,2,2] [c,3,6]{code} UT to cover this defect: In EliminateSortsSuite.scala {code:java} test("should not remove orderBy in sortBy clause") { val plan = testRelation.orderBy('a.asc).sortBy('b.desc) val optimized = Optimize.execute(plan.analyze) val correctAnswer = testRelation.orderBy('a.asc).sortBy('b.desc).analyze comparePlans(optimized, correctAnswer) }{code} This test will fail because sortBy is removed by EliminateSorts. was: OrderBy in sortBy clause will be removed by EliminateSorts. code to reproduce: {code:java} val dataset = Seq( ("a", 1, 4), ("b", 2, 5), ("c", 3, 6) ).toDF("a", "b", "c") val groupData = dataset.orderBy("b") val sortData = groupData.sortWithinPartitions("c") {code} The content of groupData is: {code:java} partition 0: [a,1,4] partition 1: [b,2,5] partition 2: [c,3,6]{code} The content of sortData is: {code:java} partition 0: [a,1,4] partition 1: [b,2,5], [c,3,6]{code} UT to cover this defect: In EliminateSortsSuite.scala {code:java} test("should not remove orderBy in sortBy clause") { val plan = testRelation.orderBy('a.asc).sortBy('b.desc) val optimized = Optimize.execute(plan.analyze) val correctAnswer = testRelation.orderBy('a.asc).sortBy('b.desc).analyze comparePlans(optimized, correctAnswer) }{code} This test will be failed because sortBy was removed by EliminateSorts. > orderBy in sortBy clause is removed by EliminateSorts > - > > Key: SPARK-30408 > URL: https://issues.apache.org/jira/browse/SPARK-30408 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: APeng Zhang >Priority: Major > > OrderBy in sortBy clause will be removed by EliminateSorts. > code to reproduce: > {code:java} > val dataset = Seq( ("a", 1, 4), ("b", 2, 5), ("b2", 2, 2), ("c", 3, 6) > ).toDF("a", "b", "c") > val groupData = dataset.orderBy("b") > val sortData = groupData.sortWithinPartitions("c") > {code} > The content of groupData is: > {code:java} > partition 0: > [a,1,4] > partition 1: > [b,2,5] > [b2,2,2] > partition 2: > [c,3,6]{code} > The content of sortData is: > {code:java} > partition 0: > [a,1,4] > [b,2,5] > partition 1: > [b2,2,2] > [c,3,6]{code} > > UT to cover this defect: > In EliminateSortsSuite.scala > {code:java} > test("should not remove orderBy in sortBy clause") { > val plan = testRelation.orderBy('a.asc).sortBy('b.desc) > val optimized = Optimize.execute(plan.analyze) > val correctAnswer = testRelation.orderBy('a.asc).sortBy('b.desc).analyze > comparePlans(optimized, correctAnswer) > }{code} > > > This test will fail because sortBy is removed by EliminateSorts.
[jira] [Updated] (SPARK-30427) Add config item for limiting partition number when calculating statistics through File System
[ https://issues.apache.org/jira/browse/SPARK-30427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30427: -- Description: Currently, when Spark needs to calculate the statistics (e.g. sizeInBytes) of a table partition through the file system (e.g. HDFS), it does not consider the number of partitions. If the number of partitions is huge, it will take much time to calculate statistics that may not be that useful. It seems reasonable to add a config item to limit the number of partitions for which statistics are calculated through the file system. > Add config item for limiting partition number when calculating statistics > through File System > - > > Key: SPARK-30427 > URL: https://issues.apache.org/jira/browse/SPARK-30427 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > > Currently, when Spark needs to calculate the statistics (e.g. sizeInBytes) of a > table partition through the file system (e.g. HDFS), it does not consider the > number of partitions. If the number of partitions is huge, it will > take much time to calculate statistics that may not be that useful. > It seems reasonable to add a config item to limit the number of > partitions for which statistics are calculated through the file system.
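A minimal sketch of what such a config entry could look like, following the ConfigBuilder pattern used inside object SQLConf. The key name, doc text, and default value are made up for illustration:

{code:scala}
// Hypothetical entry inside object SQLConf; key and default are illustrative.
val MAX_PARTITION_NUM_FOR_FS_STATS =
  buildConf("spark.sql.statistics.fileSystem.maxPartitionNum")
    .doc("Maximum number of partitions for which table statistics (e.g. " +
      "sizeInBytes) may be computed by listing the file system. Beyond " +
      "this limit, Spark would skip the file-system scan.")
    .intConf
    .createWithDefault(32)
{code}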
[jira] [Updated] (SPARK-30427) Add config item for limiting partition number when calculating statistics through File System
[ https://issues.apache.org/jira/browse/SPARK-30427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30427: -- Summary: Add config item for limiting partition number when calculating statistics through File System (was: Add config item for limiting partition number when calculating statistics through HDFS) > Add config item for limiting partition number when calculating statistics > through File System > - > > Key: SPARK-30427 > URL: https://issues.apache.org/jira/browse/SPARK-30427 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major
[jira] [Commented] (SPARK-30196) Bump lz4-java version to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-30196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010373#comment-17010373 ] Takeshi Yamamuro commented on SPARK-30196: -- Yea, it seems yes (I can reproduce this on an older Mac env. I'm checking this, so please give me more time ;) > Bump lz4-java version to 1.7.0 > -- > > Key: SPARK-30196 > URL: https://issues.apache.org/jira/browse/SPARK-30196 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0
[jira] [Commented] (SPARK-28137) Data Type Formatting Functions: `to_number`
[ https://issues.apache.org/jira/browse/SPARK-28137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010351#comment-17010351 ] Takeshi Yamamuro commented on SPARK-28137: -- See: [https://github.com/apache/spark/pull/25963#issuecomment-571885135] > Data Type Formatting Functions: `to_number` > --- > > Key: SPARK-28137 > URL: https://issues.apache.org/jira/browse/SPARK-28137 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > ||Function||Return Type||Description||Example|| > |{{to_number(}}{{text}}{{, }}{{text}}{{)}}|{{numeric}}|convert string to > numeric|{{to_number('12,454.8-', '99G999D9S')}}| > https://www.postgresql.org/docs/12/functions-formatting.html
[jira] [Resolved] (SPARK-28137) Data Type Formatting Functions: `to_number`
[ https://issues.apache.org/jira/browse/SPARK-28137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-28137. -- Resolution: Won't Fix > Data Type Formatting Functions: `to_number` > --- > > Key: SPARK-28137 > URL: https://issues.apache.org/jira/browse/SPARK-28137 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > ||Function||Return Type||Description||Example|| > |{{to_number(}}{{text}}{{, }}{{text}}{{)}}|{{numeric}}|convert string to > numeric|{{to_number('12,454.8-', '99G999D9S')}}| > https://www.postgresql.org/docs/12/functions-formatting.html
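Although this went Won't Fix, the example in the table is easy to approximate on the JVM. A sketch using java.text.DecimalFormat; note the pattern below is DecimalFormat syntax (grouping comma, trailing '-' for negatives), not PostgreSQL's '99G999D9S' template:

{code:scala}
import java.text.DecimalFormat

// Approximates to_number('12,454.8-', '99G999D9S'):
// positive subpattern, then a negative subpattern with a trailing '-' suffix.
val fmt = new DecimalFormat("#,##0.0;#,##0.0-")
val n = fmt.parse("12,454.8-").doubleValue() // -12454.8
{code}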
[jira] [Commented] (SPARK-29878) Improper cache strategies in GraphX
[ https://issues.apache.org/jira/browse/SPARK-29878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010317#comment-17010317 ] Dong Wang commented on SPARK-29878: --- So are these unnecessary caches tolerable? The cached data is used only once in these cases, i.e., SSSPExample and ConnectedComponentsExample, and I know the caches are necessary for most of the other cases. Is there a perfect way to handle all cases well? > Improper cache strategies in GraphX > --- > > Key: SPARK-29878 > URL: https://issues.apache.org/jira/browse/SPARK-29878 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 3.0.0 >Reporter: Dong Wang >Priority: Major > > I have run examples.graphx.SSSPExample and looked through the RDD dependency > graphs as well as persist operations. There are some improper cache > strategies in GraphX. The same situations also exist when I run > ConnectedComponentsExample. > 1. vertices.cache() and newEdges.cache() are unnecessary > In SSSPExample, a graph is initialized by GraphImpl.mapVertices(). In this > method, a GraphImpl object is created using GraphImpl.apply(vertices, edges), > and RDDs vertices/newEdges are cached in apply(). But these two RDDs are not > directly used anymore (their child RDDs have been cached) in SSSPExample, so > the persists can be unnecessary here. > However, the other examples may need these two persists, so I think they > cannot be simply removed. It might be hard to fix. > {code:scala} > def apply[VD: ClassTag, ED: ClassTag]( > vertices: VertexRDD[VD], > edges: EdgeRDD[ED]): GraphImpl[VD, ED] = { > vertices.cache() // It is unnecessary for SSSPExample and > ConnectedComponentsExample > // Convert the vertex partitions in edges to the correct type > val newEdges = edges.asInstanceOf[EdgeRDDImpl[ED, _]] > .mapEdgePartitions((pid, part) => part.withoutVertexAttributes[VD]) > .cache() // It is unnecessary for SSSPExample and > ConnectedComponentsExample > GraphImpl.fromExistingRDDs(vertices, newEdges) > } > {code} > 2. Missing persist on newEdges > SSSPExample will invoke Pregel to do execution. Pregel will utilize > ReplicatedVertexView.upgrade(). I find that RDD newEdges will be directly used > by multiple actions in Pregel. So newEdges should be persisted. > Same as the above issue, this issue is also found in > ConnectedComponentsExample. It is also hard to fix, because the persist added > may be unnecessary for other examples. > {code:scala} > // Pregel.scala > // compute the messages > var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg) // > newEdges is created here > val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)]( > checkpointInterval, graph.vertices.sparkContext) > messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]]) > var activeMessages = messages.count() // The first time newEdges is used > ... > while (activeMessages > 0 && i < maxIterations) { > // Receive the messages and update the vertices. > prevG = g > g = g.joinVertices(messages)(vprog) // Generating g depends on > newEdges > ... > activeMessages = messages.count() // The second action to use newEdges. > newEdges should be unpersisted after this instruction. {code} {code:scala} > // ReplicatedVertexView.scala > def upgrade(vertices: VertexRDD[VD], includeSrc: Boolean, includeDst: Boolean): Unit = { > ... > val newEdges = > edges.withPartitionsRDD(edges.partitionsRDD.zipPartitions(shippedVerts) { > (ePartIter, shippedVertsIter) => ePartIter.map { > case (pid, edgePartition) => > (pid, > edgePartition.updateVertices(shippedVertsIter.flatMap(_._2.iterator))) > } > }) > edges = newEdges // newEdges should be persisted > hasSrcId = includeSrc > hasDstId = includeDst > } > } > {code} > As I don't have much knowledge about GraphX, I don't know how to fix these > issues well. > This issue is reported by our tool CacheCheck, which is used to dynamically > detect persist()/unpersist() API misuses.
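The underlying pattern the report asks for is worth showing on its own: persist an RDD that more than one action will consume, then release it once the last consumer has run. A self-contained, generic illustration, not a tested patch to Pregel:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistPatternSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("persist-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for newEdges: an RDD consumed by more than one action.
    val newEdges = sc.parallelize(1 to 1000000).map(i => (i % 7, i))
    newEdges.persist(StorageLevel.MEMORY_AND_DISK) // avoid recomputing per action

    val active = newEdges.count() // first action, materializes the cache
    val sample = newEdges.take(5) // second action reuses the cached data

    newEdges.unpersist(blocking = false) // release once no longer referenced
    println(s"active=$active sample=${sample.mkString(",")}")
    spark.stop()
  }
}
{code}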
[jira] [Commented] (SPARK-30125) Remove PostgreSQL dialect
[ https://issues.apache.org/jira/browse/SPARK-30125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010304#comment-17010304 ] Yuanjian Li commented on SPARK-30125: - Also link #26940 with this Jira. > Remove PostgreSQL dialect > - > > Key: SPARK-30125 > URL: https://issues.apache.org/jira/browse/SPARK-30125 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.0.0 > > > As discussed in > [http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html], > we need to remove the PostgreSQL dialect from the code base for several reasons: > 1. The current approach makes the codebase complicated and hard to maintain. > 2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now. > > Currently we have 3 features under the PostgreSQL dialect: > 1. SPARK-27931: when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. > are also allowed as true strings. > 2. SPARK-29364: `date - date` returns interval in Spark (SQL standard > behavior), but returns int in PostgreSQL > 3. SPARK-28395: `int / int` returns double in Spark, but returns int in > PostgreSQL. (there is no standard)
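To make items 2 and 3 concrete, here is how the default Spark behavior shows up in a spark-shell session. Expected results are per the description above; treat this as illustrative:

{code:scala}
// Item 3: integer division returns double in Spark SQL.
spark.sql("SELECT 1 / 2").collect() // Array([0.5]); PostgreSQL returns 0

// Item 2: date subtraction returns an interval in Spark (SQL standard),
// where PostgreSQL returns the int number of days.
spark.sql("SELECT DATE'2020-01-02' - DATE'2020-01-01'").show()
{code}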
[jira] [Resolved] (SPARK-30450) Exclude .git folder for python linter
[ https://issues.apache.org/jira/browse/SPARK-30450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30450. --- Fix Version/s: 3.0.0 2.4.5 Resolution: Fixed Issue resolved by pull request 27121 [https://github.com/apache/spark/pull/27121] > Exclude .git folder for python linter > - > > Key: SPARK-30450 > URL: https://issues.apache.org/jira/browse/SPARK-30450 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Minor > Fix For: 2.4.5, 3.0.0 > > > The python linter shouldn't include the .git folder.
[jira] [Resolved] (SPARK-30429) WideSchemaBenchmark fails with OOM
[ https://issues.apache.org/jira/browse/SPARK-30429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30429. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27117 [https://github.com/apache/spark/pull/27117] > WideSchemaBenchmark fails with OOM > -- > > Key: SPARK-30429 > URL: https://issues.apache.org/jira/browse/SPARK-30429 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.0.0 > > Attachments: WideSchemaBenchmark_console.txt > > > Run WideSchemaBenchmark on the master (commit > bc16bb1dd095c9e1c8deabf6ac0d528441a81d88) via: > {code} > SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain > org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark" > {code} > This fails with: > {code} > Caused by: java.lang.reflect.InvocationTargetException > [error] at > sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source) > [error] at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > [error] at > java.lang.reflect.Constructor.newInstance(Constructor.java:423) > [error] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$7(TreeNode.scala:468) > [error] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > [error] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$1(TreeNode.scala:467) > [error] at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) > [error] ... 132 more > [error] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded > [error] at java.util.Arrays.copyOfRange(Arrays.java:3664) > [error] at java.lang.String.<init>(String.java:207) > [error] at java.lang.StringBuilder.toString(StringBuilder.java:407) > [error] at > org.apache.spark.sql.types.StructType.catalogString(StructType.scala:411) > [error] at > org.apache.spark.sql.types.StructType.$anonfun$catalogString$1(StructType.scala:410) > [error] at > org.apache.spark.sql.types.StructType$$Lambda$2441/1040526643.apply(Unknown > Source) > {code} > Full stack dump is attached.
[jira] [Assigned] (SPARK-30429) WideSchemaBenchmark fails with OOM
[ https://issues.apache.org/jira/browse/SPARK-30429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30429: - Assignee: L. C. Hsieh > WideSchemaBenchmark fails with OOM > -- > > Key: SPARK-30429 > URL: https://issues.apache.org/jira/browse/SPARK-30429 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: L. C. Hsieh >Priority: Major > Attachments: WideSchemaBenchmark_console.txt > > > Run WideSchemaBenchmark on the master (commit > bc16bb1dd095c9e1c8deabf6ac0d528441a81d88) via: > {code} > SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain > org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark" > {code} > This fails with: > {code} > Caused by: java.lang.reflect.InvocationTargetException > [error] at > sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source) > [error] at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > [error] at > java.lang.reflect.Constructor.newInstance(Constructor.java:423) > [error] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$7(TreeNode.scala:468) > [error] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > [error] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$1(TreeNode.scala:467) > [error] at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) > [error] ... 132 more > [error] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded > [error] at java.util.Arrays.copyOfRange(Arrays.java:3664) > [error] at java.lang.String.<init>(String.java:207) > [error] at java.lang.StringBuilder.toString(StringBuilder.java:407) > [error] at > org.apache.spark.sql.types.StructType.catalogString(StructType.scala:411) > [error] at > org.apache.spark.sql.types.StructType.$anonfun$catalogString$1(StructType.scala:410) > [error] at > org.apache.spark.sql.types.StructType$$Lambda$2441/1040526643.apply(Unknown > Source) > {code} > Full stack dump is attached.
[jira] [Resolved] (SPARK-30453) Update AppVeyor R version to 3.6.2
[ https://issues.apache.org/jira/browse/SPARK-30453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30453. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27124 [https://github.com/apache/spark/pull/27124] > Update AppVeyor R version to 3.6.2 > -- > > Key: SPARK-30453 > URL: https://issues.apache.org/jira/browse/SPARK-30453 > Project: Spark > Issue Type: Improvement > Components: Build, SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0
[jira] [Assigned] (SPARK-30453) Update AppVeyor R version to 3.6.2
[ https://issues.apache.org/jira/browse/SPARK-30453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30453: - Assignee: Hyukjin Kwon > Update AppVeyor R version to 3.6.2 > -- > > Key: SPARK-30453 > URL: https://issues.apache.org/jira/browse/SPARK-30453 > Project: Spark > Issue Type: Improvement > Components: Build, SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor
[jira] [Resolved] (SPARK-30302) Complete info for show create table for views
[ https://issues.apache.org/jira/browse/SPARK-30302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-30302. -- Fix Version/s: 3.0.0 Assignee: Zhenhua Wang Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/26944] > Complete info for show create table for views > - > > Key: SPARK-30302 > URL: https://issues.apache.org/jira/browse/SPARK-30302 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Minor > Fix For: 3.0.0 > > > Add table/column comments and table properties to the result of show create > table of views.
[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering
[ https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010271#comment-17010271 ] Takeshi Yamamuro commented on SPARK-30421: -- Based on the current implementation, drop and select (drop is a shorthand for a partial use-case of select?) seem to have the same semantics. If so, that query might be correct in lazy evaluation. btw, for changing this behaviour, IMO it would be better to reconstruct the dataframe ([https://github.com/maropu/spark/commit/fac04161405b9ee755b4c7f87de2a144c609c7fa]) instead of modifying the resolution logic. That's because the resolution logic affects many places. > Dropped columns still available for filtering > - > > Key: SPARK-30421 > URL: https://issues.apache.org/jira/browse/SPARK-30421 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Tobias Hermann >Priority: Minor > > The following minimal example: > {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar") > df.select("foo").where($"bar" === "a").show > df.drop("bar").where($"bar" === "a").show > {quote} > should result in an error like the following: > {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given > input columns: [foo]; > {quote} > However, it does not, but instead works without error, as if the column "bar" > still existed.
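A self-contained repro plus the intuition: drop behaves like a select of the remaining columns, and the analyzer can still resolve the dropped name through the child plan. This is a sketch of the reported behavior; the commit linked above takes the re-create-the-DataFrame approach instead:

{code:scala}
import org.apache.spark.sql.SparkSession

object DropResolutionRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("repro").getOrCreate()
    import spark.implicits._

    val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")

    // drop("bar") behaves like select("foo"): the column disappears from the
    // output schema, but the child plan still carries it...
    val dropped = df.drop("bar")
    println(dropped.columns.mkString(",")) // foo

    // ...so the analyzer can still resolve $"bar" in a later filter,
    // instead of failing with AnalysisException as the reporter expects.
    dropped.where($"bar" === "a").show()
    spark.stop()
  }
}
{code}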
[jira] [Resolved] (SPARK-30381) GBT reuse treePoints for all trees
[ https://issues.apache.org/jira/browse/SPARK-30381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-30381. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27103 [https://github.com/apache/spark/pull/27103] > GBT reuse treePoints for all trees > -- > > Key: SPARK-30381 > URL: https://issues.apache.org/jira/browse/SPARK-30381 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > Fix For: 3.0.0 > > > In the existing GBT, for each tree, it will first compute the available splits of each > feature (via RandomForest.findSplits), based on the sampled dataset at this > iteration. Then it will use these splits to discretize vectors into > BaggedPoint[TreePoint]s. The BaggedPoints (of the same size as the input vectors) > are then cached and used at this iteration. Note that the splits for > discretization in each tree are different (if subsamplingRate<1), only > because the sampled vectors are different. > However, the splits at different iterations should be similar if the sampled > dataset is big enough, and even the same if subsamplingRate=1. > > However, in other famous GBT impls (like XGBoost/LightGBM) with binned > features, the splits for discretization are the same for different iterations: > {code:java} > import xgboost as xgb > from sklearn.datasets import load_svmlight_file > X, y = > load_svmlight_file('/data0/Dev/Opensource/spark/data/mllib/sample_linear_regression_data.txt') > dtrain = xgb.DMatrix(X[:, :2], label=y) > num_round = 3 > param = {'max_depth': 2, 'eta': 1, 'objective': 'reg:squarederror', > 'tree_method': 'hist', 'max_bin': 2, 'eta': 0.01, 'subsample':0.5} > bst = xgb.train(param, dtrain, num_round) > bst.trees_to_dataframe('/tmp/bst') > Out[61]: > Tree Node ID Feature Split Yes No Missing Gain Cover > 0 0 0 0-0 f1 0.000408 0-1 0-2 0-1 170.337143 256.0 > 1 0 1 0-1 f0 0.003531 0-3 0-4 0-3 44.865482 121.0 > 2 0 2 0-2 f0 0.003531 0-5 0-6 0-5 125.615570 135.0 > 3 0 3 0-3 Leaf NaN NaN NaN NaN -0.010050 67.0 > 4 0 4 0-4 Leaf NaN NaN NaN NaN 0.002126 54.0 > 5 0 5 0-5 Leaf NaN NaN NaN NaN 0.020972 69.0 > 6 0 6 0-6 Leaf NaN NaN NaN NaN 0.001714 66.0 > 7 1 0 1-0 f0 0.003531 1-1 1-2 1-1 50.417793 263.0 > 8 1 1 1-1 f1 0.000408 1-3 1-4 1-3 48.732742 124.0 > 9 1 2 1-2 f1 0.000408 1-5 1-6 1-5 52.832161 139.0 > 10 1 3 1-3 Leaf NaN NaN NaN NaN -0.012784 63.0 > 11 1 4 1-4 Leaf NaN NaN NaN NaN -0.000287 61.0 > 12 1 5 1-5 Leaf NaN NaN NaN NaN 0.008661 64.0 > 13 1 6 1-6 Leaf NaN NaN NaN NaN -0.003624 75.0 > 14 2 0 2-0 f1 0.000408 2-1 2-2 2-1 62.136013 242.0 > 15 2 1 2-1 f0 0.003531 2-3 2-4 2-3 150.537781 118.0 > 16 2 2 2-2 f0 0.003531 2-5 2-6 2-5 3.829046 124.0 > 17 2 3 2-3 Leaf NaN NaN NaN NaN -0.016737 65.0 > 18 2 4 2-4 Leaf NaN NaN NaN NaN 0.005809 53.0 > 19 2 5 2-5 Leaf NaN NaN NaN NaN 0.005251 60.0 > 20 2 6 2-6 Leaf NaN NaN NaN NaN 0.001709 64.0 > {code} > > We can see that even if we set subsample=0.5, the three trees share the same > splits. > > So I think we could reuse the splits and treePoints at all iterations: > at iteration=0, compute the splits on the whole training dataset, and use the > splits to generate treePoints. > At each iteration, directly generate baggedPoints based on the treePoints. > Here we do not need to persist/unpersist the internal training dataset for > each tree.
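A sketch of the proposed restructuring: compute the splits and the discretized treePoints once, up front, so each boosting iteration only re-samples. findSplits and discretize below are simplified stand-ins for the internal RandomForest helpers, so treat this as an outline rather than the real implementation:

{code:scala}
import scala.util.Random

// Outline of the proposal: splits + treePoints are built once, then every
// boosting iteration only draws a bagged subsample of the same treePoints.
object GbtSplitReuseSketch {
  type Vector = Array[Double]
  type TreePoint = Array[Int] // per-feature bin index

  // Crude equi-depth split finder; stands in for RandomForest.findSplits.
  def findSplits(data: Seq[Vector], maxBins: Int): Array[Array[Double]] = {
    val numFeatures = data.head.length
    Array.tabulate(numFeatures) { f =>
      val sorted = data.map(_(f)).sorted
      (1 until maxBins)
        .map(i => sorted(i * (sorted.size - 1) / maxBins))
        .distinct.toArray
    }
  }

  // Bin a vector against the fixed splits; stands in for TreePoint conversion.
  def discretize(v: Vector, splits: Array[Array[Double]]): TreePoint =
    v.indices.map(f => splits(f).count(_ <= v(f))).toArray

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    val data = Seq.fill(1000)(Array.fill(3)(rng.nextDouble()))

    // Once, on the whole training set (not per iteration):
    val splits = findSplits(data, maxBins = 32)
    val treePoints = data.map(discretize(_, splits)) // cache this once

    for (iter <- 0 until 3) {
      // Per iteration, only the bagging step remains.
      val bagged = treePoints.filter(_ => rng.nextDouble() < 0.5)
      println(s"iteration $iter: training on ${bagged.size} bagged points")
      // trainTree(bagged) would go here
    }
  }
}
{code}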
[jira] [Created] (SPARK-30453) Update AppVeyor R version to 3.6.2
Hyukjin Kwon created SPARK-30453: Summary: Update AppVeyor R version to 3.6.2 Key: SPARK-30453 URL: https://issues.apache.org/jira/browse/SPARK-30453 Project: Spark Issue Type: Improvement Components: Build, SparkR Affects Versions: 3.0.0 Reporter: Hyukjin Kwon
[jira] [Assigned] (SPARK-28264) Revisiting Python / pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28264: Assignee: Hyukjin Kwon (was: Reynold Xin) > Revisiting Python / pandas UDF > -- > > Key: SPARK-28264 > URL: https://issues.apache.org/jira/browse/SPARK-28264 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Reynold Xin >Assignee: Hyukjin Kwon >Priority: Blocker > > In the past two years, the pandas UDFs are perhaps the most important changes > to Spark for Python data science. However, these functionalities have evolved > organically, leading to some inconsistencies and confusion among users. This > document revisits UDF definition and naming, as a result of discussions among > Xiangrui, Li Jin, Hyukjin, and Reynold. > -See document here: > [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit]- > New proposal: > https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing
[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering
[ https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010227#comment-17010227 ] Aman Omer commented on SPARK-30421: --- cc [~maropu] > Dropped columns still available for filtering > - > > Key: SPARK-30421 > URL: https://issues.apache.org/jira/browse/SPARK-30421 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Tobias Hermann >Priority: Minor > > The following minimal example: > {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar") > df.select("foo").where($"bar" === "a").show > df.drop("bar").where($"bar" === "a").show > {quote} > should result in an error like the following: > {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given > input columns: [foo]; > {quote} > However, it does not, but instead works without error, as if the column "bar" > still existed.
[jira] [Commented] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch
[ https://issues.apache.org/jira/browse/SPARK-26249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010223#comment-17010223 ] Takeshi Yamamuro commented on SPARK-26249: -- I'll close this because the corresponding pr is inactive (automatically closed). If necessary, please reopen this. Thanks. > Extension Points Enhancements to inject a rule in order and to add a batch > -- > > Key: SPARK-26249 > URL: https://issues.apache.org/jira/browse/SPARK-26249 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sunitha Kambhampati >Priority: Major > > +Motivation:+ > Spark has an extension points API to allow third parties to extend Spark with > custom optimization rules. The current API does not allow fine-grained control > over when the optimization rule will be exercised. In the current API, there > is no way to add a batch to the optimizer using the SparkSessionExtensions > API, similar to the postHocOptimizationBatches in SparkOptimizer. > In our use cases, we have optimization rules that we want to add as > extensions to a batch in a specific order. > +Proposal:+ > Add 2 new APIs to the existing Extension Points to allow for more > flexibility for third party users of Spark. > # Inject an optimizer rule into a batch in order > # Inject an optimizer batch in order > The design spec is here: > [https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]
[jira] [Resolved] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch
[ https://issues.apache.org/jira/browse/SPARK-26249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-26249. -- Resolution: Won't Fix > Extension Points Enhancements to inject a rule in order and to add a batch > -- > > Key: SPARK-26249 > URL: https://issues.apache.org/jira/browse/SPARK-26249 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Sunitha Kambhampati >Priority: Major > > +Motivation:+ > Spark has an extension points API to allow third parties to extend Spark with > custom optimization rules. The current API does not allow fine-grained control > over when the optimization rule will be exercised. In the current API, there > is no way to add a batch to the optimizer using the SparkSessionExtensions > API, similar to the postHocOptimizationBatches in SparkOptimizer. > In our use cases, we have optimization rules that we want to add as > extensions to a batch in a specific order. > +Proposal:+ > Add 2 new APIs to the existing Extension Points to allow for more > flexibility for third party users of Spark. > # Inject an optimizer rule into a batch in order > # Inject an optimizer batch in order > The design spec is here: > [https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]
[jira] [Resolved] (SPARK-28825) Document EXPLAIN Statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-28825. -- Fix Version/s: 3.0.0 Assignee: pavithra ramachandran Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/26970|https://github.com/apache/spark/pull/26970#issuecomment-571833889] > Document EXPLAIN Statement in SQL Reference. > > > Key: SPARK-28825 > URL: https://issues.apache.org/jira/browse/SPARK-28825 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Assignee: pavithra ramachandran >Priority: Major > Fix For: 3.0.0
[jira] [Updated] (SPARK-28825) Document EXPLAIN Statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-28825: - Affects Version/s: (was: 2.4.3) 3.0.0 > Document EXPLAIN Statement in SQL Reference. > > > Key: SPARK-28825 > URL: https://issues.apache.org/jira/browse/SPARK-28825 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Major
[jira] [Resolved] (SPARK-24884) Implement regexp_extract_all
[ https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-24884. -- Resolution: Won't Fix > Implement regexp_extract_all > > > Key: SPARK-24884 > URL: https://issues.apache.org/jira/browse/SPARK-24884 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nick Nicolini >Priority: Major > > I've recently hit many cases of regexp parsing where we need to match on > something that is always arbitrary in length; for example, a text block that > looks something like: > {code:java} > AAA:WORDS| > BBB:TEXT| > MSG:ASDF| > MSG:QWER| > ... > MSG:ZXCV|{code} > Where I need to pull out all values between "MSG:" and "|", which can occur > in each instance between 1 and n times. I cannot reliably use the existing > {{regexp_extract}} method since the number of occurrences is always > arbitrary, and while I can write a UDF to handle this it'd be great if this > was supported natively in Spark. > Perhaps we can implement something like {{regexp_extract_all}} as > [Presto|https://prestodb.io/docs/current/functions/regexp.html] and > [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] > have?
[jira] [Commented] (SPARK-24884) Implement regexp_extract_all
[ https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010193#comment-17010193 ] Takeshi Yamamuro commented on SPARK-24884: -- I'll close this because the corresponding pr is inactive (automatically closed). If necessary, please reopen this. Thanks. > Implement regexp_extract_all > > > Key: SPARK-24884 > URL: https://issues.apache.org/jira/browse/SPARK-24884 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nick Nicolini >Priority: Major > > I've recently hit many cases of regexp parsing where we need to match on > something that is always arbitrary in length; for example, a text block that > looks something like: > {code:java} > AAA:WORDS| > BBB:TEXT| > MSG:ASDF| > MSG:QWER| > ... > MSG:ZXCV|{code} > Where I need to pull out all values between "MSG:" and "|", which can occur > in each instance between 1 and n times. I cannot reliably use the existing > {{regexp_extract}} method since the number of occurrences is always > arbitrary, and while I can write a UDF to handle this it'd be great if this > was supported natively in Spark. > Perhaps we can implement something like {{regexp_extract_all}} as > [Presto|https://prestodb.io/docs/current/functions/regexp.html] and > [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] > have?
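Until something lands natively, the UDF route the reporter mentions is short. A sketch; the function name and group-1 handling are my own choices, modeled loosely on Presto's regexp_extract_all:

{code:scala}
import org.apache.spark.sql.SparkSession

object RegexpExtractAllUdf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("regexp-extract-all").getOrCreate()
    import spark.implicits._

    // Extract capture group 1 of every match, like Presto's regexp_extract_all.
    spark.udf.register("regexp_extract_all",
      (s: String, pattern: String) =>
        if (s == null) null
        else pattern.r.findAllMatchIn(s).map(_.group(1)).toSeq)

    val df = Seq("AAA:WORDS|MSG:ASDF|MSG:QWER|MSG:ZXCV|").toDF("value")
    df.selectExpr("regexp_extract_all(value, 'MSG:([^|]+)')").show(false)
    // [ASDF, QWER, ZXCV]
    spark.stop()
  }
}
{code}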
[jira] [Created] (SPARK-30452) Add predict and numFeatures in Python IsotonicRegressionModel
Huaxin Gao created SPARK-30452: -- Summary: Add predict and numFeatures in Python IsotonicRegressionModel Key: SPARK-30452 URL: https://issues.apache.org/jira/browse/SPARK-30452 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 3.0.0 Reporter: Huaxin Gao Since IsotonicRegressionModel doesn't extend JavaPredictionModel, predict and numFeatures need to be added explicitly.
[jira] [Resolved] (SPARK-29167) Metrics of Analyzer/Optimizer use Scientific counting is not human readable
[ https://issues.apache.org/jira/browse/SPARK-29167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-29167. -- Resolution: Won't Fix > Metrics of Analyzer/Optimizer use Scientific counting is not human readable > --- > > Key: SPARK-29167 > URL: https://issues.apache.org/jira/browse/SPARK-29167 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > Metrics of Analyzer/Optimizer use Scientific counting is not human readable > !image-2019-09-19-11-36-18-966.png!
[jira] [Commented] (SPARK-29167) Metrics of Analyzer/Optimizer use Scientific counting is not human readable
[ https://issues.apache.org/jira/browse/SPARK-29167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010187#comment-17010187 ] Takeshi Yamamuro commented on SPARK-29167: -- I'll close this because no committer strongly supports this. If necessary, please reopen this. Thanks. > Metrics of Analyzer/Optimizer use Scientific counting is not human readable > --- > > Key: SPARK-29167 > URL: https://issues.apache.org/jira/browse/SPARK-29167 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > Metrics of Analyzer/Optimizer use Scientific counting is not human readable > !image-2019-09-19-11-36-18-966.png!
[jira] [Commented] (SPARK-30417) SPARK-29976 calculation of slots wrong for Standalone Mode
[ https://issues.apache.org/jira/browse/SPARK-30417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010156#comment-17010156 ] Xingbo Jiang commented on SPARK-30417: -- Good catch! `max(conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK, 1)` seems good enough for me. Thanks! > SPARK-29976 calculation of slots wrong for Standalone Mode > -- > > Key: SPARK-30417 > URL: https://issues.apache.org/jira/browse/SPARK-30417 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > In SPARK-29976 we added a config to determine if we should allow speculation > when the number of tasks is less than the number of slots on a single > executor. The problem is that for standalone mode (and mesos coarse > grained) the EXECUTOR_CORES config is not set properly by default. In those > modes the number of executor cores is all the cores of the Worker. The > default of EXECUTOR_CORES is 1. > The calculation: > val speculationTasksLessEqToSlots = numTasks <= (conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK) > If someone sets the cpus per task > 1 then this would end up being false even > with 1 task. Note that the default case, where cpus per task is 1 and executor > cores is 1, works out ok, but the check is only applied when comparing 1 task vs the number of slots > on the executor. > Here we really don't know the number of executor cores for standalone mode or > mesos, so I think a decent solution is to just use 1 in those cases and > document the difference. > Something like > max(conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK, 1)
[jira] [Commented] (SPARK-30417) SPARK-29976 calculation of slots wrong for Standalone Mode
[ https://issues.apache.org/jira/browse/SPARK-30417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010146#comment-17010146 ] Thomas Graves commented on SPARK-30417: --- The only way for standalone mode would be to look at what each executor registers with. Theoretically different executors could have different numbers of cores. There are actually other issues (SPARK-30299 for instance) with this in the code as well that I think we need a global solution for. So perhaps for this Jira we do the easy thing like I suggested and then we have a separate Jira to look at handling this better in the future. > SPARK-29976 calculation of slots wrong for Standalone Mode > -- > > Key: SPARK-30417 > URL: https://issues.apache.org/jira/browse/SPARK-30417 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > In SPARK-29976 we added a config to determine if we should allow speculation > when the number of tasks is less than the number of slots on a single > executor. The problem is that for standalone mode (and mesos coarse > grained) the EXECUTOR_CORES config is not set properly by default. In those > modes the number of executor cores is all the cores of the Worker. The > default of EXECUTOR_CORES is 1. > The calculation: > val speculationTasksLessEqToSlots = numTasks <= (conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK) > If someone sets the cpus per task > 1 then this would end up being false even > with 1 task. Note that the default case, where cpus per task is 1 and executor > cores is 1, works out ok, but the check is only applied when comparing 1 task vs the number of slots > on the executor. > Here we really don't know the number of executor cores for standalone mode or > mesos, so I think a decent solution is to just use 1 in those cases and > document the difference. > Something like > max(conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK, 1)
[jira] [Commented] (SPARK-30417) SPARK-29976 calculation of slots wrong for Standalone Mode
[ https://issues.apache.org/jira/browse/SPARK-30417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010142#comment-17010142 ] Yuchen Huo commented on SPARK-30417: [~tgraves] Sure. Is there a more stable way to get the number of cores the executor is using, instead of checking the value of EXECUTOR_CORES, which might not be set? cc [~jiangxb1987] > SPARK-29976 calculation of slots wrong for Standalone Mode > -- > > Key: SPARK-30417 > URL: https://issues.apache.org/jira/browse/SPARK-30417 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > In SPARK-29976 we added a config to determine if we should allow speculation > when the number of tasks is less than the number of slots on a single > executor. The problem is that for standalone mode (and Mesos coarse > grained) the EXECUTOR_CORES config is not set properly by default. In those > modes the number of executor cores is all the cores of the Worker, but the > default of EXECUTOR_CORES is 1. > The calculation: > val speculationTasksLessEqToSlots = numTasks <= (conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK) > If someone sets cpus per task > 1, this ends up false even for a single task. Note that in the default case, where cpus per task is 1 and executor cores is 1, it works out OK, but the check only compares 1 task against the number of slots on the executor. > Here we really don't know the number of executor cores for standalone mode or > Mesos, so I think a decent solution is to just use 1 in those cases and > document the difference. > Something like > max(conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK, 1) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30382) start-thriftserver throws ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-30382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30382. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27042 [https://github.com/apache/spark/pull/27042] > start-thriftserver throws ClassNotFoundException > > > Key: SPARK-30382 > URL: https://issues.apache.org/jira/browse/SPARK-30382 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Minor > Fix For: 3.0.0 > > > start-thriftserver.sh --help throws > {code} > . > > Thrift server options: > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/logging/log4j/spi/LoggerContextFactory > at org.apache.hive.service.server.HiveServer2.main(HiveServer2.java:167) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:82) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) > Caused by: java.lang.ClassNotFoundException: > org.apache.logging.log4j.spi.LoggerContextFactory > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 3 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30382) start-thriftserver throws ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-30382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30382: - Assignee: Ajith S > start-thriftserver throws ClassNotFoundException > > > Key: SPARK-30382 > URL: https://issues.apache.org/jira/browse/SPARK-30382 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Minor > > start-thriftserver.sh --help throws > {code} > . > > Thrift server options: > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/logging/log4j/spi/LoggerContextFactory > at org.apache.hive.service.server.HiveServer2.main(HiveServer2.java:167) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:82) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) > Caused by: java.lang.ClassNotFoundException: > org.apache.logging.log4j.spi.LoggerContextFactory > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 3 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30451) Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have functions to remove requests
Thomas Graves created SPARK-30451: - Summary: Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have functions to remove requests Key: SPARK-30451 URL: https://issues.apache.org/jira/browse/SPARK-30451 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Thomas Graves Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have functions to remove requests Currently in the design, ExecutorResourceRequests and TaskResourceRequests are mutable and users can update them as they want. It would make sense to add APIs to remove certain resource requirements from them. This would allow a user to create one ExecutorResourceRequests object and then, if they want to add or remove something from it, do so easily without having to recreate all the requests in it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
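A rough sketch of what such a removal API could look like. The class, method names, and internals below are hypothetical illustrations, not Spark's actual stage-level scheduling API:

{code:scala}
import scala.collection.mutable

// Hypothetical sketch of a mutable resource-request holder.
class ResourceRequestsSketch {
  private val requests = mutable.Map.empty[String, Long]

  def resource(name: String, amount: Long): this.type = {
    requests(name) = amount
    this
  }

  // The addition this issue asks for: drop one requirement without
  // rebuilding the whole request object.
  def removeResource(name: String): this.type = {
    requests -= name
    this
  }
}

// Usage: tweak a single entry instead of recreating every request.
val reqs = new ResourceRequestsSketch().resource("gpu", 2L).resource("fpga", 1L)
reqs.removeResource("fpga")
{code}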
[jira] [Updated] (SPARK-30450) Exclude .git folder for python linter
[ https://issues.apache.org/jira/browse/SPARK-30450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-30450: - Affects Version/s: (was: 2.4.4) 3.0.0 > Exclude .git folder for python linter > - > > Key: SPARK-30450 > URL: https://issues.apache.org/jira/browse/SPARK-30450 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Minor > > The python linter shouldn't include the .git folder. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30450) Exclude .git folder for python linter
[ https://issues.apache.org/jira/browse/SPARK-30450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-30450: - Priority: Minor (was: Major) > Exclude .git folder for python linter > - > > Key: SPARK-30450 > URL: https://issues.apache.org/jira/browse/SPARK-30450 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Minor > > The python linter shouldn't include the .git folder. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30450) Exclude .git folder for python linter
[ https://issues.apache.org/jira/browse/SPARK-30450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-30450: Assignee: Eric Chang > Exclude .git folder for python linter > - > > Key: SPARK-30450 > URL: https://issues.apache.org/jira/browse/SPARK-30450 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Major > > The python linter shouldn't include the .git folder. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30450) Exclude .git folder for python linter
Eric Chang created SPARK-30450: -- Summary: Exclude .git folder for python linter Key: SPARK-30450 URL: https://issues.apache.org/jira/browse/SPARK-30450 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4 Reporter: Eric Chang The python linter shouldn't include the .git folder. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009988#comment-17009988 ] Steve Loughran commented on SPARK-2984: --- bq. As part of your recommendation, is it guaranteed that parquet filenames will be unique across jobs? No idea. The S3A committer defaults to inserting a UUID into the filename to meet that guarantee. bq. Also, when "outputting independently", is it ok to use v2 commit algorithm? Only if each independent job fails completely when there's a failure/timeout during task commit (i.e. do not attempt to commit >1 task attempt for the same task). Spark does not currently do that, AFAIK. > FileNotFoundException on _temporary directory > - > > Key: SPARK-2984 > URL: https://issues.apache.org/jira/browse/SPARK-2984 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.3.0 > > > We've seen several stacktraces and threads on the user mailing list where > people are having issues with a {{FileNotFoundException}} stemming from an > HDFS path containing {{_temporary}}. > I ([~aash]) think this may be related to {{spark.speculation}}. I think the > error condition might manifest in this circumstance: > 1) task T starts on an executor E1 > 2) it takes a long time, so task T' is started on another executor E2 > 3) T finishes in E1 so moves its data from {{_temporary}} to the final > destination and deletes the {{_temporary}} directory during cleanup > 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but > those files no longer exist! exception > Some samples: > {noformat} > 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job > 140774430 ms.0 > java.io.FileNotFoundException: File > hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 > does not exist.
> at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) > at > org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136) > at > org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643) > at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at scala.util.Try$.apply(Try.scala:161) > at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > -- Chen Song at > http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html > {noformat} > I am running a Spark Streaming job that uses saveAsTextFiles to save results > into hdfs files. However, it has an exception after 20 batches > result-140631234/_temporary/0/task_201407251119__m_03 does not > exist. > {noformat} > and >
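To make the filename-uniqueness point above concrete, a sketch of the UUID-in-filename approach Steve describes; the exact S3A naming format is not shown in the thread, so the layout here is illustrative:

{code:scala}
import java.util.UUID

// One UUID per job; embedding it in every part filename means retried or
// concurrent jobs writing to the same directory can never clash.
val jobUUID = UUID.randomUUID().toString
val partFile = s"part-00000-$jobUUID.snappy.parquet"
// e.g. part-00000-3f1c9a4e-....snappy.parquet
{code}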
[jira] [Commented] (SPARK-30448) accelerator aware scheduling enforce cores as limiting resource
[ https://issues.apache.org/jira/browse/SPARK-30448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009989#comment-17009989 ] Thomas Graves commented on SPARK-30448: --- Note this actually overlaps with https://issues.apache.org/jira/browse/SPARK-30446 since with this change some of those checks don't make sense. > accelerator aware scheduling enforce cores as limiting resource > --- > > Key: SPARK-30448 > URL: https://issues.apache.org/jira/browse/SPARK-30448 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > For the first version of accelerator aware scheduling (SPARK-27495), the SPIP > had a condition that we can support dynamic allocation because we were going > to have a strict requirement that we don't waste any resources. This means > that the number of slots each executor has could be calculated from > the number of cores and task cpus just as is done today. > Somewhere along the line of development we relaxed that and only warn when we > are wasting resources. This breaks the dynamic allocation logic if the > limiting resource is no longer the cores. This means we will request fewer > executors than we really need to run everything. > We have to enforce that cores is always the limiting resource, so we should > throw if it's not. > I guess we could only make this a requirement with dynamic allocation on, but > to make the behavior consistent I would say we just require it across the > board. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30442) Write mode ignored when using CodecStreams
[ https://issues.apache.org/jira/browse/SPARK-30442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009963#comment-17009963 ] Maxim Gekk commented on SPARK-30442: > This can cause issues, particularly with AWS tools, that make it impossible to retry. Could you clarify how it makes retry impossible? When the mode is set to overwrite, Spark deletes the entire folder and writes new files - there should be no clashes. In the append mode, new files are added - Spark does not append to existing files. What's the situation in which files should be overwritten? > Write mode ignored when using CodecStreams > -- > > Key: SPARK-30442 > URL: https://issues.apache.org/jira/browse/SPARK-30442 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.4 >Reporter: Jesse Collins >Priority: Major > > Overwrite is hardcoded to false in the codec stream. This can cause issues, > particularly with AWS tools, that make it impossible to retry. > Ideally, this should be read from the write mode set for the DataWriter that > is writing through this codec class. > [https://github.com/apache/spark/blame/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala#L81] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
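For context, the linked line boils down to a hardcoded overwrite flag. A simplified rendering of that call site, trimmed to the essentials, so treat the exact shape as approximate:

{code:scala}
import java.io.OutputStream

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.JobContext

// Simplified from CodecStreams.createOutputStream: the second argument to
// fs.create() is the overwrite flag, fixed to false regardless of the
// DataFrameWriter's save mode.
def createOutputStream(context: JobContext, file: Path): OutputStream = {
  val fs = file.getFileSystem(context.getConfiguration)
  fs.create(file, false)
}
{code}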
[jira] [Updated] (SPARK-30449) Introducing get_dummies method in pyspark
[ https://issues.apache.org/jira/browse/SPARK-30449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krishna Kumar Tiwari updated SPARK-30449: - Flags: Important > Introducing get_dummies method in pyspark > - > > Key: SPARK-30449 > URL: https://issues.apache.org/jira/browse/SPARK-30449 > Project: Spark > Issue Type: Task > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Krishna Kumar Tiwari >Priority: Major > > Introducing a get_dummies method in pyspark, same as in pandas. > When working with a categorical variable, we often want to flatten the data via > one-hot encoding, generating one column per category value and filling the matrix; > get_dummies is very useful in that scenario. > > The objective here is to introduce get_dummies to pyspark. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30449) Introducing get_dummies method in pyspark
[ https://issues.apache.org/jira/browse/SPARK-30449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009924#comment-17009924 ] Krishna Kumar Tiwari commented on SPARK-30449: -- I am already working on this, will share the PR soon. > Introducing get_dummies method in pyspark > - > > Key: SPARK-30449 > URL: https://issues.apache.org/jira/browse/SPARK-30449 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Krishna Kumar Tiwari >Priority: Major > > Introducing a get_dummies method in pyspark, same as in pandas. > When working with a categorical variable, we often want to flatten the data via > one-hot encoding, generating one column per category value and filling the matrix; > get_dummies is very useful in that scenario. > > The objective here is to introduce get_dummies to pyspark. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30449) Introducing get_dummies method in pyspark
[ https://issues.apache.org/jira/browse/SPARK-30449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krishna Kumar Tiwari updated SPARK-30449: - Issue Type: Task (was: New Feature) > Introducing get_dummies method in pyspark > - > > Key: SPARK-30449 > URL: https://issues.apache.org/jira/browse/SPARK-30449 > Project: Spark > Issue Type: Task > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Krishna Kumar Tiwari >Priority: Major > > Introducing a get_dummies method in pyspark, same as in pandas. > When working with a categorical variable, we often want to flatten the data via > one-hot encoding, generating one column per category value and filling the matrix; > get_dummies is very useful in that scenario. > > The objective here is to introduce get_dummies to pyspark. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30449) Introducing get_dummies method in pyspark
Krishna Kumar Tiwari created SPARK-30449: Summary: Introducing get_dummies method in pyspark Key: SPARK-30449 URL: https://issues.apache.org/jira/browse/SPARK-30449 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 2.4.4 Reporter: Krishna Kumar Tiwari Introducing a get_dummies method in pyspark, same as in pandas. When working with a categorical variable, we often want to flatten the data via one-hot encoding, generating one column per category value and filling the matrix; get_dummies is very useful in that scenario. The objective here is to introduce get_dummies to pyspark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
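Until such a method exists, a rough equivalent of pandas get_dummies can be expressed with a pivot. A sketch in Scala with illustrative column names; the proposed PySpark API may of course differ:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("get-dummies-sketch").getOrCreate()
import spark.implicits._

// Illustrative data: a row id plus one categorical column.
val df = Seq((1, "red"), (2, "blue"), (3, "red")).toDF("id", "color")

// Pivot the category values into one column per value; count() marks
// presence with 1, and na.fill(0) supplies the zeros -- roughly what
// pandas.get_dummies produces.
val dummies = df.groupBy("id").pivot("color").count().na.fill(0)
dummies.show()
{code}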
[jira] [Commented] (SPARK-30448) accelerator aware scheduling enforce cores as limiting resource
[ https://issues.apache.org/jira/browse/SPARK-30448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009901#comment-17009901 ] Thomas Graves commented on SPARK-30448: --- Note there are other calculations throughout the Spark code that compute the number of slots, so I think it's best for now just to require cores to be the limiting resource. > accelerator aware scheduling enforce cores as limiting resource > --- > > Key: SPARK-30448 > URL: https://issues.apache.org/jira/browse/SPARK-30448 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > For the first version of accelerator aware scheduling (SPARK-27495), the SPIP > had a condition that we can support dynamic allocation because we were going > to have a strict requirement that we don't waste any resources. This means > that the number of slots each executor has could be calculated from > the number of cores and task cpus just as is done today. > Somewhere along the line of development we relaxed that and only warn when we > are wasting resources. This breaks the dynamic allocation logic if the > limiting resource is no longer the cores. This means we will request fewer > executors than we really need to run everything. > We have to enforce that cores is always the limiting resource, so we should > throw if it's not. > I guess we could only make this a requirement with dynamic allocation on, but > to make the behavior consistent I would say we just require it across the > board. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30446) Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases
[ https://issues.apache.org/jira/browse/SPARK-30446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009896#comment-17009896 ] Thomas Graves commented on SPARK-30446: --- Working on this. > Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases > -- > > Key: SPARK-30446 > URL: https://issues.apache.org/jira/browse/SPARK-30446 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > With accelerator aware scheduling, SparkContext.checkResourcesPerTask > tries to make sure that users have configured things properly, and warns or > errors if not. > It doesn't properly handle all cases, like warning when cpu resources are being > wasted. We should test this better and handle those cases. > I fixed these in the stage level scheduling, but I'm not sure of the timeline for > getting that in, so we may want to fix this separately as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30448) accelerator aware scheduling enforce cores as limiting resource
Thomas Graves created SPARK-30448: - Summary: accelerator aware scheduling enforce cores as limiting resource Key: SPARK-30448 URL: https://issues.apache.org/jira/browse/SPARK-30448 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Thomas Graves For the first version of accelerator aware scheduling (SPARK-27495), the SPIP had a condition that we can support dynamic allocation because we were going to have a strict requirement that we don't waste any resources. This means that the number of slots each executor has could be calculated from the number of cores and task cpus just as is done today. Somewhere along the line of development we relaxed that and only warn when we are wasting resources. This breaks the dynamic allocation logic if the limiting resource is no longer the cores. This means we will request fewer executors than we really need to run everything. We have to enforce that cores is always the limiting resource, so we should throw if it's not. I guess we could only make this a requirement with dynamic allocation on, but to make the behavior consistent I would say we just require it across the board. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
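A self-contained sketch of the invariant this issue wants to enforce, using illustrative config values rather than Spark's actual scheduler code:

{code:scala}
// Slots per executor as dynamic allocation computes them today, from CPUs.
val executorCores = 4     // spark.executor.cores
val taskCpus      = 2     // spark.task.cpus
val cpuSlots      = executorCores / taskCpus   // 2 concurrent tasks

// A custom resource can silently become the limiting factor instead.
val executorGpus  = 1     // spark.executor.resource.gpu.amount
val taskGpus      = 1     // spark.task.resource.gpu.amount
val gpuSlots      = executorGpus / taskGpus    // only 1 concurrent task

// If gpuSlots < cpuSlots, executors run fewer tasks than the CPU math says,
// so dynamic allocation requests too few executors. Hence: throw.
require(gpuSlots >= cpuSlots,
  s"gpu limits tasks per executor to $gpuSlots but cores allow $cpuSlots; " +
    "cores must remain the limiting resource")
{code}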
[jira] [Assigned] (SPARK-30039) CREATE FUNCTION should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-30039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30039: --- Assignee: Pablo Langa Blanco > CREATE FUNCTION should look up catalog/table like v2 commands > -- > > Key: SPARK-30039 > URL: https://issues.apache.org/jira/browse/SPARK-30039 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Assignee: Pablo Langa Blanco >Priority: Major > > CREATE FUNCTION should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30039) CREATE FUNCTION should look up catalog/table like v2 commands
[ https://issues.apache.org/jira/browse/SPARK-30039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30039. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26890 [https://github.com/apache/spark/pull/26890] > CREATE FUNCTION should look up catalog/table like v2 commands > -- > > Key: SPARK-30039 > URL: https://issues.apache.org/jira/browse/SPARK-30039 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Pablo Langa Blanco >Assignee: Pablo Langa Blanco >Priority: Major > Fix For: 3.0.0 > > > CREATE FUNCTION should look up catalog/table like v2 commands -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30443) "Managed memory leak detected" even with no calls to take() or limit()
[ https://issues.apache.org/jira/browse/SPARK-30443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Richter updated SPARK-30443: - Description: Our Spark code is causing a "Managed memory leak detected" warning to appear, even though we are not calling take() or limit(). According to SPARK-14168 https://issues.apache.org/jira/browse/SPARK-14168 managed memory leaks should only be caused by not reading an iterator to completion, i.e. take() or limit() Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed memory leak detected; size = 2097152 bytes, TID = 118" The size of the managed memory leak is always 2MB. I have created a minimal test program that reproduces the warning: {code:java} import pyspark.sql import pyspark.sql.functions as fx def main(): builder = pyspark.sql.SparkSession.builder builder = builder.appName("spark-jira") spark = builder.getOrCreate() reader = spark.read reader = reader.format("csv") reader = reader.option("inferSchema", "true") reader = reader.option("header", "true") table_c = reader.load("c.csv") table_a = reader.load("a.csv") table_b = reader.load("b.csv") primary_filter = fx.col("some_code").isNull() new_primary_data = table_a.filter(primary_filter) new_ids = new_primary_data.select("some_id") new_data = table_b.join(new_ids, "some_id") new_data = new_data.select("some_id") result = table_c.join(new_data, "some_id", "left") result.repartition(1).write.json("results.json", mode="overwrite") spark.stop() if __name__ == "__main__": main() {code} Our code isn't anything out of the ordinary, just some filters, selects and joins. The input data is made up of 3 CSV files. The input data files are quite large, roughly 2.6GB in total uncompressed. I attempted to reduce the number of rows in the CSV input files but this caused the warning to no longer appear. After compressing the files I was able to attach them below. was: Our Spark code is causing a "Managed memory leak detected" warning to appear, even though we are not calling take() or limit(). According to SPARK-14168 https://issues.apache.org/jira/browse/SPARK-14168 managed memory leaks should only be caused by not reading an iterator to completion, i.e. take() or limit() Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed memory leak detected; size = 2097152 bytes, TID = 118" The size of the managed memory leak is always 2MB. I have created a minimal test program that reproduces the warning: {code:java} import pyspark.sql import pyspark.sql.functions as fx def main(): builder = pyspark.sql.SparkSession.builder builder = builder.appName("spark-jira") spark = builder.getOrCreate() reader = spark.read reader = reader.format("csv") reader = reader.option("inferSchema", "true") reader = reader.option("header", "true") table_c = reader.load("c.csv") table_a = reader.load("a.csv") table_b = reader.load("b.csv") primary_filter = fx.col("some_code").isNull() new_primary_data = table_a.filter(primary_filter) new_ids = new_primary_data.select("some_id") new_data = table_b.join(new_ids, "some_id") new_data = new_data.select("some_id") result = table_c.join(new_data, "some_id", "left") result.repartition(1).write.json("results.json", mode="overwrite") spark.stop() if __name__ == "__main__": main() {code} Our code isn't anything out of the ordinary, just some filters, selects and joins. The input data is made up of 3 CSV files. The input data files are quite large, roughly 2.6GB in total uncompressed. 
I attempted to reduce the number of rows in the CSV input files but this caused the warning to no longer appear. What is the best way to get these test data files that reproduce the warning into your hands? > "Managed memory leak detected" even with no calls to take() or limit() > -- > > Key: SPARK-30443 > URL: https://issues.apache.org/jira/browse/SPARK-30443 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.4 >Reporter: Luke Richter >Priority: Major > Attachments: a.csv.zip, b.csv.zip, c.csv.zip > > > Our Spark code is causing a "Managed memory leak detected" warning to appear, > even though we are not calling take() or limit(). > According to SPARK-14168 https://issues.apache.org/jira/browse/SPARK-14168 > managed memory leaks should only be caused by not reading an iterator to > completion, i.e. take() or limit() > Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed > memory leak detected; size = 2097152 bytes, TID = 118" > The size of the managed memory leak is always 2MB. > I have
[jira] [Updated] (SPARK-30443) "Managed memory leak detected" even with no calls to take() or limit()
[ https://issues.apache.org/jira/browse/SPARK-30443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Richter updated SPARK-30443: - Attachment: a.csv.zip b.csv.zip c.csv.zip > "Managed memory leak detected" even with no calls to take() or limit() > -- > > Key: SPARK-30443 > URL: https://issues.apache.org/jira/browse/SPARK-30443 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.4 >Reporter: Luke Richter >Priority: Major > Attachments: a.csv.zip, b.csv.zip, c.csv.zip > > > Our Spark code is causing a "Managed memory leak detected" warning to appear, > even though we are not calling take() or limit(). > According to SPARK-14168 https://issues.apache.org/jira/browse/SPARK-14168 > managed memory leaks should only be caused by not reading an iterator to > completion, i.e. take() or limit() > Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed > memory leak detected; size = 2097152 bytes, TID = 118" > The size of the managed memory leak is always 2MB. > I have created a minimal test program that reproduces the warning: > {code:java} > import pyspark.sql > import pyspark.sql.functions as fx > def main(): > builder = pyspark.sql.SparkSession.builder > builder = builder.appName("spark-jira") > spark = builder.getOrCreate() > reader = spark.read > reader = reader.format("csv") > reader = reader.option("inferSchema", "true") > reader = reader.option("header", "true") > table_c = reader.load("c.csv") > table_a = reader.load("a.csv") > table_b = reader.load("b.csv") > primary_filter = fx.col("some_code").isNull() > new_primary_data = table_a.filter(primary_filter) > new_ids = new_primary_data.select("some_id") > new_data = table_b.join(new_ids, "some_id") > new_data = new_data.select("some_id") > result = table_c.join(new_data, "some_id", "left") > result.repartition(1).write.json("results.json", mode="overwrite") > spark.stop() > if __name__ == "__main__": > main() > {code} > Our code isn't anything out of the ordinary, just some filters, selects and > joins. > The input data is made up of 3 CSV files. The input data files are quite > large, roughly 2.6GB in total uncompressed. I attempted to reduce the number > of rows in the CSV input files but this caused the warning to no longer > appear. What is the best way to get these test data files that reproduce the > warning into your hands? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30437) Uneven spaces for some fields in EXPLAIN FORMATTED
[ https://issues.apache.org/jira/browse/SPARK-30437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30437. -- Resolution: Won't Fix > Uneven spaces for some fields in EXPLAIN FORMATTED > -- > > Key: SPARK-30437 > URL: https://issues.apache.org/jira/browse/SPARK-30437 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Aman Omer >Priority: Minor > > Output of EXPLAIN EXTENDED has uneven spaces. E.g., > {code:java} > (4) Project [codegen id : 1] > Output: [key#x, val#x] > Input : [key#x, val#x] > > (5) HashAggregate [codegen id : 1] > Input: [key#x, val#x] > {code} > Unlike the Input field of HashAggregate, the Output and Input fields of Project are > padded with extra spaces. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27604) Enhance constant and constraint propagation
[ https://issues.apache.org/jira/browse/SPARK-27604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-27604: --- Issue Type: Improvement (was: Bug) > Enhance constant and constraint propagation > --- > > Key: SPARK-27604 > URL: https://issues.apache.org/jira/browse/SPARK-27604 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Peter Toth >Priority: Major > > There is some room for improvement: constant propagation could allow > substituting deterministic expressions (instead of attributes only) with > constants, and could perform substitutions in predicates other than equality. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27604) Enhance constant and constraint propagation
[ https://issues.apache.org/jira/browse/SPARK-27604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-27604: --- Description: There is some room for improvement: constant propagation could allow substituting deterministic expressions (instead of attributes only) with constants, and could perform substitutions in predicates other than equality. (was: There is a bug in constant propagation due to null handling: {{SELECT * FROM t WHERE NOT(c = 1 AND c + 1 = 1)}} returns those rows where {{c}} is null due to {{1 + 1 = 1}} propagation There is some room for improvement as constant propagation could allow substitution of deterministic expressions (instead of attributes only) to constants and substitutions in other than equal predicates.) > Enhance constant and constraint propagation > --- > > Key: SPARK-27604 > URL: https://issues.apache.org/jira/browse/SPARK-27604 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Peter Toth >Priority: Major > > There is some room for improvement: constant propagation could allow > substituting deterministic expressions (instead of attributes only) with > constants, and could perform substitutions in predicates other than equality. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30447) Constant propagation nullability issue
Peter Toth created SPARK-30447: -- Summary: Constant propagation nullability issue Key: SPARK-30447 URL: https://issues.apache.org/jira/browse/SPARK-30447 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Peter Toth There is a bug in constant propagation due to null handling: SELECT * FROM t WHERE NOT(c = 1 AND c + 1 = 1) returns the rows where c is null, but it shouldn't. The optimizer propagates c = 1 into the second conjunct, folding it to 1 + 1 = 1 (false), so the filter becomes NOT(false) = true; for a null c, however, the original predicate evaluates to null and the row should be filtered out. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
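A minimal reproduction sketch; the table and data are illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("constant-prop-bug").getOrCreate()

// A single row with c = NULL.
spark.sql("CREATE OR REPLACE TEMPORARY VIEW t AS SELECT CAST(NULL AS INT) AS c")

// For c = NULL the predicate (c = 1 AND c + 1 = 1) evaluates to NULL,
// so NOT(...) is NULL and the row must be filtered out. Buggy constant
// propagation substitutes c = 1 into c + 1 = 1, folds the conjunct to
// false, and the filter becomes NOT(false) = true -- wrongly keeping the row.
spark.sql("SELECT * FROM t WHERE NOT(c = 1 AND c + 1 = 1)").show()
{code}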
[jira] [Commented] (SPARK-30446) Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases
[ https://issues.apache.org/jira/browse/SPARK-30446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009830#comment-17009830 ] Thomas Graves commented on SPARK-30446: --- Yeah, so running on standalone: if you set spark.task.cpus=2 (or anything > 1) and you don't set executor cores, it fails even though it shouldn't, because executor cores default to all the cores of the worker: 20/01/07 09:34:02 ERROR Main: Failed to initialize Spark session. org.apache.spark.SparkException: The number of cores per executor (=1) has to be >= the task config: spark.task.cpus = 2 when run on spark://tomg-x299:7077. > Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases > -- > > Key: SPARK-30446 > URL: https://issues.apache.org/jira/browse/SPARK-30446 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > With accelerator aware scheduling, SparkContext.checkResourcesPerTask > tries to make sure that users have configured things properly, and warns or > errors if not. > It doesn't properly handle all cases, like warning when cpu resources are being > wasted. We should test this better and handle those cases. > I fixed these in the stage level scheduling, but I'm not sure of the timeline for > getting that in, so we may want to fix this separately as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30431) Update SqlBase.g4 to create commentSpec pattern as same as locationSpec
[ https://issues.apache.org/jira/browse/SPARK-30431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30431. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27102 [https://github.com/apache/spark/pull/27102] > Update SqlBase.g4 to create commentSpec pattern as same as locationSpec > --- > > Key: SPARK-30431 > URL: https://issues.apache.org/jira/browse/SPARK-30431 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.0.0 > > > In `SqlBase.g4`, the comment clause is used as `COMMENT comment=STRING` and > `COMMENT STRING` in many places, while the location clause, which often > appears along with the comment clause, has a dedicated pattern: > {code:sql} > locationSpec > : LOCATION STRING > ; > {code} > As a result, we have to visit locationSpec as a List but the comment as a single token -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30431) Update SqlBase.g4 to create commentSpec pattern as same as locationSpec
[ https://issues.apache.org/jira/browse/SPARK-30431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30431: --- Assignee: Kent Yao > Update SqlBase.g4 to create commentSpec pattern as same as locationSpec > --- > > Key: SPARK-30431 > URL: https://issues.apache.org/jira/browse/SPARK-30431 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > > In `SqlBase.g4`, the comment clause is used as `COMMENT comment=STRING` and > `COMMENT STRING` in many places, while the location clause, which often > appears along with the comment clause, has a dedicated pattern: > {code:sql} > locationSpec > : LOCATION STRING > ; > {code} > As a result, we have to visit locationSpec as a List but the comment as a single token -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
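Presumably the fix introduces a rule that mirrors locationSpec; a sketch of what that would look like, though the rule actually merged in the PR may differ in details:

{code:sql}
commentSpec
    : COMMENT STRING
    ;
{code}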
[jira] [Commented] (SPARK-30446) Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases
[ https://issues.apache.org/jira/browse/SPARK-30446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009742#comment-17009742 ] Thomas Graves commented on SPARK-30446: --- I think there may also be issues with standalone mode, since executor cores aren't necessarily right there, but I would have to test again to verify that. > Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases > -- > > Key: SPARK-30446 > URL: https://issues.apache.org/jira/browse/SPARK-30446 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > With accelerator aware scheduling, SparkContext.checkResourcesPerTask > tries to make sure that users have configured things properly, and warns or > errors if not. > It doesn't properly handle all cases, like warning when cpu resources are being > wasted. We should test this better and handle those cases. > I fixed these in the stage level scheduling, but I'm not sure of the timeline for > getting that in, so we may want to fix this separately as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30446) Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases
Thomas Graves created SPARK-30446: - Summary: Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases Key: SPARK-30446 URL: https://issues.apache.org/jira/browse/SPARK-30446 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Thomas Graves With accelerator aware scheduling, SparkContext.checkResourcesPerTask tries to make sure that users have configured things properly, and warns or errors if not. It doesn't properly handle all cases, like warning when cpu resources are being wasted. We should test this better and handle those cases. I fixed these in the stage level scheduling, but I'm not sure of the timeline for getting that in, so we may want to fix this separately as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30445) Accelerator aware scheduling handle setting configs to 0 better
Thomas Graves created SPARK-30445: - Summary: Accelerator aware scheduling handle setting configs to 0 better Key: SPARK-30445 URL: https://issues.apache.org/jira/browse/SPARK-30445 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Thomas Graves If you set the resource configs to 0, it errors with divide by zero. While I think ideally the user should just remove the configs, we should handle the 0 better. $ spark-submit --conf spark.driver.resource.gpu.amount=0 --conf spark.executor.resource.gpu.amount=0 --conf spark.task.resource.gpu.amount=0 --conf spark.driver.resource.gpu.discoveryScript=/shared/tools/get_gpu_resources.sh --conf spark.executor.resource.gpu.discoveryScript=/shared/tools/get_gpu_resources.sh test.py 20/01/07 05:36:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark’s default log4j profile: org/apache/spark/log4j-defaults.properties 20/01/07 05:36:43 INFO SparkContext: Running Spark version 3.0.0-preview 20/01/07 05:36:43 INFO ResourceUtils: == 20/01/07 05:36:43 INFO ResourceUtils: Resources for spark.driver: gpu -> [name: gpu, addresses: 0] 20/01/07 05:36:43 INFO ResourceUtils: == 20/01/07 05:36:43 INFO SparkContext: Submitted application: test.py .. 20/01/07 05:36:43 ERROR SparkContext: Error initializing SparkContext. java.lang.ArithmeticException: / by zero at org.apache.spark.SparkContext$.$anonfun$createTaskScheduler$3(SparkContext.scala:2793) at org.apache.spark.SparkContext$.$anonfun$createTaskScheduler$3$adapted(SparkContext.scala:2775) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
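The failing expression is an integer division by the per-task amount. A sketch of the current failure and one friendlier handling, with illustrative names rather than the actual SparkContext.createTaskScheduler code:

{code:scala}
val executorGpuAmount = 0   // spark.executor.resource.gpu.amount
val taskGpuAmount     = 0   // spark.task.resource.gpu.amount

// Current behavior: java.lang.ArithmeticException: / by zero
// val slots = executorGpuAmount / taskGpuAmount

// Friendlier handling: validate up front and treat 0 as "resource unused".
val slots =
  if (taskGpuAmount == 0) 0   // resource not requested per task; never limits
  else executorGpuAmount / taskGpuAmount
{code}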
[jira] [Resolved] (SPARK-30338) Avoid unnecessary InternalRow copies in ParquetRowConverter
[ https://issues.apache.org/jira/browse/SPARK-30338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30338. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26993 [https://github.com/apache/spark/pull/26993] > Avoid unnecessary InternalRow copies in ParquetRowConverter > --- > > Key: SPARK-30338 > URL: https://issues.apache.org/jira/browse/SPARK-30338 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Fix For: 3.0.0 > > > ParquetRowConverter calls {{InternalRow.copy()}} in cases where the copy is > unnecessary; this can severely harm performance when reading deeply-nested > Parquet. > It looks like this copying was originally added to handle arrays and maps of > structs (in which case we need to keep the copying), but we can omit it for > the more common case of structs nested directly in structs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
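To illustrate the aliasing hazard that originally motivated the copy, a toy sketch, not ParquetRowConverter's actual code:

{code:scala}
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow

// Converters typically reuse one mutable row for every record they emit.
val reused = new GenericInternalRow(Array[Any](1, 2))

// Array/map of structs: elements get buffered, so each must be copied --
// without the copy, every buffered entry would alias the same reused row.
val buffered = ArrayBuffer[InternalRow]()
buffered += reused.copy()

// Struct nested directly in a struct: the parent consumes the value
// immediately, so the defensive copy is pure overhead -- the common case
// this issue stops copying.
val consumedImmediately: InternalRow = reused
{code}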
[jira] [Commented] (SPARK-30429) WideSchemaBenchmark fails with OOM
[ https://issues.apache.org/jira/browse/SPARK-30429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009468#comment-17009468 ] L. C. Hsieh commented on SPARK-30429: - Thanks for pinging me. Looking into this. > WideSchemaBenchmark fails with OOM > -- > > Key: SPARK-30429 > URL: https://issues.apache.org/jira/browse/SPARK-30429 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > Attachments: WideSchemaBenchmark_console.txt > > > Run WideSchemaBenchmark on the master (commit > bc16bb1dd095c9e1c8deabf6ac0d528441a81d88) via: > {code} > SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain > org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark" > {code} > This fails with: > {code} > Caused by: java.lang.reflect.InvocationTargetException > [error] at > sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source) > [error] at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > [error] at > java.lang.reflect.Constructor.newInstance(Constructor.java:423) > [error] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$7(TreeNode.scala:468) > [error] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > [error] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$1(TreeNode.scala:467) > [error] at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) > [error] ... 132 more > [error] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded > [error] at java.util.Arrays.copyOfRange(Arrays.java:3664) > [error] at java.lang.String.<init>(String.java:207) > [error] at java.lang.StringBuilder.toString(StringBuilder.java:407) > [error] at > org.apache.spark.sql.types.StructType.catalogString(StructType.scala:411) > [error] at > org.apache.spark.sql.types.StructType.$anonfun$catalogString$1(StructType.scala:410) > [error] at > org.apache.spark.sql.types.StructType$$Lambda$2441/1040526643.apply(Unknown > Source) > {code} > Full stack dump is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org