[jira] [Commented] (SPARK-28450) When scanning Hive data for a non-existent partition, an error is returned
[ https://issues.apache.org/jira/browse/SPARK-28450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890713#comment-16890713 ] angerszhu commented on SPARK-28450: --- [~shivuson...@gmail.com] Just select from a partitioned table with a partition value that does not exist: select * from partitiontable where part_col='not exist'; Hive simply returns a result of 0 rows, but Spark catches this as an error in HiveExternalCatalog. Maybe we should make it behave the same as Hive? > When scanning Hive data for a non-existent partition, an error is returned > -- > > Key: SPARK-28450 > URL: https://issues.apache.org/jira/browse/SPARK-28450 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > Attachments: image-2019-07-19-20-51-12-861.png > > > When we select data from a non-existent partition of a partitioned Hive table, it returns an error, but it should just return an empty result. > !image-2019-07-19-20-51-12-861.png! -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
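The behavioral difference discussed in this comment can be sketched in isolation. The class and method names below are illustrative stand-ins only, not Spark's actual HiveExternalCatalog API:

```java
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Toy contrast (illustrative names, not Spark's HiveExternalCatalog API):
// for an unknown partition value, Hive-style lookup returns an empty result
// so the query yields 0 rows, while the behavior reported in this ticket
// surfaces the missing partition as an error instead.
public class PartitionLookupDemo {
    static final Map<String, List<String>> PARTITIONS = Map.of(
        "2019-07-01", List.of("file1"),
        "2019-07-02", List.of("file2"));

    // Hive-style: unknown partition value -> empty scan, query returns 0 rows.
    public static List<String> lenientLookup(String partValue) {
        return PARTITIONS.getOrDefault(partValue, List.of());
    }

    // Reported Spark behavior: unknown partition value -> error from the catalog.
    public static List<String> strictLookup(String partValue) {
        List<String> files = PARTITIONS.get(partValue);
        if (files == null) {
            throw new NoSuchElementException("Partition not found: " + partValue);
        }
        return files;
    }

    public static void main(String[] args) {
        System.out.println(lenientLookup("not exist").size()); // 0 rows scanned
        try {
            strictLookup("not exist");
        } catch (NoSuchElementException e) {
            System.out.println(e.getMessage()); // the error path the ticket reports
        }
    }
}
```

Making the lookup lenient, as suggested, would align the Spark behavior with Hive's.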
[jira] [Commented] (SPARK-28450) When scanning Hive data for a non-existent partition, an error is returned
[ https://issues.apache.org/jira/browse/SPARK-28450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890710#comment-16890710 ] Shivu Sondur commented on SPARK-28450: -- [~angerszhuuu] I want to check this issue. Can you give me the detailed steps to reproduce it?
[jira] [Commented] (SPARK-28480) Types of input parameters of a UDF affect the ability to cache the result
[ https://issues.apache.org/jira/browse/SPARK-28480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890694#comment-16890694 ] Ivan Tsukanov commented on SPARK-28480: --- ok, let's close the ticket. [~shivuson...@gmail.com], thanks for the help! > Types of input parameters of a UDF affect the ability to cache the result > - > > Key: SPARK-28480 > URL: https://issues.apache.org/jira/browse/SPARK-28480 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Ivan Tsukanov >Priority: Major > Fix For: 2.4.3 > > Attachments: image-2019-07-23-10-58-45-768.png > > > When I define a parameter in a UDF as Boolean or Int, the resulting DataFrame can't be cached > {code:java} > import org.apache.spark.sql.functions.{lit, udf} > val empty = sparkSession.emptyDataFrame > val table = "table" > def test(customUDF: UserDefinedFunction, col: Column): Unit = { > val df = empty.select(customUDF(col)) > df.cache() > df.createOrReplaceTempView(table) > println(sparkSession.catalog.isCached(table)) > } > test(udf { _: String => 42 }, lit("")) // true > test(udf { _: Any => 42 }, lit("")) // true > test(udf { _: Int => 42 }, lit(42)) // false > test(udf { _: Boolean => 42 }, lit(false)) // false > {code} > or sparkSession.catalog.isCached gives incorrect information.
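One plausible mechanism for the primitive-parameter cases above (an assumption on our part, not something stated in this thread): cached plans are found by plan equality, and if analysis wraps a primitive-typed (null-intolerant) UDF input in an extra null-check each time it runs, the plan used for the isCached lookup no longer equals the cached one. A toy Java model of that failure mode, with purely illustrative names:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model (illustrative, not Spark's API) of a cache keyed by plan equality:
// if "analysis" is not idempotent for primitive-typed UDF inputs and adds
// another null-check wrapper on each pass, the re-analyzed plan used for the
// cache lookup no longer matches the plan that was cached.
public class CacheMissDemo {
    // A "plan" is just a string here; each analysis pass wraps primitive-typed
    // inputs in a null-check.
    public static String analyze(String plan, boolean primitiveParam) {
        return primitiveParam ? "ifnull(" + plan + ")" : plan;
    }

    public static void main(String[] args) {
        Set<String> cache = new HashSet<>();

        // Reference-typed parameter: analysis leaves the plan alone, lookup hits.
        cache.add(analyze("udf(col)", false));
        System.out.println(cache.contains(analyze("udf(col)", false))); // true

        // Primitive-typed parameter: a second analysis pass adds another
        // wrapper, so the lookup misses the cached plan.
        cache.add(analyze("udf(col)", true));
        System.out.println(cache.contains(
            analyze(analyze("udf(col)", true), true))); // false
    }
}
```

This would also explain why the issue no longer reproduces on the snapshot shown later in the thread and is marked fixed for 2.4.3, though the thread itself does not confirm the mechanism.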
[jira] [Updated] (SPARK-28480) Types of input parameters of a UDF affect the ability to cache the result
[ https://issues.apache.org/jira/browse/SPARK-28480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Tsukanov updated SPARK-28480: -- Fix Version/s: 2.4.3
[jira] [Commented] (SPARK-28480) Types of input parameters of a UDF affect the ability to cache the result
[ https://issues.apache.org/jira/browse/SPARK-28480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890685#comment-16890685 ] Shivu Sondur commented on SPARK-28480: -- [~itsukanov] In the latest master branch it works fine. See the snapshot below: !image-2019-07-23-10-58-45-768.png!
[jira] [Updated] (SPARK-28480) Types of input parameters of a UDF affect the ability to cache the result
[ https://issues.apache.org/jira/browse/SPARK-28480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivu Sondur updated SPARK-28480: - Attachment: image-2019-07-23-10-58-45-768.png
[jira] [Commented] (SPARK-28480) Types of input parameters of a UDF affect the ability to cache the result
[ https://issues.apache.org/jira/browse/SPARK-28480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890674#comment-16890674 ] Shivu Sondur commented on SPARK-28480: -- I am checking this issue.
[jira] [Updated] (SPARK-27282) Spark returns incorrect results when using UNION with GROUP BY clause
[ https://issues.apache.org/jira/browse/SPARK-27282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-27282: --- Labels: correctness (was: ) > Spark returns incorrect results when using UNION with GROUP BY clause > - > > Key: SPARK-27282 > URL: https://issues.apache.org/jira/browse/SPARK-27282 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit, SQL >Affects Versions: 2.3.2 > Environment: I'm using: > IntelliJ IDEA ==> 2018.1.4 > spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1) > scala ==> 2.11.8 >Reporter: Sofia >Priority: Major > Labels: correctness > > When using a UNION clause after a GROUP BY clause in Spark, the results obtained are wrong. > The following example illustrates the issue: > {code:java} > CREATE TABLE test_un ( > col1 varchar(255), > col2 varchar(255), > col3 varchar(255), > col4 varchar(255) > ); > INSERT INTO test_un (col1, col2, col3, col4) > VALUES (1,1,2,4), > (1,1,2,4), > (1,1,3,5), > (2,2,2,null); > {code} > I used the following code: > {code:java} > val x = Toolkit.HiveToolkit.getDataFromHive("test","test_un") > val y = x > .filter(col("col4")isNotNull) > .groupBy("col1", "col2","col3") > .agg(count(col("col3")).alias("cnt")) > .withColumn("col_name", lit("col3")) > .select(col("col1"), col("col2"), > col("col_name"),col("col3").alias("col_value"), col("cnt")) > val z = x > .filter(col("col4")isNotNull) > .groupBy("col1", "col2","col4") > .agg(count(col("col4")).alias("cnt")) > .withColumn("col_name", lit("col4")) > .select(col("col1"), col("col2"), > col("col_name"),col("col4").alias("col_value"), col("cnt")) > y.union(z).show() > {code} > And I obtained the following results: > ||col1||col2||col_name||col_value||cnt|| > |1|1|col3|5|1| > |1|1|col3|4|2| > |1|1|col4|5|1| > |1|1|col4|4|2| > Expected results: > ||col1||col2||col_name||col_value||cnt|| > |1|1|col3|3|1| > |1|1|col3|2|2| > |1|1|col4|4|2| > |1|1|col4|5|1| > But when I remove the last row of the table > {code:java} > (2,2,2,null){code} > I obtain the correct results.
[jira] [Commented] (SPARK-24079) Update the nullability of Join output based on inferred predicates
[ https://issues.apache.org/jira/browse/SPARK-24079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890672#comment-16890672 ] Josh Rosen commented on SPARK-24079: Update: I just realized that SPARK-27915 is actually a closer duplicate of SPARK-24080, which is very closely related to this ticket. > Update the nullability of Join output based on inferred predicates > -- > > Key: SPARK-24079 > URL: https://issues.apache.org/jira/browse/SPARK-24079 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > In the master, a logical `Join` node does not respect the nullability that > the optimizer rule `InferFiltersFromConstraints` > might change when inferred predicates have `IsNotNull`, e.g., > {code} > scala> val df1 = Seq((Some(1), Some(2))).toDF("k", "v0") > scala> val df2 = Seq((Some(1), Some(3))).toDF("k", "v1") > scala> val joinedDf = df1.join(df2, df1("k") === df2("k"), "inner") > scala> joinedDf.explain > == Physical Plan == > *(2) BroadcastHashJoin [k#83], [k#92], Inner, BuildRight > :- *(2) Project [_1#80 AS k#83, _2#81 AS v0#84] > : +- *(2) Filter isnotnull(_1#80) > : +- LocalTableScan [_1#80, _2#81] > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > true] as bigint))) >+- *(1) Project [_1#89 AS k#92, _2#90 AS v1#93] > +- *(1) Filter isnotnull(_1#89) > +- LocalTableScan [_1#89, _2#90] > scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) > res15: Seq[Boolean] = List(true, true, true, true) > {code} > But, these `nullable` values should be: > {code} > scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) > res15: Seq[Boolean] = List(false, true, false, true) > {code} > This ticket comes from the previous discussion: > https://github.com/apache/spark/pull/18576#pullrequestreview-107585997
[jira] [Updated] (SPARK-28481) More expressions should extend NullIntolerant
[ https://issues.apache.org/jira/browse/SPARK-28481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-28481: --- Description: SPARK-13995 introduced the {{NullIntolerant}} trait to generalize the logic for inferring {{IsNotNull}} constraints from expressions. An expression is _null-intolerant_ if it returns {{null}} when any of its input expressions are {{null}}. I've noticed that _most_ expressions are null-intolerant: anything which extends UnaryExpression / BinaryExpression and keeps the default {{eval}} method will be null-intolerant. However, only a subset of these expressions mix in the {{NullIntolerant}} trait. As a result, we're missing out on the opportunity to infer certain types of non-null constraints: for example, if we see a {{WHERE length(x) > 10}} condition then we know that the column {{x}} must be non-null and can push this non-null filter down to our datasource scan. I can think of a few ways to fix this: # Modify every relevant expression to mix in the {{NullIntolerant}} trait. We can use IDEs or other code-analysis tools (e.g. {{ClassUtil}} plus reflection) to help automate the process of identifying expressions which do not override the default {{eval}}. # Make a backwards-incompatible change to our abstract base class hierarchy to add {{NullSafe*aryExpression}} abstract base classes which define the {{nullSafeEval}} method and implement a {{final eval}} method, then leave {{eval}} unimplemented in the regular {{*aryExpression}} base classes. ** This would fix the somewhat weird code smell that we have today where {{nullSafeEval}} has a default implementation which calls {{sys.error}}. ** This would negatively impact users who have implemented custom Catalyst expressions. # Use runtime reflection to determine whether expressions are null-intolerant by virtue of using one of the default null-intolerant {{eval}} implementations. 
We can then use this in an {{isNullIntolerant}} helper method which checks that classes either (a) extend {{NullIntolerant}} or (b) are null-intolerant according to the reflective check (which is basically just figuring out which concrete implementation the {{eval}} method resolves to). ** We only need to perform the reflection once _per-class_ and can cache the result for the lifetime of the JVM, so the performance overheads would be pretty small (especially compared to other non-cacheable reflection / traversal costs in Catalyst). ** The downside is additional complexity in the code which pattern-matches / checks for null-intolerance. Of these approaches, I'm currently leaning towards option 1 (semi-automated identification and manual update of hundreds of expressions): if we go with that approach then we can perform a one-time catch-up to fix all existing expressions. To handle ongoing maintenance (as we add new expressions), I'd propose to add "is this null-intolerant?" to a checklist to use when reviewing PRs which add new Catalyst expressions. /cc [~maropu] [~viirya]
[jira] [Updated] (SPARK-28481) More expressions should extend NullIntolerant
[ https://issues.apache.org/jira/browse/SPARK-28481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-28481: --- Description: SPARK-13995 introduced the {{NullIntolerant}} trait to generalize the logic for inferring {{IsNotNull}} constraints from expressions. An expression is _null-intolerant_ if it returns {{null}} when any of its input expressions are {{null}}. I've noticed that _most_ expressions are null-intolerant: anything which extends UnaryExpression / BinaryExpression and keeps the default {{eval}} method will be null-intolerant. However, only a subset of these expressions mix in the {{NullIntolerant}} trait. As a result, we're missing out on the opportunity to infer certain types of non-null constraints: for example, if we see a {{WHERE length\(x\) > 10}} condition then we know that the column {{x}} must be non-null and can push this non-null filter down to our datasource scan. I can think of a few ways to fix this: # Modify every relevant expression to mix in the {{NullIntolerant}} trait. We can use IDEs or other code-analysis tools (e.g. {{ClassUtil}} plus reflection) to help automate the process of identifying expressions which do not override the default {{eval}}. # Make a backwards-incompatible change to our abstract base class hierarchy to add {{NullSafe*aryExpression}} abstract base classes which define the {{nullSafeEval}} method and implement a {{final eval}} method, then leave {{eval}} unimplemented in the regular {{*aryExpression}} base classes. ** This would fix the somewhat weird code smell that we have today where {{nullSafeEval}} has a default implementation which calls {{sys.error}}. ** This would negatively impact users who have implemented custom Catalyst expressions. # Use runtime reflection to determine whether expressions are null-intolerant by virtue of using one of the default null-intolerant {{eval}} implementations. 
We can then use this in an {{isNullIntolerant}} helper method which checks that classes either (a) extend {{NullIntolerant}} or (b) are null-intolerant according to the reflective check (which is basically just figuring out which concrete implementation the {{eval}} method resolves to). ** We only need to perform the reflection once _per-class_ and can cache the result for the lifetime of the JVM, so the performance overheads would be pretty small (especially compared to other non-cacheable reflection / traversal costs in Catalyst). ** The downside is additional complexity in the code which pattern-matches / checks for null-intolerance. Of these approaches, I'm currently leaning towards option 1 (semi-automated identification and manual update of hundreds of expressions): if we go with that approach then we can perform a one-time catch-up to fix all existing expressions. To handle ongoing maintenance (as we add new expressions), I'd propose to add "is this null-intolerant?" to a checklist to use when reviewing PRs which add new Catalyst expressions. /cc [~maropu] [~viirya]
[jira] [Updated] (SPARK-28481) More expressions should extend NullIntolerant
[ https://issues.apache.org/jira/browse/SPARK-28481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-28481: --- Description: SPARK-13995 introduced the {{NullIntolerant}} trait to generalize the logic for inferring {{IsNotNull}} constraints from expressions. An expression is _null-intolerant_ if it returns {{null}} when any of its input expressions are {{null}}. I've noticed that _most_ expressions are null-intolerant: anything which extends UnaryExpression / BinaryExpression and keeps the default {{eval}} method will be null-intolerant. However, only a subset of these expressions mix in the {{NullIntolerant}} trait. As a result, we're missing out on the opportunity to infer certain types of non-null constraints: for example, if we see a {{WHERE length(x) > 10}} condition then we know that the column {{x}} must be non-null and can push this non-null filter down to our datasource scan. I can think of a few ways to fix this: # Modify every relevant expression to mix in the {{NullIntolerant}} trait. We can use IDEs or other code-analysis tools (e.g. {{ClassUtil}} plus reflection) to help automate the process of identifying expressions which do not override the default {{eval}}. # Make a backwards-incompatible change to our abstract base class hierarchy to add {{NullSafe*aryExpression}} abstract base classes which define the {{nullSafeEval}} method and implement a {{final eval}} method, then leave {{eval}} unimplemented in the regular {{*aryExpression}} base classes. ** This would fix the somewhat weird code smell that we have today where {{nullSafeEval}} has a default implementation which calls {{sys.error}}. ** This would negatively impact users who have implemented custom Catalyst expressions. # Use runtime reflection to determine whether expressions are null-intolerant by virtue of using one of the default null-intolerant {{eval}} implementations. 
We can then use this in an {{isNullIntolerant}} helper method which checks that classes either (a) extend {{NullIntolerant}} or (b) are null-intolerant according to the reflective check (which is basically just figuring out which concrete implementation the {{eval}} method resolves to). ** We only need to perform the reflection once _per-class_ and can cache the result for the lifetime of the JVM, so the performance overheads would be pretty small (especially compared to other non-cacheable reflection / traversal costs in Catalyst). ** The downside is additional complexity in the code which pattern-matches / checks for null-intolerance. Of these approaches, I'm currently leaning towards option 1 (semi-automated identification and manual update of hundreds of expressions): if we go with that approach then we can perform a one-time catch-up to fix all existing expressions. To handle ongoing maintenance (as we add new expressions), I'd propose to add "is this null-intolerant?" to a checklist to use when reviewing PRs which add new Catalyst expressions. /cc [~maropu] [~viirya]
[jira] [Created] (SPARK-28481) More expressions should extend NullIntolerant
Josh Rosen created SPARK-28481: -- Summary: More expressions should extend NullIntolerant Key: SPARK-28481 URL: https://issues.apache.org/jira/browse/SPARK-28481 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Josh Rosen SPARK-13995 introduced the {{NullIntolerant}} trait to generalize the logic for inferring {{IsNotNull}} constraints from expressions. An expression is _null-intolerant_ if it returns {{null}} when any of its input expressions are {{null}}. I've noticed that _most_ expressions are null-intolerant: anything which extends UnaryExpression / BinaryExpression and keeps the default {{eval}} method will be null-intolerant. However, only a subset of these expressions mix in the {{NullIntolerant}} trait. As a result, we're missing out on the opportunity to infer certain types of non-null constraints: for example, if we see a {{WHERE length(x) > 10}} condition then we know that the column {{x}} must be non-null and can push this non-null filter down to our datasource scan. I can think of a few ways to fix this: # Modify every relevant expression to mix in the {{NullIntolerant}} trait. We can use IDEs or other code-analysis tools (e.g. {{ClassUtil}} plus reflection) to help automate the process of identifying expressions which do not override the default {{eval}}. # Make a backwards-incompatible change to our abstract base class hierarchy to add {{NullSafe*aryExpression}} abstract base classes which define the {{nullSafeEval}} method and implement a {{final eval}} method, then leave {{eval}} unimplemented in the regular {{*aryExpression}} base classes. ** This would fix the somewhat weird code smell that we have today where {{nullSafeEval}} has a default implementation which calls {{sys.error}}. ** This would negatively impact users who have implemented custom Catalyst expressions. 
# Use runtime reflection to determine whether expressions are null-intolerant by virtue of using one of the default null-intolerant {{eval}} implementations. We can then use this in an {{isNullIntolerant}} helper method which checks that classes either (a) extend {{NullIntolerant}} or (b) are null-intolerant according to the reflective check (which is basically just figuring out which concrete implementation the {{eval}} method resolves to). ** We only need to perform the reflection once _per-class_ and can cache the result for the lifetime of the JVM, so the performance overheads would be pretty small (especially compared to other non-cacheable reflection / traversal costs in Catalyst). ** The downside is additional complexity in the code which pattern-matches / checks for null-intolerance. Of these approaches, I'm currently leaning towards option 1 (semi-automated identification and manual update of hundreds of expressions): if we go with that approach then we can perform a one-time catch-up to fix all existing expressions. To handle ongoing maintenance (as we add new expressions), I'd propose to add "is this null-intolerant?" to a checklist to use when reviewing PRs which add new Catalyst expressions. /cc [~maropu] [~viirya]
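Option 3 above can be sketched with plain JVM reflection. The classes below are illustrative stand-ins written in Java for self-containment, not Spark's actual Catalyst hierarchy: decide null-intolerance by checking, once per class, whether the concrete `eval` implementation is still the default one declared on the base class.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the reflective null-intolerance check (stand-in classes only).
public class NullIntoleranceCheck {
    // Stand-in base class: the default eval propagates null via nullSafeEval,
    // mirroring the UnaryExpression/BinaryExpression pattern described above.
    public static abstract class BaseExpr {
        public Object eval(Object input) {
            return input == null ? null : nullSafeEval(input);
        }
        protected Object nullSafeEval(Object input) {
            throw new UnsupportedOperationException("nullSafeEval not implemented");
        }
    }

    // Keeps the default eval => null-intolerant.
    public static class LengthExpr extends BaseExpr {
        @Override protected Object nullSafeEval(Object input) {
            return input.toString().length();
        }
    }

    // Overrides eval with its own null handling => not detectably null-intolerant.
    public static class CoalesceLikeExpr extends BaseExpr {
        @Override public Object eval(Object input) {
            return input == null ? "fallback" : input;
        }
    }

    private static final Map<Class<?>, Boolean> CACHE = new ConcurrentHashMap<>();

    // Reflect once per class and cache the answer for the lifetime of the JVM.
    public static boolean usesDefaultEval(Class<?> cls) {
        return CACHE.computeIfAbsent(cls, c -> {
            try {
                Method eval = c.getMethod("eval", Object.class);
                // If the concrete eval resolves to the base-class declaration,
                // the expression inherited the default null-propagating path.
                return eval.getDeclaringClass() == BaseExpr.class;
            } catch (NoSuchMethodException e) {
                return false;
            }
        });
    }

    public static void main(String[] args) {
        System.out.println(usesDefaultEval(LengthExpr.class));       // true
        System.out.println(usesDefaultEval(CoalesceLikeExpr.class)); // false
    }
}
```

An `isNullIntolerant` helper along the lines of option 3 would combine this reflective answer with an explicit `instanceof NullIntolerant`-style check; the per-class cache keeps the reflection cost to a one-time lookup.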
[jira] [Resolved] (SPARK-28469) Change CalendarIntervalType's readable string representation from calendarinterval to interval
[ https://issues.apache.org/jira/browse/SPARK-28469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28469. --- Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/25225 > Change CalendarIntervalType's readable string representation from > calendarinterval to interval > -- > > Key: SPARK-28469 > URL: https://issues.apache.org/jira/browse/SPARK-28469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > We should update CalendarIntervalType's simpleString from calendarinterval to > interval. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28451) substr returns different values
[ https://issues.apache.org/jira/browse/SPARK-28451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890652#comment-16890652 ] Dongjoon Hyun commented on SPARK-28451: --- Thank you for the explanation, [~maropu]. > substr returns different values > --- > > Key: SPARK-28451 > URL: https://issues.apache.org/jira/browse/SPARK-28451 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > PostgreSQL: > {noformat} > postgres=# select substr('1234567890', -1, 5); > substr > > 123 > (1 row) > postgres=# select substr('1234567890', 1, -1); > ERROR: negative substring length not allowed > {noformat} > Spark SQL: > {noformat} > spark-sql> select substr('1234567890', -1, 5); > 0 > spark-sql> select substr('1234567890', 1, -1); > spark-sql> > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
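For reference, the PostgreSQL behaviour quoted in the ticket can be made concrete with a small helper. This is a hypothetical illustration, not Spark's implementation: positions are 1-indexed, a non-positive start shifts the window left and is clipped at position 1, and a negative length is an error.

```scala
// Hypothetical helper mimicking PostgreSQL's substr semantics; shown only to
// make the expected results concrete (NOT Spark's implementation).
def pgSubstr(s: String, start: Int, length: Int): String = {
  require(length >= 0, "negative substring length not allowed")
  val end = start + length                         // exclusive end, 1-indexed
  val lo  = math.max(start, 1)                     // clip the window at position 1
  val hi  = math.min(math.max(end, 1), s.length + 1)
  if (lo >= hi) "" else s.substring(lo - 1, hi - 1)
}
```

Under these rules `pgSubstr("1234567890", -1, 5)` returns `"123"` (the window covers positions -1..3, of which 1..3 exist), and `pgSubstr("1234567890", 1, -1)` throws, matching the PostgreSQL session in the ticket.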
[jira] [Created] (SPARK-28480) Types of input parameters of a UDF affect the ability to cache the result
Ivan Tsukanov created SPARK-28480: - Summary: Types of input parameters of a UDF affect the ability to cache the result Key: SPARK-28480 URL: https://issues.apache.org/jira/browse/SPARK-28480 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.1 Reporter: Ivan Tsukanov When I define a parameter in a UDF as Boolean or Int the result DataFrame can't be cached {code:java} import org.apache.spark.sql.functions.{lit, udf} val empty = sparkSession.emptyDataFrame val table = "table" def test(customUDF: UserDefinedFunction, col: Column): Unit = { val df = empty.select(customUDF(col)) df.cache() df.createOrReplaceTempView(table) println(sparkSession.catalog.isCached(table)) } test(udf { _: String => 42 }, lit("")) // true test(udf { _: Any => 42 }, lit("")) // true test(udf { _: Int => 42 }, lit(42)) // false test(udf { _: Boolean => 42 }, lit(false)) // false {code} or sparkSession.catalog.isCached gives irrelevant information. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28480) Types of input parameters of a UDF affect the ability to cache the result
[ https://issues.apache.org/jira/browse/SPARK-28480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Tsukanov updated SPARK-28480: -- Description: When I define a parameter in a UDF as Boolean or Int the result DataFrame can't be cached {code:java} import org.apache.spark.sql.functions.{lit, udf} val empty = sparkSession.emptyDataFrame val table = "table" def test(customUDF: UserDefinedFunction, col: Column): Unit = { val df = empty.select(customUDF(col)) df.cache() df.createOrReplaceTempView(table) println(sparkSession.catalog.isCached(table)) } test(udf { _: String => 42 }, lit("")) // true test(udf { _: Any => 42 }, lit("")) // true test(udf { _: Int => 42 }, lit(42)) // false test(udf { _: Boolean => 42 }, lit(false)) // false {code} or sparkSession.catalog.isCached gives irrelevant information. was: When I define a parameter in a UDF as Boolean or Int the result DataFrame can't be cached {code:java} import org.apache.spark.sql.functions.{lit, udf} val empty = sparkSession.emptyDataFrame val table = "table" def test(customUDF: UserDefinedFunction, col: Column): Unit = { val df = empty.select(customUDF(col)) df.cache() df.createOrReplaceTempView(table) println(sparkSession.catalog.isCached(table)) } test(udf { _: String => 42 }, lit("")) // true test(udf { _: Any => 42 }, lit("")) // true test(udf { _: Int => 42 }, lit(42)) // false test(udf { _: Boolean => 42 }, lit(false)) // false {code} or sparkSession.catalog.isCached gives irrelevant information. 
> Types of input parameters of a UDF affect the ability to cache the result > - > > Key: SPARK-28480 > URL: https://issues.apache.org/jira/browse/SPARK-28480 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Ivan Tsukanov >Priority: Major > > When I define a parameter in a UDF as Boolean or Int the result DataFrame > can't be cached > {code:java} > import org.apache.spark.sql.functions.{lit, udf} > val empty = sparkSession.emptyDataFrame > val table = "table" > def test(customUDF: UserDefinedFunction, col: Column): Unit = { > val df = empty.select(customUDF(col)) > df.cache() > df.createOrReplaceTempView(table) > println(sparkSession.catalog.isCached(table)) > } > test(udf { _: String => 42 }, lit("")) // true > test(udf { _: Any => 42 }, lit("")) // true > test(udf { _: Int => 42 }, lit(42)) // false > test(udf { _: Boolean => 42 }, lit(false)) // false > {code} > or sparkSession.catalog.isCached gives irrelevant information. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28479) Parser error when enabling ANSI mode
Yuming Wang created SPARK-28479: --- Summary: Parser error when enabling ANSI mode Key: SPARK-28479 URL: https://issues.apache.org/jira/browse/SPARK-28479 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang Case 1: {code:sql} spark-sql> set spark.sql.parser.ansi.enabled=true; spark.sql.parser.ansi.enabled true spark-sql> select extract(year from timestamp '2001-02-16 20:38:40') ; Error in query: no viable alternative at input 'year'(line 1, pos 15) == SQL == select extract(year from timestamp '2001-02-16 20:38:40') ---^^^ spark-sql> set spark.sql.parser.ansi.enabled=false; spark.sql.parser.ansi.enabled false spark-sql> select extract(year from timestamp '2001-02-16 20:38:40') ; 2001 {code} Case 2: {code:sql} spark-sql> select left('12345', 2); 12 spark-sql> set spark.sql.parser.ansi.enabled=true; spark.sql.parser.ansi.enabled true spark-sql> select left('12345', 2); Error in query: no viable alternative at input 'left'(line 1, pos 7) == SQL == select left('12345', 2) ---^^^ {code} https://github.com/apache/spark/pull/25114#issuecomment-512229758 -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25349) Support sample pushdown in Data Source V2
[ https://issues.apache.org/jira/browse/SPARK-25349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890623#comment-16890623 ] Weichen Xu commented on SPARK-25349: I will work on this. Thanks! > Support sample pushdown in Data Source V2 > - > > Key: SPARK-25349 > URL: https://issues.apache.org/jira/browse/SPARK-25349 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Supporting sample pushdown would help a file-based data source implementation > save I/O cost significantly if it can decide whether to read a file or not. > > cc: [~cloud_fan] -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28478) Optimizer rule to remove unnecessary explicit null checks for null-intolerant expressions (e.g. if(x is null, x, f(x)))
Josh Rosen created SPARK-28478: -- Summary: Optimizer rule to remove unnecessary explicit null checks for null-intolerant expressions (e.g. if(x is null, x, f(x))) Key: SPARK-28478 URL: https://issues.apache.org/jira/browse/SPARK-28478 Project: Spark Issue Type: Improvement Components: Optimizer, SQL Affects Versions: 3.0.0 Reporter: Josh Rosen I ran across a family of expressions like {code:java} if(x is null, x, substring(x, 0, 1024)){code} or {code:java} when($"x".isNull, $"x", substring($"x", 0, 1024)){code} that were written this way because the query author was unsure about whether {{substring}} would return {{null}} when its input string argument is null. This explicit null-handling is unnecessary and adds bloat to the generated code, especially if it's done via a {{CASE}} statement (which compiles down to a {{do-while}} loop). In another case I saw a query compiler which automatically generated this type of code. It would be cool if Spark could automatically optimize such queries to remove these redundant null checks. Here's a sketch of what such a rule might look like (assuming that SPARK-28477 has been implemented, so we only need to worry about the {{IF}} case): * In the pattern match, check the following three conditions in the following order (to benefit from short-circuiting) ** The {{IF}} condition is an explicit null-check of a column {{c}} ** The {{true}} expression returns either {{c}} or {{null}} ** The {{false}} expression is a _null-intolerant_ expression with {{c}} as a _direct_ child. * If this condition matches, replace the entire {{If}} with the {{false}} branch's expression. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
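The three-condition pattern match sketched in this ticket can be written out on a toy expression tree (illustrative names, not the real Catalyst API):

```scala
// Toy expression tree showing the proposed rewrite:
// IF(c IS NULL, c | null, f(c)) => f(c), when f is null-intolerant.
sealed trait Expr
trait NullIntolerant extends Expr
case class Col(name: String) extends Expr
case object NullLit extends Expr
case class IsNull(child: Expr) extends Expr
case class Substring(str: Expr, pos: Int, len: Int) extends Expr with NullIntolerant
case class If(cond: Expr, ifTrue: Expr, ifFalse: Expr) extends Expr

def directChildren(e: Expr): Seq[Expr] = e match {
  case Substring(s, _, _) => Seq(s)
  case IsNull(c)          => Seq(c)
  case If(c, t, f)        => Seq(c, t, f)
  case _                  => Nil
}

// Checks the ticket's three conditions in order: (1) the IF condition is a
// null check on a column c, (2) the true branch returns c or null, (3) the
// false branch is null-intolerant with c as a direct child.
def removeRedundantNullCheck(e: Expr): Expr = e match {
  case If(IsNull(c: Col), ifTrue, ifFalse: NullIntolerant)
      if (ifTrue == c || ifTrue == NullLit) && directChildren(ifFalse).contains(c) =>
    ifFalse
  case other => other
}
```

Applied to the ticket's example, `If(IsNull(Col("x")), Col("x"), Substring(Col("x"), 0, 1024))` collapses to just the `Substring` expression.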
[jira] [Created] (SPARK-28477) Rewrite `CASE WHEN cond THEN ifTrue OTHERWISE ifFalse` END into `IF(cond, ifTrue, ifFalse`
Josh Rosen created SPARK-28477: -- Summary: Rewrite `CASE WHEN cond THEN ifTrue OTHERWISE ifFalse` END into `IF(cond, ifTrue, ifFalse` Key: SPARK-28477 URL: https://issues.apache.org/jira/browse/SPARK-28477 Project: Spark Issue Type: Improvement Components: Optimizer, SQL Affects Versions: 3.0.0 Reporter: Josh Rosen Spark SQL has both {{CASE WHEN}} and {{IF}} expressions. I've seen many cases where end-users write {code:java} when(x, ifTrue).otherwise(ifFalse){code} because Spark doesn't have a {{org.apache.spark.sql.functions._}} method for the {{If}} expression. Unfortunately, {{CASE WHEN}} generates substantial code bloat because its codegen is implemented using a {{do-while}} loop. In some performance-critical frameworks, I've modified our code to directly construct the Catalyst {{If}} expression, but this is toilsome and confusing to end-users. If we have a {{CASE WHEN}} which has only two branches, like the example given above, then Spark should automatically rewrite it into a simple {{IF}} expression. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
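The rewrite is mechanical for the two-branch case. A toy sketch (these are not the real Catalyst `CaseWhen`/`If` classes):

```scala
// Toy sketch of the proposed rule: a CASE WHEN with exactly one
// (condition, value) branch plus an OTHERWISE value is just an IF.
sealed trait Expr
case class Col(name: String) extends Expr
case class Lit(v: Any) extends Expr
case class If(cond: Expr, ifTrue: Expr, ifFalse: Expr) extends Expr
case class CaseWhen(branches: Seq[(Expr, Expr)], elseValue: Option[Expr]) extends Expr

def rewriteToIf(e: Expr): Expr = e match {
  case CaseWhen(Seq((cond, value)), Some(elseValue)) => If(cond, value, elseValue)
  case other => other // more than one branch, or no else clause: leave unchanged
}
```

Only the single-branch-with-else shape is rewritten, which is exactly the `when(x, ifTrue).otherwise(ifFalse)` pattern described above; multi-branch `CASE WHEN` expressions are left alone.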
[jira] [Updated] (SPARK-28477) Rewrite `CASE WHEN cond THEN ifTrue OTHERWISE ifFalse` END into `IF(cond, ifTrue, ifFalse)`
[ https://issues.apache.org/jira/browse/SPARK-28477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-28477: --- Summary: Rewrite `CASE WHEN cond THEN ifTrue OTHERWISE ifFalse` END into `IF(cond, ifTrue, ifFalse)` (was: Rewrite `CASE WHEN cond THEN ifTrue OTHERWISE ifFalse` END into `IF(cond, ifTrue, ifFalse`) > Rewrite `CASE WHEN cond THEN ifTrue OTHERWISE ifFalse` END into `IF(cond, > ifTrue, ifFalse)` > --- > > Key: SPARK-28477 > URL: https://issues.apache.org/jira/browse/SPARK-28477 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Major > > Spark SQL has both {{CASE WHEN}} and {{IF}} expressions. > I've seen many cases where end-users write > {code:java} > when(x, ifTrue).otherwise(ifFalse){code} > because Spark doesn't have a {{org.apache.spark.sql.functions._}} method for > the {{If}} expression. > Unfortunately, {{CASE WHEN}} generates substantial code bloat because its > codegen is implemented using a {{do-while}} loop. In some performance-critical > frameworks, I've modified our code to directly construct the Catalyst {{If}} > expression, but this is toilsome and confusing to end-users. > If we have a {{CASE WHEN}} which has only two branches, like the example > given above, then Spark should automatically rewrite it into a simple {{IF}} > expression. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28431) CSV datasource throw com.univocity.parsers.common.TextParsingException with large size message
[ https://issues.apache.org/jira/browse/SPARK-28431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28431. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25184 [https://github.com/apache/spark/pull/25184] > CSV datasource throw com.univocity.parsers.common.TextParsingException with > large size message > --- > > Key: SPARK-28431 > URL: https://issues.apache.org/jira/browse/SPARK-28431 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > Fix For: 3.0.0 > > > CSV datasource throw com.univocity.parsers.common.TextParsingException with > large size message, which will make log output consume large disk space. > Reproduce code > {code:java} > val s = "a" * 40 * 100 > Seq(s).toDF.write.mode("overwrite").csv("/tmp/bogdan/es4196.csv") > spark.read .option("maxCharsPerColumn", 3000) > .csv("/tmp/bogdan/es4196.csv").count{code} > Because of maxCharsPerColumn limit of 30M, there will be a > TextParsingException. The message of this exception actually includes what > was parsed so far, in this case 30M chars. > > This issue is troublesome when we sometimes need to parse CSV with large > columns. We should truncate the large size message in the TextParsingException. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
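The fix this ticket asks for is a cap on the exception message before it reaches the logs. A minimal sketch of that kind of truncation (a hypothetical helper, not the actual patch in PR 25184):

```scala
// Hypothetical truncation helper: keep the head of an oversized message and
// note how much was dropped, so a multi-megabyte parse buffer never reaches
// the logs verbatim.
def truncateMessage(msg: String, maxLen: Int = 1000): String =
  if (msg == null || msg.length <= maxLen) msg
  else msg.take(maxLen) + s"... (${msg.length - maxLen} more characters truncated)"
```

A short message passes through untouched; only messages over the cap are cut, with the dropped length recorded so the truncation itself is visible in the log.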
[jira] [Assigned] (SPARK-28431) CSV datasource throw com.univocity.parsers.common.TextParsingException with large size message
[ https://issues.apache.org/jira/browse/SPARK-28431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28431: Assignee: Weichen Xu > CSV datasource throw com.univocity.parsers.common.TextParsingException with > large size message > --- > > Key: SPARK-28431 > URL: https://issues.apache.org/jira/browse/SPARK-28431 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > > CSV datasource throw com.univocity.parsers.common.TextParsingException with > large size message, which will make log output consume large disk space. > Reproduce code > {code:java} > val s = "a" * 40 * 100 > Seq(s).toDF.write.mode("overwrite").csv("/tmp/bogdan/es4196.csv") > spark.read .option("maxCharsPerColumn", 3000) > .csv("/tmp/bogdan/es4196.csv").count{code} > Because of maxCharsPerColumn limit of 30M, there will be a > TextParsingException. The message of this exception actually includes what > was parsed so far, in this case 30M chars. > > This issue is troublesome when we sometimes need to parse CSV with large > columns. We should truncate the large size message in the TextParsingException. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28085) Spark Scala API documentation URLs not working properly in Chrome
[ https://issues.apache.org/jira/browse/SPARK-28085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890564#comment-16890564 ] Hyukjin Kwon commented on SPARK-28085: -- My Chrome is 75 and I am seeing this issue FWIW. > Spark Scala API documentation URLs not working properly in Chrome > - > > Key: SPARK-28085 > URL: https://issues.apache.org/jira/browse/SPARK-28085 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.3 >Reporter: Andrew Leverentz >Priority: Minor > > In Chrome version 75, URLs in the Scala API documentation are not working > properly, which makes them difficult to bookmark. > For example, URLs like the following get redirected to a generic "root" > package page: > [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html] > [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset] > Here's the URL that I get redirected to: > [https://spark.apache.org/docs/latest/api/scala/index.html#package] > This issue seems to have appeared between versions 74 and 75 of Chrome, but > the documentation URLs still work in Safari. I suspect that this has > something to do with security-related changes to how Chrome 75 handles frames > and/or redirects. I've reported this issue to the Chrome team via the > in-browser help menu, but I don't have any visibility into their response, so > it's not clear whether they'll consider this a bug or "working as intended". -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28451) substr returns different values
[ https://issues.apache.org/jira/browse/SPARK-28451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890558#comment-16890558 ] Takeshi Yamamuro commented on SPARK-28451: -- I don't have any standard reference for this behaviour though, +1 for Dongjoon's opinion; if the standard defines this behaviour explicitly, it might be worth fixing this. btw, the current ansi mode we have (spark.sql.parser.ansi.enabled) only affects the spark parser behaviour now, so we might need another new option for this kind of behaviour change to follow the standard. > substr returns different values > --- > > Key: SPARK-28451 > URL: https://issues.apache.org/jira/browse/SPARK-28451 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > PostgreSQL: > {noformat} > postgres=# select substr('1234567890', -1, 5); > substr > > 123 > (1 row) > postgres=# select substr('1234567890', 1, -1); > ERROR: negative substring length not allowed > {noformat} > Spark SQL: > {noformat} > spark-sql> select substr('1234567890', -1, 5); > 0 > spark-sql> select substr('1234567890', 1, -1); > spark-sql> > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28451) substr returns different values
[ https://issues.apache.org/jira/browse/SPARK-28451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890552#comment-16890552 ] Dongjoon Hyun commented on SPARK-28451: --- We already have `ansi` mode and default(non-ansi) mode. Do you have a reference for the standard? (cc [~smilegator] and [~maropu].) If there is no reference, I'd like to stick to the current existing Spark behavior only. > substr returns different values > --- > > Key: SPARK-28451 > URL: https://issues.apache.org/jira/browse/SPARK-28451 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > PostgreSQL: > {noformat} > postgres=# select substr('1234567890', -1, 5); > substr > > 123 > (1 row) > postgres=# select substr('1234567890', 1, -1); > ERROR: negative substring length not allowed > {noformat} > Spark SQL: > {noformat} > spark-sql> select substr('1234567890', -1, 5); > 0 > spark-sql> select substr('1234567890', 1, -1); > spark-sql> > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28474) Lower JDBC client cannot read binary type
[ https://issues.apache.org/jira/browse/SPARK-28474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890549#comment-16890549 ] Yuming Wang commented on SPARK-28474: - I'm working on it. > Lower JDBC client cannot read binary type > - > > Key: SPARK-28474 > URL: https://issues.apache.org/jira/browse/SPARK-28474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Logs: > {noformat} > java.lang.RuntimeException: java.lang.ClassCastException: [B incompatible > with java.lang.String > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) > at > java.security.AccessController.doPrivileged(AccessController.java:770) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) > at com.sun.proxy.$Proxy26.fetchResults(Unknown Source) > at > org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:455) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:819) > Caused by: java.lang.ClassCastException: [B incompatible with java.lang.String > at > org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:198) > at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60) > at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:148) > at > org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220) > at > org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:785) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) > ... 18 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28432) Date/Time Functions: make_date
[ https://issues.apache.org/jira/browse/SPARK-28432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28432. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25210 [https://github.com/apache/spark/pull/25210] > Date/Time Functions: make_date > -- > > Key: SPARK-28432 > URL: https://issues.apache.org/jira/browse/SPARK-28432 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > ||Function||Return Type||Description||Example||Result|| > |{{make_date(_year_ }}{{int}}{{, _month_ }}{{int}}{{, _day_ > }}{{int}}{{)}}|{{date}}|Create date from year, month and day > fields|{{make_date(2013, 7, 15)}}|{{2013-07-15}}| > https://www.postgresql.org/docs/11/functions-datetime.html -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28432) Add `make_date` function
[ https://issues.apache.org/jira/browse/SPARK-28432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28432: -- Summary: Add `make_date` function (was: Date/Time Functions: make_date) > Add `make_date` function > > > Key: SPARK-28432 > URL: https://issues.apache.org/jira/browse/SPARK-28432 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > ||Function||Return Type||Description||Example||Result|| > |{{make_date(_year_ }}{{int}}{{, _month_ }}{{int}}{{, _day_ > }}{{int}}{{)}}|{{date}}|Create date from year, month and day > fields|{{make_date(2013, 7, 15)}}|{{2013-07-15}}| > https://www.postgresql.org/docs/11/functions-datetime.html -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28432) Date/Time Functions: make_date
[ https://issues.apache.org/jira/browse/SPARK-28432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28432: - Assignee: Maxim Gekk > Date/Time Functions: make_date > -- > > Key: SPARK-28432 > URL: https://issues.apache.org/jira/browse/SPARK-28432 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Maxim Gekk >Priority: Major > > ||Function||Return Type||Description||Example||Result|| > |{{make_date(_year_ }}{{int}}{{, _month_ }}{{int}}{{, _day_ > }}{{int}}{{)}}|{{date}}|Create date from year, month and day > fields|{{make_date(2013, 7, 15)}}|{{2013-07-15}}| > https://www.postgresql.org/docs/11/functions-datetime.html -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28469) Change CalendarIntervalType's readable string representation from calendarinterval to interval
[ https://issues.apache.org/jira/browse/SPARK-28469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28469: -- Summary: Change CalendarIntervalType's readable string representation from calendarinterval to interval (was: Add simpleString for CalendarIntervalType) > Change CalendarIntervalType's readable string representation from > calendarinterval to interval > -- > > Key: SPARK-28469 > URL: https://issues.apache.org/jira/browse/SPARK-28469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > We should update CalendarIntervalType's simpleString from calendarinterval to > interval. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28455) Executor may be timed out too soon because of overflow in tracking code
[ https://issues.apache.org/jira/browse/SPARK-28455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28455: - Assignee: Marcelo Vanzin > Executor may be timed out too soon because of overflow in tracking code > --- > > Key: SPARK-28455 > URL: https://issues.apache.org/jira/browse/SPARK-28455 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > > This affects the new code added in SPARK-27963 (so normal dynamic allocation > is fine). There's an overflow issue in that code that may cause executors to > be timed out early with the default configuration. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28455) Executor may be timed out too soon because of overflow in tracking code
[ https://issues.apache.org/jira/browse/SPARK-28455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28455. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25208 [https://github.com/apache/spark/pull/25208] > Executor may be timed out too soon because of overflow in tracking code > --- > > Key: SPARK-28455 > URL: https://issues.apache.org/jira/browse/SPARK-28455 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 3.0.0 > > > This affects the new code added in SPARK-27963 (so normal dynamic allocation > is fine). There's an overflow issue in that code that may cause executors to > be timed out early with the default configuration. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28476) Support ALTER DATABASE SET LOCATION
[ https://issues.apache.org/jira/browse/SPARK-28476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28476: Target Version/s: 3.0.0 > Support ALTER DATABASE SET LOCATION > --- > > Key: SPARK-28476 > URL: https://issues.apache.org/jira/browse/SPARK-28476 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > We can support the syntax of ALTER (DATABASE|SCHEMA) database_name SET > LOCATION path > Ref: [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL] > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28476) Support ALTER DATABASE SET LOCATION
Xiao Li created SPARK-28476: --- Summary: Support ALTER DATABASE SET LOCATION Key: SPARK-28476 URL: https://issues.apache.org/jira/browse/SPARK-28476 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Xiao Li We can support the syntax of ALTER (DATABASE|SCHEMA) database_name SET LOCATION path Ref: [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL] -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28468) `do-release-docker.sh` fails at `sphinx` installation to `Python 2.7`
[ https://issues.apache.org/jira/browse/SPARK-28468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28468. --- Resolution: Fixed Fix Version/s: 2.4.4 This is resolved via https://github.com/apache/spark/pull/25226 > `do-release-docker.sh` fails at `sphinx` installation to `Python 2.7` > - > > Key: SPARK-28468 > URL: https://issues.apache.org/jira/browse/SPARK-28468 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.4 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Blocker > Fix For: 2.4.4 > > > `do-release-docker.sh` fails at `sphinx` installation to `Python 2.7`. > {code} > $ dev/create-release/do-release-docker.sh -d /tmp/spark-2.4.4 -n > {code} > The following is the same reproducible step. > {code} > $ docker build -t spark-rm-test2 --build-arg UID=501 > dev/create-release/spark-rm > {code} > This happens in `branch-2.4` only. > {code} > root@4e196b3d7611:/# lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description:Ubuntu 16.04.6 LTS > Release:16.04 > Codename: xenial > root@4e196b3d7611:/# pip install sphinx > Collecting sphinx > Downloading > https://files.pythonhosted.org/packages/89/1e/64c77163706556b647f99d67b42fced9d39ae6b1b86673965a2cd28037b5/Sphinx-2.1.2.tar.gz > (6.3MB) > 100% || 6.3MB 316kB/s > Complete output from command python setup.py egg_info: > ERROR: Sphinx requires at least Python 3.5 to run. > > Command "python setup.py egg_info" failed with error code 1 in > /tmp/pip-build-7usNN9/sphinx/ > You are using pip version 8.1.1, however version 19.1.1 is available. > You should consider upgrading via the 'pip install --upgrade pip' command. > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28468) Upgrade pip to fix `sphinx` install error
[ https://issues.apache.org/jira/browse/SPARK-28468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28468: -- Summary: Upgrade pip to fix `sphinx` install error (was: `do-release-docker.sh` fails at `sphinx` installation to `Python 2.7`) > Upgrade pip to fix `sphinx` install error > - > > Key: SPARK-28468 > URL: https://issues.apache.org/jira/browse/SPARK-28468 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.4 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Blocker > Fix For: 2.4.4 > > > `do-release-docker.sh` fails at `sphinx` installation to `Python 2.7`. > {code} > $ dev/create-release/do-release-docker.sh -d /tmp/spark-2.4.4 -n > {code} > The following is the same reproducible step. > {code} > $ docker build -t spark-rm-test2 --build-arg UID=501 > dev/create-release/spark-rm > {code} > This happens in `branch-2.4` only. > {code} > root@4e196b3d7611:/# lsb_release -a > No LSB modules are available. > Distributor ID: Ubuntu > Description:Ubuntu 16.04.6 LTS > Release:16.04 > Codename: xenial > root@4e196b3d7611:/# pip install sphinx > Collecting sphinx > Downloading > https://files.pythonhosted.org/packages/89/1e/64c77163706556b647f99d67b42fced9d39ae6b1b86673965a2cd28037b5/Sphinx-2.1.2.tar.gz > (6.3MB) > 100% || 6.3MB 316kB/s > Complete output from command python setup.py egg_info: > ERROR: Sphinx requires at least Python 3.5 to run. > > Command "python setup.py egg_info" failed with error code 1 in > /tmp/pip-build-7usNN9/sphinx/ > You are using pip version 8.1.1, however version 19.1.1 is available. > You should consider upgrading via the 'pip install --upgrade pip' command. > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28475) Add regex MetricFilter to GraphiteSink
Nick Karpov created SPARK-28475: --- Summary: Add regex MetricFilter to GraphiteSink Key: SPARK-28475 URL: https://issues.apache.org/jira/browse/SPARK-28475 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.3 Reporter: Nick Karpov Today all registered metric sources are reported to GraphiteSink with no filtering mechanism, although the codahale project does support it. GraphiteReporter (ScheduledReporter) from the codahale project requires you implement and supply the MetricFilter interface (there is only a single implementation by default in the codahale project, MetricFilter.ALL). Propose to add an additional regex config to match and filter metrics to the GraphiteSink -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
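The proposal above is to pass a regex-driven filter where GraphiteSink currently hard-codes `MetricFilter.ALL`. A codahale `MetricFilter` decides per metric in its `matches(name, metric)` callback; the sketch below (hypothetical class name, no codahale dependency so it stays self-contained) shows the name-based regex decision such a filter would make.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Sketch of the proposed idea: keep only metrics whose registry name matches
// a user-supplied regex, the way a codahale MetricFilter implementation would
// decide in its matches(name, metric) callback. The Metric argument is omitted
// here because the decision is purely name-based.
public class RegexMetricFilter {
    private final Pattern pattern;

    public RegexMetricFilter(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    public boolean matches(String name) {
        return pattern.matcher(name).find();
    }

    public List<String> filter(List<String> metricNames) {
        return metricNames.stream().filter(this::matches).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        RegexMetricFilter f = new RegexMetricFilter("jvm\\.");
        // keeps only the jvm.* metric name
        System.out.println(f.filter(java.util.Arrays.asList(
            "app-1.driver.jvm.heap.used",
            "app-1.driver.BlockManager.disk.diskSpaceUsed_MB")));
    }
}
```

A real GraphiteSink integration would implement `com.codahale.metrics.MetricFilter` and hand it to the `GraphiteReporter` builder; the regex string itself would come from the sink's properties-file config.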
[jira] [Commented] (SPARK-28085) Spark Scala API documentation URLs not working properly in Chrome
[ https://issues.apache.org/jira/browse/SPARK-28085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890314#comment-16890314 ] Andrew Leverentz commented on SPARK-28085: -- This issue still remains, more than a month after the Chrome update that caused it. It's not clear whether Google considers it a bug that needs fixing. I've reported the issue to Google, as mentioned above, but if anyone else has a better way of contacting the Chrome team, I'd appreciate it if you could try to get in touch with them to see whether they are aware of this bug and planning to fix it. > Spark Scala API documentation URLs not working properly in Chrome > - > > Key: SPARK-28085 > URL: https://issues.apache.org/jira/browse/SPARK-28085 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.3 >Reporter: Andrew Leverentz >Priority: Minor > > In Chrome version 75, URLs in the Scala API documentation are not working > properly, which makes them difficult to bookmark. > For example, URLs like the following get redirected to a generic "root" > package page: > [https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html] > [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset] > Here's the URL that I get redirected to: > [https://spark.apache.org/docs/latest/api/scala/index.html#package] > This issue seems to have appeared between versions 74 and 75 of Chrome, but > the documentation URLs still work in Safari. I suspect that this has > something to do with security-related changes to how Chrome 75 handles frames > and/or redirects. I've reported this issue to the Chrome team via the > in-browser help menu, but I don't have any visibility into their response, so > it's not clear whether they'll consider this a bug or "working as intended". 
-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28225) Unexpected behavior for Window functions
[ https://issues.apache.org/jira/browse/SPARK-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Leverentz resolved SPARK-28225. -- Resolution: Not A Problem > Unexpected behavior for Window functions > > > Key: SPARK-28225 > URL: https://issues.apache.org/jira/browse/SPARK-28225 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Andrew Leverentz >Priority: Major > > I've noticed some odd behavior when combining the "first" aggregate function > with an ordered Window. > In particular, I'm working with columns created using the syntax > {code} > first($"y", ignoreNulls = true).over(Window.orderBy($"x")) > {code} > Below, I'm including some code which reproduces this issue in a Databricks > notebook. > *Code:* > {code:java} > import org.apache.spark.sql.functions.first > import org.apache.spark.sql.expressions.Window > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{StructType,StructField,IntegerType} > val schema = StructType(Seq( > StructField("x", IntegerType, false), > StructField("y", IntegerType, true), > StructField("z", IntegerType, true) > )) > val input = > spark.createDataFrame(sc.parallelize(Seq( > Row(101, null, 11), > Row(102, null, 12), > Row(103, null, 13), > Row(203, 24, null), > Row(201, 26, null), > Row(202, 25, null) > )), schema = schema) > input.show > val output = input > .withColumn("u1", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".asc_nulls_last))) > .withColumn("u2", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".asc))) > .withColumn("u3", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".desc_nulls_last))) > .withColumn("u4", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".desc))) > .withColumn("u5", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".asc_nulls_last))) > .withColumn("u6", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".asc))) > .withColumn("u7", first($"z", ignoreNulls = > 
true).over(Window.orderBy($"x".desc_nulls_last))) > .withColumn("u8", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".desc))) > output.show > {code} > *Expectation:* > Based on my understanding of how ordered-Window and aggregate functions work, > the results I expected to see were: > * u1 = u2 = constant value of 26 > * u3 = u4 = constant value of 24 > * u5 = u6 = constant value of 11 > * u7 = u8 = constant value of 13 > However, columns u1, u2, u7, and u8 contain some unexpected nulls. > *Results:* > {code:java} > +---+++++---+---+---+---+++ > | x| y| z| u1| u2| u3| u4| u5| u6| u7| u8| > +---+++++---+---+---+---+++ > |203| 24|null| 26| 26| 24| 24| 11| 11|null|null| > |202| 25|null| 26| 26| 24| 24| 11| 11|null|null| > |201| 26|null| 26| 26| 24| 24| 11| 11|null|null| > |103|null| 13|null|null| 24| 24| 11| 11| 13| 13| > |102|null| 12|null|null| 24| 24| 11| 11| 13| 13| > |101|null| 11|null|null| 24| 24| 11| 11| 13| 13| > +---+++++---+---+---+---+++ > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28225) Unexpected behavior for Window functions
[ https://issues.apache.org/jira/browse/SPARK-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890310#comment-16890310 ] Andrew Leverentz commented on SPARK-28225: -- Marco, thanks for the explanation. In this case, the workaround in Scala is to use {{Window.orderBy($"x").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)}} This issue can be marked resolved. > Unexpected behavior for Window functions > > > Key: SPARK-28225 > URL: https://issues.apache.org/jira/browse/SPARK-28225 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Andrew Leverentz >Priority: Major > > I've noticed some odd behavior when combining the "first" aggregate function > with an ordered Window. > In particular, I'm working with columns created using the syntax > {code} > first($"y", ignoreNulls = true).over(Window.orderBy($"x")) > {code} > Below, I'm including some code which reproduces this issue in a Databricks > notebook. 
> *Code:* > {code:java} > import org.apache.spark.sql.functions.first > import org.apache.spark.sql.expressions.Window > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{StructType,StructField,IntegerType} > val schema = StructType(Seq( > StructField("x", IntegerType, false), > StructField("y", IntegerType, true), > StructField("z", IntegerType, true) > )) > val input = > spark.createDataFrame(sc.parallelize(Seq( > Row(101, null, 11), > Row(102, null, 12), > Row(103, null, 13), > Row(203, 24, null), > Row(201, 26, null), > Row(202, 25, null) > )), schema = schema) > input.show > val output = input > .withColumn("u1", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".asc_nulls_last))) > .withColumn("u2", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".asc))) > .withColumn("u3", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".desc_nulls_last))) > .withColumn("u4", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".desc))) > .withColumn("u5", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".asc_nulls_last))) > .withColumn("u6", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".asc))) > .withColumn("u7", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".desc_nulls_last))) > .withColumn("u8", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".desc))) > output.show > {code} > *Expectation:* > Based on my understanding of how ordered-Window and aggregate functions work, > the results I expected to see were: > * u1 = u2 = constant value of 26 > * u3 = u4 = constant value of 24 > * u5 = u6 = constant value of 11 > * u7 = u8 = constant value of 13 > However, columns u1, u2, u7, and u8 contain some unexpected nulls. 
> *Results:* > {code:java} > +---+++++---+---+---+---+++ > | x| y| z| u1| u2| u3| u4| u5| u6| u7| u8| > +---+++++---+---+---+---+++ > |203| 24|null| 26| 26| 24| 24| 11| 11|null|null| > |202| 25|null| 26| 26| 24| 24| 11| 11|null|null| > |201| 26|null| 26| 26| 24| 24| 11| 11|null|null| > |103|null| 13|null|null| 24| 24| 11| 11| 13| 13| > |102|null| 12|null|null| 24| 24| 11| 11| 13| 13| > |101|null| 11|null|null| 24| 24| 11| 11| 13| 13| > +---+++++---+---+---+---+++ > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28225) Unexpected behavior for Window functions
[ https://issues.apache.org/jira/browse/SPARK-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890310#comment-16890310 ] Andrew Leverentz edited comment on SPARK-28225 at 7/22/19 4:54 PM: --- Marco, thanks for the explanation. In this case, the solution in Scala is to use {{Window.orderBy($"x").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)}} This issue can be marked resolved. was (Author: alev_etx): Marco, thanks for the explanation. In this case, the workaround in Scala is to use {{Window.orderBy($"x").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)}} This issue can be marked resolved. > Unexpected behavior for Window functions > > > Key: SPARK-28225 > URL: https://issues.apache.org/jira/browse/SPARK-28225 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Andrew Leverentz >Priority: Major > > I've noticed some odd behavior when combining the "first" aggregate function > with an ordered Window. > In particular, I'm working with columns created using the syntax > {code} > first($"y", ignoreNulls = true).over(Window.orderBy($"x")) > {code} > Below, I'm including some code which reproduces this issue in a Databricks > notebook. 
> *Code:* > {code:java} > import org.apache.spark.sql.functions.first > import org.apache.spark.sql.expressions.Window > import org.apache.spark.sql.Row > import org.apache.spark.sql.types.{StructType,StructField,IntegerType} > val schema = StructType(Seq( > StructField("x", IntegerType, false), > StructField("y", IntegerType, true), > StructField("z", IntegerType, true) > )) > val input = > spark.createDataFrame(sc.parallelize(Seq( > Row(101, null, 11), > Row(102, null, 12), > Row(103, null, 13), > Row(203, 24, null), > Row(201, 26, null), > Row(202, 25, null) > )), schema = schema) > input.show > val output = input > .withColumn("u1", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".asc_nulls_last))) > .withColumn("u2", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".asc))) > .withColumn("u3", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".desc_nulls_last))) > .withColumn("u4", first($"y", ignoreNulls = > true).over(Window.orderBy($"x".desc))) > .withColumn("u5", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".asc_nulls_last))) > .withColumn("u6", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".asc))) > .withColumn("u7", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".desc_nulls_last))) > .withColumn("u8", first($"z", ignoreNulls = > true).over(Window.orderBy($"x".desc))) > output.show > {code} > *Expectation:* > Based on my understanding of how ordered-Window and aggregate functions work, > the results I expected to see were: > * u1 = u2 = constant value of 26 > * u3 = u4 = constant value of 24 > * u5 = u6 = constant value of 11 > * u7 = u8 = constant value of 13 > However, columns u1, u2, u7, and u8 contain some unexpected nulls. 
> *Results:* > {code:java} > +---+++++---+---+---+---+++ > | x| y| z| u1| u2| u3| u4| u5| u6| u7| u8| > +---+++++---+---+---+---+++ > |203| 24|null| 26| 26| 24| 24| 11| 11|null|null| > |202| 25|null| 26| 26| 24| 24| 11| 11|null|null| > |201| 26|null| 26| 26| 24| 24| 11| 11|null|null| > |103|null| 13|null|null| 24| 24| 11| 11| 13| 13| > |102|null| 12|null|null| 24| 24| 11| 11| 13| 13| > |101|null| 11|null|null| 24| 24| 11| 11| 13| 13| > +---+++++---+---+---+---+++ > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
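The nulls in u1/u2 (and u7/u8) follow from the default window frame: with an ORDER BY and no explicit frame, the frame runs from the start of the partition only up to the current row, so `first(y, ignoreNulls = true)` sees nothing non-null for rows ordered before the first non-null y. Widening the frame to the whole partition, as the `rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)` workaround in the comment does, yields the constant the reporter expected. A plain-Java simulation (hypothetical helper name) over the ticket's y column, ordered by x ascending:

```java
import java.util.Arrays;
import java.util.List;

// Simulation of the two frames: first non-null value among rows[0..end].
// end = current row index models the default running frame; end = last index
// models the whole-partition frame of the rowsBetween workaround.
public class FrameDemo {
    static Integer firstNonNull(List<Integer> rows, int end) {
        for (int i = 0; i <= end; i++) {
            if (rows.get(i) != null) {
                return rows.get(i);
            }
        }
        return null; // no non-null value inside the frame
    }

    public static void main(String[] args) {
        // column y for x = 101,102,103,201,202,203 (ascending), per the ticket
        List<Integer> y = Arrays.asList(null, null, null, 26, 25, 24);
        System.out.println(firstNonNull(y, 0));            // default frame at first row: null
        System.out.println(firstNonNull(y, y.size() - 1)); // whole-partition frame: 26
    }
}
```

This matches the observed output: u1/u2 are null for x in {101, 102, 103} and 26 once the running frame reaches the first non-null y.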
[jira] [Updated] (SPARK-28474) Lower JDBC client cannot read binary type
[ https://issues.apache.org/jira/browse/SPARK-28474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28474: Summary: Lower JDBC client cannot read binary type (was: Lower JDBC client version cannot read binary type) > Lower JDBC client cannot read binary type > - > > Key: SPARK-28474 > URL: https://issues.apache.org/jira/browse/SPARK-28474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Logs: > {noformat} > java.lang.RuntimeException: java.lang.ClassCastException: [B incompatible > with java.lang.String > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) > at > java.security.AccessController.doPrivileged(AccessController.java:770) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) > at com.sun.proxy.$Proxy26.fetchResults(Unknown Source) > at > org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:455) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) > at > 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:819) > Caused by: java.lang.ClassCastException: [B incompatible with java.lang.String > at > org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:198) > at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60) > at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:148) > at > org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220) > at > org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:785) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) > ... 18 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
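In the stack trace above, `[B` is the JVM's runtime name for `byte[]`, so "`[B incompatible with java.lang.String`" means a raw `byte[]` column value reached `ColumnValue.toTColumnValue`, which casts to `String` for the row-based result set that older clients use. A minimal sketch (hypothetical method names, not the actual server code) of the failing cast and the kind of pre-conversion that avoids it:

```java
import java.nio.charset.StandardCharsets;

// Illustration of the failure above: a byte[] handed to code that blindly
// casts to String throws ClassCastException ("[B incompatible with
// java.lang.String"). Converting the bytes to a string form first avoids it.
public class BinaryCastDemo {
    static String unsafeRead(Object columnValue) {
        return (String) columnValue; // throws ClassCastException for byte[]
    }

    static String safeRead(Object columnValue) {
        if (columnValue instanceof byte[]) {
            return new String((byte[]) columnValue, StandardCharsets.UTF_8);
        }
        return (String) columnValue;
    }

    public static void main(String[] args) {
        Object binary = "spark".getBytes(StandardCharsets.UTF_8);
        System.out.println(binary.getClass().getName()); // prints "[B"
        System.out.println(safeRead(binary));            // prints "spark"
    }
}
```

Whether the real fix belongs on the server (converting binary columns before `addRow`) or requires a newer client is for the ticket to decide; the sketch only shows why the exception message names `[B`.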
[jira] [Commented] (SPARK-28457) curl: (60) SSL certificate problem: unable to get local issuer certificate More details here:
[ https://issues.apache.org/jira/browse/SPARK-28457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890281#comment-16890281 ] Xiao Li commented on SPARK-28457: - [~shaneknapp] Thanks for fixing it! > curl: (60) SSL certificate problem: unable to get local issuer certificate > More details here: > -- > > Key: SPARK-28457 > URL: https://issues.apache.org/jira/browse/SPARK-28457 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: shane knapp >Priority: Blocker > > > Build broke since this afternoon. > [spark-master-compile-maven-hadoop-2.7 #10224 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/10224/] > [spark-master-compile-maven-hadoop-3.2 #171 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/171/] > [spark-master-lint #10599 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/10599/] > > {code:java} > > > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.1/binaries/apache-maven-3.6.1-bin.tar.gz > curl: (60) SSL certificate problem: unable to get local issuer certificate > More details here: > https://curl.haxx.se/docs/sslcerts.html > curl performs SSL certificate verification by default, using a "bundle" > of Certificate Authority (CA) public keys (CA certs). If the default > bundle file isn't adequate, you can specify an alternate file > using the --cacert option. > If this HTTPS server uses a certificate signed by a CA represented in > the bundle, the certificate verification probably failed due to a > problem with the certificate (it might be expired, or the name might > not match the domain name in the URL). 
> If you'd like to turn off curl's verification of the certificate, use > the -k (or --insecure) option. > gzip: stdin: unexpected end of file > tar: Child returned status 1 > tar: Error is not recoverable: exiting now > Using `mvn` from path: > /home/jenkins/workspace/spark-master-compile-maven-hadoop-2.7/build/apache-maven-3.6.1/bin/mvn > build/mvn: line 163: > /home/jenkins/workspace/spark-master-compile-maven-hadoop-2.7/build/apache-maven-3.6.1/bin/mvn: > No such file or directory > Build step 'Execute shell' marked build as failure > Finished: FAILURE > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28457) curl: (60) SSL certificate problem: unable to get local issuer certificate More details here:
[ https://issues.apache.org/jira/browse/SPARK-28457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp reassigned SPARK-28457: --- Assignee: shane knapp > curl: (60) SSL certificate problem: unable to get local issuer certificate > More details here: > -- > > Key: SPARK-28457 > URL: https://issues.apache.org/jira/browse/SPARK-28457 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: shane knapp >Priority: Blocker > > > Build broke since this afternoon. > [spark-master-compile-maven-hadoop-2.7 #10224 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/10224/] > [spark-master-compile-maven-hadoop-3.2 #171 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/171/] > [spark-master-lint #10599 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/10599/] > > {code:java} > > > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.1/binaries/apache-maven-3.6.1-bin.tar.gz > curl: (60) SSL certificate problem: unable to get local issuer certificate > More details here: > https://curl.haxx.se/docs/sslcerts.html > curl performs SSL certificate verification by default, using a "bundle" > of Certificate Authority (CA) public keys (CA certs). If the default > bundle file isn't adequate, you can specify an alternate file > using the --cacert option. > If this HTTPS server uses a certificate signed by a CA represented in > the bundle, the certificate verification probably failed due to a > problem with the certificate (it might be expired, or the name might > not match the domain name in the URL). > If you'd like to turn off curl's verification of the certificate, use > the -k (or --insecure) option. 
> gzip: stdin: unexpected end of file > tar: Child returned status 1 > tar: Error is not recoverable: exiting now > Using `mvn` from path: > /home/jenkins/workspace/spark-master-compile-maven-hadoop-2.7/build/apache-maven-3.6.1/bin/mvn > build/mvn: line 163: > /home/jenkins/workspace/spark-master-compile-maven-hadoop-2.7/build/apache-maven-3.6.1/bin/mvn: > No such file or directory > Build step 'Execute shell' marked build as failure > Finished: FAILURE > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28457) curl: (60) SSL certificate problem: unable to get local issuer certificate More details here:
[ https://issues.apache.org/jira/browse/SPARK-28457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp resolved SPARK-28457. - Resolution: Fixed > curl: (60) SSL certificate problem: unable to get local issuer certificate > More details here: > -- > > Key: SPARK-28457 > URL: https://issues.apache.org/jira/browse/SPARK-28457 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: shane knapp >Priority: Blocker > > > Build broke since this afternoon. > [spark-master-compile-maven-hadoop-2.7 #10224 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/10224/] > [spark-master-compile-maven-hadoop-3.2 #171 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/171/] > [spark-master-lint #10599 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/10599/] > > {code:java} > > > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.1/binaries/apache-maven-3.6.1-bin.tar.gz > curl: (60) SSL certificate problem: unable to get local issuer certificate > More details here: > https://curl.haxx.se/docs/sslcerts.html > curl performs SSL certificate verification by default, using a "bundle" > of Certificate Authority (CA) public keys (CA certs). If the default > bundle file isn't adequate, you can specify an alternate file > using the --cacert option. > If this HTTPS server uses a certificate signed by a CA represented in > the bundle, the certificate verification probably failed due to a > problem with the certificate (it might be expired, or the name might > not match the domain name in the URL). > If you'd like to turn off curl's verification of the certificate, use > the -k (or --insecure) option. 
> gzip: stdin: unexpected end of file > tar: Child returned status 1 > tar: Error is not recoverable: exiting now > Using `mvn` from path: > /home/jenkins/workspace/spark-master-compile-maven-hadoop-2.7/build/apache-maven-3.6.1/bin/mvn > build/mvn: line 163: > /home/jenkins/workspace/spark-master-compile-maven-hadoop-2.7/build/apache-maven-3.6.1/bin/mvn: > No such file or directory > Build step 'Execute shell' marked build as failure > Finished: FAILURE > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28457) curl: (60) SSL certificate problem: unable to get local issuer certificate More details here:
[ https://issues.apache.org/jira/browse/SPARK-28457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890279#comment-16890279 ] shane knapp commented on SPARK-28457: - ok, the error i'm seeing in the lint job is most definitely not related to the SSL certs: [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/10613/console] {noformat} starting python compilation test... python compilation succeeded. downloading pycodestyle from https://raw.githubusercontent.com/PyCQA/pycodestyle/2.4.0/pycodestyle.py... starting pycodestyle test... pycodestyle checks failed: File "/home/jenkins/workspace/spark-master-lint/dev/pycodestyle-2.4.0.py", line 1 500: Internal Server Error ^ SyntaxError: invalid syntax{noformat} i went to PyCQA's repo on github and i'm seeing a LOT of 500 errors. this is out of scope of this ticket, and actually not a localized (to our jenkins) issue, so i will notify dev@ and mark this as resolved. > curl: (60) SSL certificate problem: unable to get local issuer certificate > More details here: > -- > > Key: SPARK-28457 > URL: https://issues.apache.org/jira/browse/SPARK-28457 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Blocker > > > Build broke since this afternoon. 
> [spark-master-compile-maven-hadoop-2.7 #10224 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/10224/] > [spark-master-compile-maven-hadoop-3.2 #171 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/171/] > [spark-master-lint #10599 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/10599/] > > {code:java} > > > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.1/binaries/apache-maven-3.6.1-bin.tar.gz > curl: (60) SSL certificate problem: unable to get local issuer certificate > More details here: > https://curl.haxx.se/docs/sslcerts.html > curl performs SSL certificate verification by default, using a "bundle" > of Certificate Authority (CA) public keys (CA certs). If the default > bundle file isn't adequate, you can specify an alternate file > using the --cacert option. > If this HTTPS server uses a certificate signed by a CA represented in > the bundle, the certificate verification probably failed due to a > problem with the certificate (it might be expired, or the name might > not match the domain name in the URL). > If you'd like to turn off curl's verification of the certificate, use > the -k (or --insecure) option. 
> gzip: stdin: unexpected end of file > tar: Child returned status 1 > tar: Error is not recoverable: exiting now > Using `mvn` from path: > /home/jenkins/workspace/spark-master-compile-maven-hadoop-2.7/build/apache-maven-3.6.1/bin/mvn > build/mvn: line 163: > /home/jenkins/workspace/spark-master-compile-maven-hadoop-2.7/build/apache-maven-3.6.1/bin/mvn: > No such file or directory > Build step 'Execute shell' marked build as failure > Finished: FAILURE > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28457) curl: (60) SSL certificate problem: unable to get local issuer certificate More details here:
[ https://issues.apache.org/jira/browse/SPARK-28457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890275#comment-16890275 ] shane knapp commented on SPARK-28457: - ok, curl was unhappy w/the old cacert.pem, so i updated to the latest from [https://curl.haxx.se/ca/cacert.pem] and things look to be better, tho the lint job is failing. once i get that sorted i will mark this as resolved. > curl: (60) SSL certificate problem: unable to get local issuer certificate > More details here: > -- > > Key: SPARK-28457 > URL: https://issues.apache.org/jira/browse/SPARK-28457 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Blocker > > > Build broke since this afternoon. > [spark-master-compile-maven-hadoop-2.7 #10224 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/10224/] > [spark-master-compile-maven-hadoop-3.2 #171 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/171/] > [spark-master-lint #10599 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/10599/] > > {code:java} > > > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.1/binaries/apache-maven-3.6.1-bin.tar.gz > curl: (60) SSL certificate problem: unable to get local issuer certificate > More details here: > https://curl.haxx.se/docs/sslcerts.html > curl performs SSL certificate verification by default, using a "bundle" > of Certificate Authority (CA) public keys (CA certs). If the default > bundle file isn't adequate, you can specify an alternate file > using the --cacert option. 
> If this HTTPS server uses a certificate signed by a CA represented in > the bundle, the certificate verification probably failed due to a > problem with the certificate (it might be expired, or the name might > not match the domain name in the URL). > If you'd like to turn off curl's verification of the certificate, use > the -k (or --insecure) option. > gzip: stdin: unexpected end of file > tar: Child returned status 1 > tar: Error is not recoverable: exiting now > Using `mvn` from path: > /home/jenkins/workspace/spark-master-compile-maven-hadoop-2.7/build/apache-maven-3.6.1/bin/mvn > build/mvn: line 163: > /home/jenkins/workspace/spark-master-compile-maven-hadoop-2.7/build/apache-maven-3.6.1/bin/mvn: > No such file or directory > Build step 'Execute shell' marked build as failure > Finished: FAILURE > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
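The failure mode quoted above has two parts: curl rejects the TLS handshake, and the partially written file then breaks tar ("gzip: stdin: unexpected end of file"). A minimal shell sketch of a guard a build script could use — the paths and the simulated bad download are illustrative, not Spark's actual build/mvn code:

```shell
# Sketch only: when TLS verification fails, the "tarball" on disk can be
# an error page, and tar then dies mid-stream.  Check gzip integrity
# before extracting.
set -u
TARBALL=$(mktemp)
# The real download step (elided) would look something like:
#   curl --cacert /path/to/cacert.pem -fsSL -o "$TARBALL" "$MAVEN_URL"
printf '500: Internal Server Error' > "$TARBALL"   # simulate a bad download
if gzip -t "$TARBALL" 2>/dev/null; then
  echo "tarball looks valid, extracting"
else
  echo "download corrupt, refusing to extract"
fi
rm -f "$TARBALL"
```

Refreshing the CA bundle (as shane did with cacert.pem) fixes the handshake itself; the gzip -t check merely keeps a bad download from masquerading as a Maven install.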
[jira] [Updated] (SPARK-28474) Lower JDBC client version cannot read binary type
[ https://issues.apache.org/jira/browse/SPARK-28474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28474: Summary: Lower JDBC client version cannot read binary type (was: Hive 0.12's JDBC client cannot read binary type) > Lower JDBC client version cannot read binary type > - > > Key: SPARK-28474 > URL: https://issues.apache.org/jira/browse/SPARK-28474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Logs: > {noformat} > java.lang.RuntimeException: java.lang.ClassCastException: [B incompatible > with java.lang.String > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) > at > java.security.AccessController.doPrivileged(AccessController.java:770) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) > at com.sun.proxy.$Proxy26.fetchResults(Unknown Source) > at > org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:455) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) > at > 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:819) > Caused by: java.lang.ClassCastException: [B incompatible with java.lang.String > at > org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:198) > at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60) > at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:148) > at > org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220) > at > org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:785) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) > ... 18 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28457) curl: (60) SSL certificate problem: unable to get local issuer certificate More details here:
[ https://issues.apache.org/jira/browse/SPARK-28457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890255#comment-16890255 ] shane knapp commented on SPARK-28457: - looking in to it now. > curl: (60) SSL certificate problem: unable to get local issuer certificate > More details here: > -- > > Key: SPARK-28457 > URL: https://issues.apache.org/jira/browse/SPARK-28457 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Blocker > > > Build broke since this afternoon. > [spark-master-compile-maven-hadoop-2.7 #10224 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/10224/] > [spark-master-compile-maven-hadoop-3.2 #171 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/171/] > [spark-master-lint #10599 (broken since this > build)|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/10599/] > > {code:java} > > > https://www.apache.org/dyn/closer.lua?action=download&filename=/maven/maven-3/3.6.1/binaries/apache-maven-3.6.1-bin.tar.gz > curl: (60) SSL certificate problem: unable to get local issuer certificate > More details here: > https://curl.haxx.se/docs/sslcerts.html > curl performs SSL certificate verification by default, using a "bundle" > of Certificate Authority (CA) public keys (CA certs). If the default > bundle file isn't adequate, you can specify an alternate file > using the --cacert option. > If this HTTPS server uses a certificate signed by a CA represented in > the bundle, the certificate verification probably failed due to a > problem with the certificate (it might be expired, or the name might > not match the domain name in the URL). > If you'd like to turn off curl's verification of the certificate, use > the -k (or --insecure) option. 
> gzip: stdin: unexpected end of file > tar: Child returned status 1 > tar: Error is not recoverable: exiting now > Using `mvn` from path: > /home/jenkins/workspace/spark-master-compile-maven-hadoop-2.7/build/apache-maven-3.6.1/bin/mvn > build/mvn: line 163: > /home/jenkins/workspace/spark-master-compile-maven-hadoop-2.7/build/apache-maven-3.6.1/bin/mvn: > No such file or directory > Build step 'Execute shell' marked build as failure > Finished: FAILURE > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28474) Hive 0.12's JDBC client cannot read binary type
Yuming Wang created SPARK-28474: --- Summary: Hive 0.12's JDBC client cannot read binary type Key: SPARK-28474 URL: https://issues.apache.org/jira/browse/SPARK-28474 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang Logs: {noformat} java.lang.RuntimeException: java.lang.ClassCastException: [B incompatible with java.lang.String at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) at java.security.AccessController.doPrivileged(AccessController.java:770) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) at com.sun.proxy.$Proxy26.fetchResults(Unknown Source) at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:455) at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621) at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:819) Caused by: java.lang.ClassCastException: [B incompatible with java.lang.String at 
org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:198) at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60) at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:148) at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220) at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:785) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78) ... 18 more {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28473) Build command in README should start with ./
Douglas Colkitt created SPARK-28473: --- Summary: Build command in README should start with ./ Key: SPARK-28473 URL: https://issues.apache.org/jira/browse/SPARK-28473 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 2.4.3 Reporter: Douglas Colkitt In the top-level README, the build command does *not* begin with a ./ prefix: build/mvn -DskipTests clean package All the other commands in the README begin with a ./ prefix, e.g. ./bin/spark-shell To be consistent, the build command should be changed to match the style of the other commands in the README: ./build/mvn -DskipTests clean package Although the non-prefixed command still works, the ./ prefix makes it clear that the command depends on being executed from inside the repository as the CWD. It's a minor change, but it makes things less confusing for new users. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28280) Convert and port 'group-by.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28280. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25098 [https://github.com/apache/spark/pull/25098] > Convert and port 'group-by.sql' into UDF test base > -- > > Key: SPARK-28280 > URL: https://issues.apache.org/jira/browse/SPARK-28280 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Stavros Kontopoulos >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28280) Convert and port 'group-by.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28280: Assignee: Stavros Kontopoulos > Convert and port 'group-by.sql' into UDF test base > -- > > Key: SPARK-28280 > URL: https://issues.apache.org/jira/browse/SPARK-28280 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Stavros Kontopoulos >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28451) substr returns different values
[ https://issues.apache.org/jira/browse/SPARK-28451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890099#comment-16890099 ] Hyukjin Kwon edited comment on SPARK-28451 at 7/22/19 11:32 AM: Personally I don't think it's worth it, but let's see what other committers think. was (Author: hyukjin.kwon): Personally I don't think it'w worth but let's see what other committers like. > substr returns different values > --- > > Key: SPARK-28451 > URL: https://issues.apache.org/jira/browse/SPARK-28451 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > PostgreSQL: > {noformat} > postgres=# select substr('1234567890', -1, 5); > substr > > 123 > (1 row) > postgres=# select substr('1234567890', 1, -1); > ERROR: negative substring length not allowed > {noformat} > Spark SQL: > {noformat} > spark-sql> select substr('1234567890', -1, 5); > 0 > spark-sql> select substr('1234567890', 1, -1); > spark-sql> > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28451) substr returns different values
[ https://issues.apache.org/jira/browse/SPARK-28451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890099#comment-16890099 ] Hyukjin Kwon commented on SPARK-28451: -- Personally I don't think it's worth it, but let's see what other committers think. > substr returns different values > --- > > Key: SPARK-28451 > URL: https://issues.apache.org/jira/browse/SPARK-28451 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > PostgreSQL: > {noformat} > postgres=# select substr('1234567890', -1, 5); > substr > > 123 > (1 row) > postgres=# select substr('1234567890', 1, -1); > ERROR: negative substring length not allowed > {noformat} > Spark SQL: > {noformat} > spark-sql> select substr('1234567890', -1, 5); > 0 > spark-sql> select substr('1234567890', 1, -1); > spark-sql> > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28451) substr returns different values
[ https://issues.apache.org/jira/browse/SPARK-28451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890084#comment-16890084 ] Shivu Sondur commented on SPARK-28451: -- [~hyukjin.kwon], [~dongjoon] Here is one more PostgreSQL compatibility issue. Is it required to handle it? After checking, I found the following: > Spark's behavior is the same as *Oracle* and *MySQL*, > while *MS SQL*'s behavior is the same as *PostgreSQL*. I think we should have a global setting like a postgresql_Flavor or sql_Flavor parameter; if it is set to a given flavor, all functions should behave according to that database flavor. > substr returns different values > --- > > Key: SPARK-28451 > URL: https://issues.apache.org/jira/browse/SPARK-28451 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > PostgreSQL: > {noformat} > postgres=# select substr('1234567890', -1, 5); > substr > > 123 > (1 row) > postgres=# select substr('1234567890', 1, -1); > ERROR: negative substring length not allowed > {noformat} > Spark SQL: > {noformat} > spark-sql> select substr('1234567890', -1, 5); > 0 > spark-sql> select substr('1234567890', 1, -1); > spark-sql> > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
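To make the behavioral split concrete, here is a hedged Python re-implementation of the two substr conventions shown in the ticket's examples — illustrative only, not Spark's or PostgreSQL's actual code:

```python
def substr_spark_like(s: str, pos: int, length: int) -> str:
    # Sketch of the Spark/Oracle/MySQL-style behavior from the ticket:
    # 1-based positions, negative pos counts from the end, and a
    # non-positive length yields '' instead of raising an error.
    if length <= 0:
        return ""
    start = len(s) + pos if pos < 0 else max(pos - 1, 0)
    if start < 0:  # pos points before the start of the string
        return ""
    return s[start:start + length]


def substr_pg_like(s: str, pos: int, length: int) -> str:
    # Sketch of the PostgreSQL-style behavior: the 1-based window
    # [pos, pos + length) is clipped to the string, and a negative
    # length is an error.
    if length < 0:
        raise ValueError("negative substring length not allowed")
    start = max(pos, 1)   # first 1-based position inside the string
    end = pos + length    # one past the last 1-based position
    return s[start - 1:max(end - 1, start - 1)]
```

With these, substr_spark_like('1234567890', -1, 5) gives '0' while substr_pg_like('1234567890', -1, 5) gives '123', matching the outputs quoted in the issue.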
[jira] [Commented] (SPARK-22213) Spark to detect slow executors on nodes with problematic hardware
[ https://issues.apache.org/jira/browse/SPARK-22213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890081#comment-16890081 ] Yuri Ronin commented on SPARK-22213: [~hyukjin.kwon] thanks > Spark to detect slow executors on nodes with problematic hardware > - > > Key: SPARK-22213 > URL: https://issues.apache.org/jira/browse/SPARK-22213 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.0.0 > Environment: - AWS EMR clusters > - window time is 60s > - several millions of events processed per minute >Reporter: Oleksandr Konopko >Priority: Major > Labels: bulk-closed > > Sometimes when a new cluster is created it contains 1-2 slow nodes. When the > average task finishes in 5 seconds, it takes up to 50 seconds to finish on a > slow node. As a result, batch processing time increases by 45s. > To avoid that we could use the `speculation` feature, but it seems that > it can be improved: > > - The 1st issue with `speculation` is that we do not want to use `speculation` on > all tasks, since we have tens of thousands of them during processing of one > batch; spawning several thousand extra would not be resource-efficient. I > suggest creating a new parameter, `spark.speculation.mintime`, which would > specify the minimal task run time for speculation to be enabled for that task. > - The 2nd issue is that even if Spark spawns speculative tasks only for > long-running ones (longer than 10s, for example), a task on a slow node will still > run for some significant time before it is killed, which still makes batch > processing time bigger than it should be. The solution is to enable > `blacklisting` for slow nodes. With speculation and blacklisting combined, > only the first 1-2 batches would take more time than expected. After the faulty node > is blacklisted, batch processing time is as expected. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
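For reference, the proposal in the quoted description can be pictured as a spark-defaults.conf fragment. The first five keys are existing Spark settings; `spark.speculation.mintime` is only the ticket's suggestion and does not exist as of this writing:

```properties
# Existing speculation/blacklisting knobs (Spark 2.x):
spark.speculation            true    # re-launch suspiciously slow tasks
spark.speculation.interval   100ms   # how often to check for stragglers
spark.speculation.multiplier 1.5     # "slow" = 1.5x the median task time
spark.speculation.quantile   0.75    # fraction of tasks done before checking
spark.blacklist.enabled      true    # stop scheduling onto misbehaving nodes
# Proposed in this ticket (hypothetical, not an existing setting):
# spark.speculation.mintime  10s     # only speculate tasks running >= 10s
```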
[jira] [Created] (SPARK-28472) Add a test for testing different protocol versions
Yuming Wang created SPARK-28472: --- Summary: Add a test for testing different protocol versions Key: SPARK-28472 URL: https://issues.apache.org/jira/browse/SPARK-28472 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22213) Spark to detect slow executors on nodes with problematic hardware
[ https://issues.apache.org/jira/browse/SPARK-22213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890079#comment-16890079 ] Hyukjin Kwon commented on SPARK-22213: -- It was closed because the affected versions indicated EOL releases. > Spark to detect slow executors on nodes with problematic hardware > - > > Key: SPARK-22213 > URL: https://issues.apache.org/jira/browse/SPARK-22213 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.0.0 > Environment: - AWS EMR clusters > - window time is 60s > - several millions of events processed per minute >Reporter: Oleksandr Konopko >Priority: Major > Labels: bulk-closed > > Sometimes when a new cluster is created it contains 1-2 slow nodes. When the > average task finishes in 5 seconds, it takes up to 50 seconds to finish on a > slow node. As a result, batch processing time increases by 45s. > To avoid that we could use the `speculation` feature, but it seems that > it can be improved: > > - The 1st issue with `speculation` is that we do not want to use `speculation` on > all tasks, since we have tens of thousands of them during processing of one > batch; spawning several thousand extra would not be resource-efficient. I > suggest creating a new parameter, `spark.speculation.mintime`, which would > specify the minimal task run time for speculation to be enabled for that task. > - The 2nd issue is that even if Spark spawns speculative tasks only for > long-running ones (longer than 10s, for example), a task on a slow node will still > run for some significant time before it is killed, which still makes batch > processing time bigger than it should be. The solution is to enable > `blacklisting` for slow nodes. With speculation and blacklisting combined, > only the first 1-2 batches would take more time than expected. After the faulty node > is blacklisted, batch processing time is as expected. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22213) Spark to detect slow executors on nodes with problematic hardware
[ https://issues.apache.org/jira/browse/SPARK-22213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890048#comment-16890048 ] Yuri Ronin commented on SPARK-22213: [~hyukjin.kwon], why did you close it? Is it resolved? Can you please provide a PR? Thanks > Spark to detect slow executors on nodes with problematic hardware > - > > Key: SPARK-22213 > URL: https://issues.apache.org/jira/browse/SPARK-22213 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.0.0 > Environment: - AWS EMR clusters > - window time is 60s > - several millions of events processed per minute >Reporter: Oleksandr Konopko >Priority: Major > Labels: bulk-closed > > Sometimes when a new cluster is created it contains 1-2 slow nodes. When the > average task finishes in 5 seconds, it takes up to 50 seconds to finish on a > slow node. As a result, batch processing time increases by 45s. > To avoid that we could use the `speculation` feature, but it seems that > it can be improved: > > - The 1st issue with `speculation` is that we do not want to use `speculation` on > all tasks, since we have tens of thousands of them during processing of one > batch; spawning several thousand extra would not be resource-efficient. I > suggest creating a new parameter, `spark.speculation.mintime`, which would > specify the minimal task run time for speculation to be enabled for that task. > - The 2nd issue is that even if Spark spawns speculative tasks only for > long-running ones (longer than 10s, for example), a task on a slow node will still > run for some significant time before it is killed, which still makes batch > processing time bigger than it should be. The solution is to enable > `blacklisting` for slow nodes. With speculation and blacklisting combined, > only the first 1-2 batches would take more time than expected. After the faulty node > is blacklisted, batch processing time is as expected. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28451) substr returns different values
[ https://issues.apache.org/jira/browse/SPARK-28451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890026#comment-16890026 ] Shivu Sondur commented on SPARK-28451: -- I will check this issue. > substr returns different values > --- > > Key: SPARK-28451 > URL: https://issues.apache.org/jira/browse/SPARK-28451 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > PostgreSQL: > {noformat} > postgres=# select substr('1234567890', -1, 5); > substr > > 123 > (1 row) > postgres=# select substr('1234567890', 1, -1); > ERROR: negative substring length not allowed > {noformat} > Spark SQL: > {noformat} > spark-sql> select substr('1234567890', -1, 5); > 0 > spark-sql> select substr('1234567890', 1, -1); > spark-sql> > {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28471) Formatting dates with negative years
[ https://issues.apache.org/jira/browse/SPARK-28471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890024#comment-16890024 ] Shivu Sondur commented on SPARK-28471: -- [~maxgekk] According to your discussion link, it is not required to change any code for this issue, right? > Formatting dates with negative years > > > Key: SPARK-28471 > URL: https://issues.apache.org/jira/browse/SPARK-28471 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.3 >Reporter: Maxim Gekk >Priority: Minor > > While converting dates with negative years to strings, Spark skips the era > sub-field by default. That can confuse users, since years from the BC era are > mirrored into the current era. For example: > {code} > spark-sql> select make_date(-44, 3, 15); > 0045-03-15 > {code} > Even though negative years are outside the range supported by the DATE type, it would be > nice to indicate the era for such dates. > PostgreSQL outputs the era for such inputs: > {code} > # select make_date(-44, 3, 15); >make_date > --- > 0044-03-15 BC > (1 row) > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
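A small Python sketch of the year-of-era mapping behind both outputs in the ticket (a hypothetical helper, not Spark's or PostgreSQL's formatter):

```python
def era_label(iso_year: int) -> str:
    # In the proleptic ISO calendar, year 1 is 1 AD and year 0 is 1 BC,
    # so ISO year -44 is year-of-era 45 BC.  Printing the year-of-era
    # WITHOUT the era marker is what makes BC dates look mirrored into
    # the current era ("0045-..." for ISO year -44).
    if iso_year >= 1:
        return f"{iso_year:04d}"        # era omitted, as Spark does today
    return f"{1 - iso_year:04d} BC"     # proposed: append the era marker
```

Note the two systems also seem to read the input differently: PostgreSQL appears to treat make_date(-44, ...) as the BC year directly (hence 0044-03-15 BC), while Spark treats -44 as a proleptic ISO year, whose year-of-era is 45.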
[jira] [Updated] (SPARK-28467) Tests failed if there are not enough executors up before running
[ https://issues.apache.org/jira/browse/SPARK-28467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28467: -- Component/s: (was: Spark Core) Tests > Tests failed if there are not enough executors up before running > > > Key: SPARK-28467 > URL: https://issues.apache.org/jira/browse/SPARK-28467 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: huangtianhua >Priority: Minor > > We ran unit tests on an arm64 instance, and there are tests that failed because the > executors can't come up within the timeout of 1 ms: > - test driver discovery under local-cluster mode *** FAILED *** > java.util.concurrent.TimeoutException: Can't find 1 executors before 1 > milliseconds elapsed > at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:293) > at > org.apache.spark.SparkContextSuite.$anonfun$new$78(SparkContextSuite.scala:753) > at > org.apache.spark.SparkContextSuite.$anonfun$new$78$adapted(SparkContextSuite.scala:741) > at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161) > at > org.apache.spark.SparkContextSuite.$anonfun$new$77(SparkContextSuite.scala:741) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > > - test gpu driver resource files and discovery under local-cluster mode *** > FAILED *** > java.util.concurrent.TimeoutException: Can't find 1 executors before 1 > milliseconds elapsed > at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:293) > at > org.apache.spark.SparkContextSuite.$anonfun$new$80(SparkContextSuite.scala:781) > at > org.apache.spark.SparkContextSuite.$anonfun$new$80$adapted(SparkContextSuite.scala:761) > at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161) > at > org.apache.spark.SparkContextSuite.$anonfun$new$79(SparkContextSuite.scala:761) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > When we increased the timeout to 2 (or 3), the tests passed. I found > there were other issues about increasing the timeout before; see: > https://issues.apache.org/jira/browse/SPARK-7989 and > https://issues.apache.org/jira/browse/SPARK-10651 > I think the timeout doesn't work well, and there seems to be no principle behind the > timeout setting. How can I fix this? Could I increase the timeout for these > two tests? -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
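The failing waits above follow a generic poll-until-deadline pattern; this Python sketch (not the actual Scala TestUtils.waitUntilExecutorsUp code) shows why the fixed deadline is the only tuning knob, and why slower hardware simply needs a larger value:

```python
import time

def wait_until(predicate, timeout_s, poll_s=0.1):
    # Keep checking the condition until it holds or the deadline passes.
    # There is no adaptive behavior: hardware that brings executors up
    # slowly can only be accommodated by raising timeout_s.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_s)
    raise TimeoutError(f"condition not met within {timeout_s}s")
```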
[jira] [Commented] (SPARK-28467) Tests failed if there are not enough executors up before running
[ https://issues.apache.org/jira/browse/SPARK-28467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889995#comment-16889995 ]

Dongjoon Hyun commented on SPARK-28467:
---------------------------------------

I tested on `a1.4xlarge` and I cannot reproduce the failure. I'd recommend using more powerful machines like `a1.4xlarge` for testing.

> Tests failed if there are not enough executors up before running
> ----------------------------------------------------------------
>
>                 Key: SPARK-28467
>                 URL: https://issues.apache.org/jira/browse/SPARK-28467
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: huangtianhua
>            Priority: Minor
>
> We ran the unit tests on an arm64 instance, and some tests failed because the executors could not come up within the timeout of 1 ms:
> - test driver discovery under local-cluster mode *** FAILED ***
>   java.util.concurrent.TimeoutException: Can't find 1 executors before 1 milliseconds elapsed
>   at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:293)
>   at org.apache.spark.SparkContextSuite.$anonfun$new$78(SparkContextSuite.scala:753)
>   at org.apache.spark.SparkContextSuite.$anonfun$new$78$adapted(SparkContextSuite.scala:741)
>   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
>   at org.apache.spark.SparkContextSuite.$anonfun$new$77(SparkContextSuite.scala:741)
>   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>
> - test gpu driver resource files and discovery under local-cluster mode *** FAILED ***
>   java.util.concurrent.TimeoutException: Can't find 1 executors before 1 milliseconds elapsed
>   at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:293)
>   at org.apache.spark.SparkContextSuite.$anonfun$new$80(SparkContextSuite.scala:781)
>   at org.apache.spark.SparkContextSuite.$anonfun$new$80$adapted(SparkContextSuite.scala:761)
>   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
>   at org.apache.spark.SparkContextSuite.$anonfun$new$79(SparkContextSuite.scala:761)
>   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>
> After we increased the timeout to 2 (or 3), the tests passed. I found earlier issues about increasing this timeout; see https://issues.apache.org/jira/browse/SPARK-7989 and https://issues.apache.org/jira/browse/SPARK-10651.
> I think the timeout does not work well, and there seems to be no principle behind the timeout setting. How can I fix this? Could I increase the timeout for these two tests?
[jira] [Commented] (SPARK-28471) Formatting dates with negative years
[ https://issues.apache.org/jira/browse/SPARK-28471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889981#comment-16889981 ]

Maxim Gekk commented on SPARK-28471:
------------------------------------

Here is my explanation of the difference in years between Spark's and PostgreSQL's outputs: [https://github.com/apache/spark/pull/25210#discussion_r305609274]

> Formatting dates with negative years
> ------------------------------------
>
>                 Key: SPARK-28471
>                 URL: https://issues.apache.org/jira/browse/SPARK-28471
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.3
>            Reporter: Maxim Gekk
>            Priority: Minor
>
> While converting dates with negative years to strings, Spark skips the era sub-field by default. That can confuse users, since years from the BC era are mirrored into the current era. For example:
> {code}
> spark-sql> select make_date(-44, 3, 15);
> 0045-03-15
> {code}
> Even though negative years are out of the supported range of the DATE type, it would be nice to indicate the era for such dates. PostgreSQL outputs the era for such inputs:
> {code}
> # select make_date(-44, 3, 15);
>    make_date
> ---------------
>  0044-03-15 BC
> (1 row)
> {code}
[jira] [Commented] (SPARK-28470) Honor spark.sql.decimalOperations.nullOnOverflow in Cast
[ https://issues.apache.org/jira/browse/SPARK-28470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889979#comment-16889979 ]

Marco Gaido commented on SPARK-28470:
-------------------------------------

Thanks for checking this, Wenchen! I will work on this ASAP.

> Honor spark.sql.decimalOperations.nullOnOverflow in Cast
> --------------------------------------------------------
>
>                 Key: SPARK-28470
>                 URL: https://issues.apache.org/jira/browse/SPARK-28470
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> Casting long to decimal or decimal to decimal can overflow; we should respect the new config if overflow happens.
[jira] [Created] (SPARK-28471) Formatting dates with negative years
Maxim Gekk created SPARK-28471:
-------------------------------

             Summary: Formatting dates with negative years
                 Key: SPARK-28471
                 URL: https://issues.apache.org/jira/browse/SPARK-28471
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 2.4.3
            Reporter: Maxim Gekk

While converting dates with negative years to strings, Spark skips the era sub-field by default. That can confuse users, since years from the BC era are mirrored into the current era. For example:

{code}
spark-sql> select make_date(-44, 3, 15);
0045-03-15
{code}

Even though negative years are out of the supported range of the DATE type, it would be nice to indicate the era for such dates. PostgreSQL outputs the era for such inputs:

{code}
# select make_date(-44, 3, 15);
   make_date
---------------
 0044-03-15 BC
(1 row)
{code}
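The mirroring described above can be reproduced directly with `java.time`, which Spark 3.0's datetime formatting is based on: in the proleptic ISO calendar, year -44 is 45 BC, so a year-of-era pattern (`yyyy`) without an era field (`G`) prints `0045` with no hint of the era. A small illustration (plain `java.time`, not Spark code):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class EraDemo {
    // Proleptic ISO year -44 corresponds to 45 BC.
    static final LocalDate D = LocalDate.of(-44, 3, 15);

    // Year-of-era ('yyyy') without an era field: the era is silently dropped.
    static String noEra() {
        return D.format(DateTimeFormatter.ofPattern("yyyy-MM-dd", Locale.ENGLISH));
    }

    // Appending 'GG' makes the era explicit, matching PostgreSQL's output.
    static String withEra() {
        return D.format(DateTimeFormatter.ofPattern("yyyy-MM-dd GG", Locale.ENGLISH));
    }

    // The signed proleptic year ('uuuu') is the unambiguous alternative.
    static String signed() {
        return D.format(DateTimeFormatter.ofPattern("uuuu-MM-dd", Locale.ENGLISH));
    }

    public static void main(String[] args) {
        System.out.println(noEra());   // 0045-03-15
        System.out.println(withEra()); // 0045-03-15 BC
        System.out.println(signed());  // -0044-03-15
    }
}
```

So the proposal amounts to choosing between an explicit era field and a signed year when the date falls before year 1.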
[jira] [Created] (SPARK-28470) Honor spark.sql.decimalOperations.nullOnOverflow in Cast
Wenchen Fan created SPARK-28470:
--------------------------------

             Summary: Honor spark.sql.decimalOperations.nullOnOverflow in Cast
                 Key: SPARK-28470
                 URL: https://issues.apache.org/jira/browse/SPARK-28470
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Wenchen Fan
[jira] [Updated] (SPARK-28470) Honor spark.sql.decimalOperations.nullOnOverflow in Cast
[ https://issues.apache.org/jira/browse/SPARK-28470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-28470:
--------------------------------
    Description: Casting long to decimal or decimal to decimal can overflow; we should respect the new config if overflow happens.

> Honor spark.sql.decimalOperations.nullOnOverflow in Cast
> --------------------------------------------------------
>
>                 Key: SPARK-28470
>                 URL: https://issues.apache.org/jira/browse/SPARK-28470
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> Casting long to decimal or decimal to decimal can overflow; we should respect the new config if overflow happens.
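A value overflows a target `DECIMAL(precision, scale)` when it needs more total digits than the precision allows, and the config decides between returning NULL and raising an error. A sketch of the intended semantics for the long-to-decimal case (illustrative code modeling NULL as `Optional.empty()`; not Spark's actual `Cast` implementation):

```java
import java.math.BigDecimal;
import java.util.Optional;

public class DecimalCast {
    // Illustrative sketch of overflow-aware casting of a long to
    // DECIMAL(precision, scale); not Spark's actual Cast implementation.
    public static Optional<BigDecimal> castLongToDecimal(
            long v, int precision, int scale, boolean nullOnOverflow) {
        // Rescaling to the target scale never loses digits, only adds them.
        BigDecimal d = BigDecimal.valueOf(v).setScale(scale);
        if (d.precision() <= precision) {
            return Optional.of(d);       // fits the target type
        }
        if (nullOnOverflow) {
            return Optional.empty();     // overflow -> NULL
        }
        throw new ArithmeticException(
            v + " cannot be represented as Decimal(" + precision + ", " + scale + ")");
    }
}
```

For example, 123 fits DECIMAL(3,0), while 12345 overflows it: with `nullOnOverflow` the cast yields NULL, and without it the cast fails.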
[jira] [Commented] (SPARK-28470) Honor spark.sql.decimalOperations.nullOnOverflow in Cast
[ https://issues.apache.org/jira/browse/SPARK-28470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889975#comment-16889975 ]

Wenchen Fan commented on SPARK-28470:
-------------------------------------

cc [~mgaido]

> Honor spark.sql.decimalOperations.nullOnOverflow in Cast
> --------------------------------------------------------
>
>                 Key: SPARK-28470
>                 URL: https://issues.apache.org/jira/browse/SPARK-28470
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> Casting long to decimal or decimal to decimal can overflow; we should respect the new config if overflow happens.