[jira] [Updated] (SPARK-26260) Task Summary Metrics for Stage Page: Efficient implementation for SHS when using disk store.
[ https://issues.apache.org/jira/browse/SPARK-26260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shahid updated SPARK-26260:
---------------------------
    Summary: Task Summary Metrics for Stage Page: Efficient implementation for SHS when using disk store.  (was: Summary Task Metrics for Stage Page: Efficient implementation for SHS when using disk store.)

> Task Summary Metrics for Stage Page: Efficient implementation for SHS when
> using disk store.
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-26260
>                 URL: https://issues.apache.org/jira/browse/SPARK-26260
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: shahid
>            Priority: Major
>
> Currently, task summary metrics are calculated over all tasks instead of
> only the successful ones.
> Since SPARK-26119 (https://issues.apache.org/jira/browse/SPARK-26119), the
> in-memory store computes task summary metrics over only the successful
> tasks. We still need an efficient implementation for the disk store case in
> the SHS; the main bottleneck for the disk store is deserialization overhead.
> Hints: Rework the way indexing works so that we can index successful and
> failed tasks by specific metrics separately (would be tricky). This would
> also require bumping the disk store version (to invalidate old stores).
> Or any other efficient solution.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
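The efficiency problem above is concrete: summary quantiles are taken over successful tasks only, so a store that can serve pre-sorted metric values per task status avoids deserializing every task just to filter by status. A minimal sketch of the quantile step; `TaskData` and its fields are illustrative stand-ins, not Spark's real classes:

```scala
// Illustrative model of one task row (NOT Spark's TaskData).
case class TaskData(taskId: Long, status: String, duration: Long)

// Summary quantiles over successful tasks only. With a disk store, the
// filter + sort below is the expensive part (every task must be read and
// deserialized), which is why an index keyed by (status, metric) helps:
// the store could hand back the sorted successful-task values directly.
def summaryQuantiles(tasks: Seq[TaskData], quantiles: Seq[Double]): Seq[Long] = {
  val sorted = tasks.filter(_.status == "SUCCESS").map(_.duration).sorted
  quantiles.map { q =>
    // clamp the index so q = 1.0 maps to the last element
    val idx = math.min((q * sorted.length).toInt, sorted.length - 1)
    sorted(idx)
  }
}
```

With a per-(status, metric) index, the `filter`/`sorted` work would already be done by the store, leaving only index reads at quantile positions.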
[jira] [Updated] (SPARK-26260) Summary Task Metrics for Stage Page: Efficient implementation for SHS when using disk store.
[ https://issues.apache.org/jira/browse/SPARK-26260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shahid updated SPARK-26260:
---------------------------
    Summary: Summary Task Metrics for Stage Page: Efficient implementation for SHS when using disk store.  (was: Summary Task Metrics for Stage Page: Efficient implimentation for SHS when using disk store.)

> Summary Task Metrics for Stage Page: Efficient implementation for SHS when
> using disk store.
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-26260
>                 URL: https://issues.apache.org/jira/browse/SPARK-26260
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: shahid
>            Priority: Major
>
> Currently, task summary metrics are calculated over all tasks instead of
> only the successful ones.
> Since SPARK-26119 (https://issues.apache.org/jira/browse/SPARK-26119), the
> in-memory store computes task summary metrics over only the successful
> tasks. We still need an efficient implementation for the disk store case in
> the SHS; the main bottleneck for the disk store is deserialization overhead.
> Hints: Rework the way indexing works so that we can index successful and
> failed tasks by specific metrics separately (would be tricky). This would
> also require bumping the disk store version (to invalidate old stores).
> Or any other efficient solution.
[jira] [Resolved] (SPARK-25573) Combine resolveExpression and resolve in the rule ResolveReferences
[ https://issues.apache.org/jira/browse/SPARK-25573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-25573.
-----------------------------
    Resolution: Fixed
      Assignee: Dilip Biswal
 Fix Version/s: 3.0.0

> Combine resolveExpression and resolve in the rule ResolveReferences
> -------------------------------------------------------------------
>
>                 Key: SPARK-25573
>                 URL: https://issues.apache.org/jira/browse/SPARK-25573
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Xiao Li
>            Assignee: Dilip Biswal
>            Priority: Major
>             Fix For: 3.0.0
>
> In the rule ResolveReferences, the two private functions `resolve` and
> `resolveExpression` should be combined.
[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708243#comment-16708243 ]

Apache Spark commented on SPARK-26155:
--------------------------------------

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/23214

> Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS
> in 3TB scale
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-26155
>                 URL: https://issues.apache.org/jira/browse/SPARK-26155
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>            Reporter: Ke Jia
>            Priority: Major
>       Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis in Spark2.3 without L486&487.pdf, q19.sql
>
> In our test environment, we found a serious performance degradation in
> Spark 2.3 when running TPC-DS on SKX 8180; several queries regress badly.
> For example, TPC-DS Q19 needs 126 seconds with Spark 2.3 on 3TB data, while
> it needs only 29 seconds with Spark 2.1. We investigated this problem and
> found the root cause in community patch SPARK-21052, which adds metrics to
> the hash join process. The impacted code is
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds
> with them.
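The two cited lines follow a common pattern: a metric counter is bumped once per probed row inside the join's hot loop, so the bookkeeping cost is paid on every lookup. A hedged sketch of that shape; `LookupMetrics` and `probe` are illustrative names, not Spark's `HashedRelation` API:

```scala
// Illustrative metrics holder, mirroring the two counters SPARK-21052
// added (key lookups and hash collisions). Not Spark's real class.
final class LookupMetrics {
  var keyLookups: Long = 0L
  var hashCollisions: Long = 0L // second counter from the patch; unused here
}

// The increment runs once per probed row. In a join that probes billions
// of rows, even this single add in the hot path is measurable, which is
// the shape of the regression reported above.
def probe(table: Map[Int, String], key: Int, m: LookupMetrics): Option[String] = {
  m.keyLookups += 1
  table.get(key)
}
```

A typical mitigation (and the direction of the follow-up patch discussed below, as I understand it) is to accumulate such counts in a local variable and flush them to the metric once per batch rather than once per row.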
[jira] [Commented] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode
[ https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708245#comment-16708245 ]

indraneel r commented on SPARK-26206:
-------------------------------------

[~kabhwan] Here are some of my observations:
- The error only occurs when you get data for the second batch. The first
batch goes through fine. Sometimes you may not see the output (not sure why),
but you can see it once you start pumping new data into the kafka topic, and
that is when it throws the error.
- The same query works well with spark 2.3.0.

Here is what some sample data looks like:

{code:java}
[
  {
    "timestamp": 1541043341540,
    "cid": "333-333-333",
    "uid": "11-111-111",
    "sessionId": "11-111-111",
    "merchantId": "",
    "event": "-222-222",
    "ip": "1.1.1.1",
    "refUrl": "",
    "referrer": "",
    "section": "lorem",
    "tag": "lorem,ipsum",
    "eventType": "Random_event_1",
    "sid": "qwwewew"
  },
  {
    "timestamp": 1541043341540,
    "cid": "333-444-444",
    "uid": "11-555-111",
    "sessionId": "11-111-111",
    "merchantId": "3331",
    "event": "-222-333",
    "ip": "1.1.2.1",
    "refUrl": "",
    "referrer": "",
    "section": "ipsum",
    "tag": "lorem,ipsum2",
    "eventType": "Random_event_2",
    "sid": "xxxdfffwewe"
  }
]
{code}

> Spark structured streaming with kafka integration fails in update mode
> ----------------------------------------------------------------------
>
>                 Key: SPARK-26206
>                 URL: https://issues.apache.org/jira/browse/SPARK-26206
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.0
>         Environment: Operating system: MacOS Mojave
> spark version: 2.4.0
> spark-sql-kafka-0-10: 2.4.0
> kafka version: 1.1.1
> scala version: 2.12.7
>            Reporter: indraneel r
>            Priority: Major
>
> Spark structured streaming with kafka integration fails in update mode with
> a compilation exception in code generation.
> Here's the code that was executed:
> {code:java}
> override def main(args: Array[String]): Unit = {
>   val spark = SparkSession
>     .builder
>     .master("local[*]")
>     .appName("SparkStreamingTest")
>     .getOrCreate()
>
>   val kafkaParams = Map[String, String](
>     "kafka.bootstrap.servers" -> "localhost:9092",
>     "startingOffsets" -> "earliest",
>     "subscribe" -> "test_events")
>
>   val schema = Encoders.product[UserEvent].schema
>   val query = spark.readStream.format("kafka")
>     .options(kafkaParams)
>     .load()
>     .selectExpr("CAST(value AS STRING) as message")
>     .select(from_json(col("message"), schema).as("json"))
>     .select("json.*")
>     .groupBy(window(col("event_time"), "10 minutes"))
>     .count()
>     .writeStream
>     .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
>       println(s"batch : ${batchId}")
>       batch.show(false)
>     }
>     .outputMode("update")
>     .start()
>
>   query.awaitTermination()
> }
> {code}
> It succeeds for batch 0 but fails for batch 1 with the following exception
> when more data arrives in the stream.
> {code:java}
> 18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, Column 18: A method named "putLong" is not declared in any enclosing class nor any supertype, nor through a static import
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, Column 18: A method named "putLong" is not declared in any enclosing class nor any supertype, nor through a static import
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
>     at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
>     at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
>     at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
>     at org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
>     at org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
>     at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
>     at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
>     at org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
>     at org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
>     at
[jira] [Assigned] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26155:
------------------------------------
    Assignee:  (was: Apache Spark)

> Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS
> in 3TB scale
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-26155
>                 URL: https://issues.apache.org/jira/browse/SPARK-26155
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>            Reporter: Ke Jia
>            Priority: Major
>       Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis in Spark2.3 without L486&487.pdf, q19.sql
>
> In our test environment, we found a serious performance degradation in
> Spark 2.3 when running TPC-DS on SKX 8180; several queries regress badly.
> For example, TPC-DS Q19 needs 126 seconds with Spark 2.3 on 3TB data, while
> it needs only 29 seconds with Spark 2.1. We investigated this problem and
> found the root cause in community patch SPARK-21052, which adds metrics to
> the hash join process. The impacted code is
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds
> with them.
[jira] [Assigned] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26155:
------------------------------------
    Assignee: Apache Spark

> Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS
> in 3TB scale
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-26155
>                 URL: https://issues.apache.org/jira/browse/SPARK-26155
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>            Reporter: Ke Jia
>            Assignee: Apache Spark
>            Priority: Major
>       Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis in Spark2.3 without L486&487.pdf, q19.sql
>
> In our test environment, we found a serious performance degradation in
> Spark 2.3 when running TPC-DS on SKX 8180; several queries regress badly.
> For example, TPC-DS Q19 needs 126 seconds with Spark 2.3 on 3TB data, while
> it needs only 29 seconds with Spark 2.1. We investigated this problem and
> found the root cause in community patch SPARK-21052, which adds metrics to
> the hash join process. The impacted code is
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds
> with them.
[jira] [Commented] (SPARK-26051) Can't create table with column name '22222d'
[ https://issues.apache.org/jira/browse/SPARK-26051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708233#comment-16708233 ]

Dilip Biswal commented on SPARK-26051:
--------------------------------------

[~xiejuntao1...@163.com] Hello, I took a quick look at this. `2d` is parsed
as a DOUBLE_LITERAL; that's the reason it's not allowed as a column name.
Can you check other systems? I checked hive and db2, and both of these
systems do not allow numeric literals as column names.
{quote}
db2 => create table t1(2d int)
DB21034E  The command was processed as an SQL statement because it was not a
valid Command Line Processor command.  During SQL processing it returned:
SQL0103N  The numeric literal "2d" is not valid.  SQLSTATE=42604
{quote}

> Can't create table with column name '2d'
> ----------------------------------------
>
>                 Key: SPARK-26051
>                 URL: https://issues.apache.org/jira/browse/SPARK-26051
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Xie Juntao
>            Priority: Minor
>
> I can't create a table in which the column name is '2d' when I use
> spark-sql. It seems to be a SQL parser bug, because creating a table with
> the column name '2m' works.
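The lexing behaviour Dilip describes can be illustrated without Spark: a token made of digits followed by `d`/`D` matches the shape of a double literal, so the lexer emits a literal token before identifier rules are ever considered, while `2m` matches no literal rule and can still lex as an identifier. The regex below is an illustrative stand-in, not Spark's actual grammar; backtick-quoting the name (e.g. `` `2d` ``) is, as far as I know, the usual Spark SQL workaround for such names:

```scala
// Stand-in for the DOUBLE_LITERAL token shape: digits followed by d/D.
// Spark's real grammar (SqlBase.g4) is richer; this only illustrates why
// "2d" is claimed by a literal rule while "2m" is not.
val doubleLiteralShape = """\d+[dD]""".r

def lexesAsDoubleLiteral(token: String): Boolean =
  doubleLiteralShape.pattern.matcher(token).matches()
```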
> {code:java}
> spark-sql> create table t1(2d int);
> Error in query:
> no viable alternative at input 'create table t1(2d'(line 1, pos 16)
>
> == SQL ==
> create table t1(2d int)
>                 ^^^
>
> spark-sql> create table t1(2m int);
> 18/11/14 09:13:53 INFO HiveMetaStore: 0: get_database: global_temp
> 18/11/14 09:13:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: global_temp
> 18/11/14 09:13:53 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
> 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_database: default
> 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
> 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_database: default
> 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
> 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_table : db=default tbl=t1
> 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=t1
> 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_database: default
> 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
> 18/11/14 09:13:55 INFO HiveMetaStore: 0: create_table: Table(tableName:t1, dbName:default, owner:root, createTime:1542158033, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:2m, type:int, comment:null)], location:file:/opt/UQuery/spark_/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t1, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{spark.sql.sources.schema.part.0={"type":"struct","fields":[{"name":"2m","type":"integer","nullable":true,"metadata":{}}]}, spark.sql.sources.schema.numParts=1, spark.sql.create.version=2.3.1}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, rolePrivileges:null))
> 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=create_table: Table(tableName:t1, dbName:default, owner:root, createTime:1542158033, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:2m, type:int, comment:null)], location:file:/opt/UQuery/spark_/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t1, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{spark.sql.sources.schema.part.0={"type":"struct","fields":[{"name":"2m","type":"integer","nullable":true,"metadata":{}}]}, spark.sql.sources.schema.numParts=1, spark.sql.create.version=2.3.1}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, rolePrivileges:null))
> 18/11/14 09:13:55 WARN HiveMetaStore: Location:
[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708225#comment-16708225 ]

Wenchen Fan commented on SPARK-26155:
-------------------------------------

Great to hear that! Looking forward to your patch :)

> Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS
> in 3TB scale
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-26155
>                 URL: https://issues.apache.org/jira/browse/SPARK-26155
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>            Reporter: Ke Jia
>            Priority: Major
>       Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis in Spark2.3 without L486&487.pdf, q19.sql
>
> In our test environment, we found a serious performance degradation in
> Spark 2.3 when running TPC-DS on SKX 8180; several queries regress badly.
> For example, TPC-DS Q19 needs 126 seconds with Spark 2.3 on 3TB data, while
> it needs only 29 seconds with Spark 2.1. We investigated this problem and
> found the root cause in community patch SPARK-21052, which adds metrics to
> the hash join process. The impacted code is
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds
> with them.
[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708223#comment-16708223 ]

Ke Jia commented on SPARK-26155:
--------------------------------

[~cloud_fan] [~viirya] Spark 2.3 with the optimized patch achieves the same
performance as Spark 2.1:

||spark2.1||spark2.3 with patch||
|49s|47s|

> Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS
> in 3TB scale
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-26155
>                 URL: https://issues.apache.org/jira/browse/SPARK-26155
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>            Reporter: Ke Jia
>            Priority: Major
>       Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis in Spark2.3 without L486&487.pdf, q19.sql
>
> In our test environment, we found a serious performance degradation in
> Spark 2.3 when running TPC-DS on SKX 8180; several queries regress badly.
> For example, TPC-DS Q19 needs 126 seconds with Spark 2.3 on 3TB data, while
> it needs only 29 seconds with Spark 2.1. We investigated this problem and
> found the root cause in community patch SPARK-21052, which adds metrics to
> the hash join process. The impacted code is
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds
> with them.
[jira] [Commented] (SPARK-26244) Do not use case class as public API
[ https://issues.apache.org/jira/browse/SPARK-26244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708220#comment-16708220 ]

Wenchen Fan commented on SPARK-26244:
-------------------------------------

I took a quick look, and it seems most of the public case class APIs are OK,
like the `SparkListenerEvent` implementations. Feel free to add more subtasks
if you find something to fix. cc [~srowen] [~hyukjin.kwon]

> Do not use case class as public API
> -----------------------------------
>
>                 Key: SPARK-26244
>                 URL: https://issues.apache.org/jira/browse/SPARK-26244
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 3.0.0
>            Reporter: Wenchen Fan
>            Priority: Major
>              Labels: release-notes
>
> It's a bad idea to use a case class as a public API, as it has a very wide
> surface: the copy method, its fields, the companion object, etc. We don't
> want to expose so much to end users; usually we only want to expose a few
> methods.
> We should use a pure trait as the public API, and a case class as the
> implementation, which should be private and hidden from end users.
> Changing a class to an interface is not binary compatible (but it is source
> compatible), so 3.0 is a good chance to do it.
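The proposal in the description can be sketched in a few lines: expose a pure trait, keep the case class private inside the companion object, and hand out instances through a factory method. The names below (`StageInfo` and its fields) are illustrative, not an actual Spark API:

```scala
// Public surface: just the trait. End users cannot see copy(), unapply(),
// productArity, or any other member synthesized for the case class.
trait StageInfo {
  def stageId: Int
  def name: String
}

object StageInfo {
  // Private implementation: the case class's wide auto-generated API is
  // hidden; only the factory below is visible outside this object.
  private case class StageInfoImpl(stageId: Int, name: String) extends StageInfo

  def apply(stageId: Int, name: String): StageInfo = StageInfoImpl(stageId, name)
}
```

A side effect of this design is that new fields can later be added to the implementation without breaking binary compatibility of the public trait, which is exactly the flexibility the ticket is after.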
[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708219#comment-16708219 ]

Yang Jie commented on SPARK-26155:
----------------------------------

[~cloud_fan] [~viirya] Maybe we don't need to revert this patch. After
offline discussion and testing with [~Jk_Self], we will provide a patch to
optimize the performance.

> Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS
> in 3TB scale
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-26155
>                 URL: https://issues.apache.org/jira/browse/SPARK-26155
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>            Reporter: Ke Jia
>            Priority: Major
>       Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis in Spark2.3 without L486&487.pdf, q19.sql
>
> In our test environment, we found a serious performance degradation in
> Spark 2.3 when running TPC-DS on SKX 8180; several queries regress badly.
> For example, TPC-DS Q19 needs 126 seconds with Spark 2.3 on 3TB data, while
> it needs only 29 seconds with Spark 2.1. We investigated this problem and
> found the root cause in community patch SPARK-21052, which adds metrics to
> the hash join process. The impacted code is
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds
> with them.
[jira] [Commented] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708218#comment-16708218 ]

Hyukjin Kwon commented on SPARK-26259:
--------------------------------------

Can you try the examples against the current master branch of Spark, and
describe the expected input and output if there's an issue?

> RecordSeparator other than newline discovers incorrect schema
> -------------------------------------------------------------
>
>                 Key: SPARK-26259
>                 URL: https://issues.apache.org/jira/browse/SPARK-26259
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: PoojaMurarka
>            Priority: Major
>
> Although SPARK-21289 (https://issues.apache.org/jira/browse/SPARK-21289),
> fixed in Spark 2.3, allows record separators other than newline, this does
> not work when the schema is not specified, i.e. while inferring the schema.
> Let me try to explain this using the data and scenarios below.
> Input data (input_data.csv), *+where recordSeparator is "\t"+*:
> {noformat}
> "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed"
> "2012-01-01","0","0","0","0","1","9","9.1","66","0"
> "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat}
> *Case 1: Schema defined:* The Spark code below with a defined *schema* reads
> the data correctly:
> {code:java}
> val customSchema = StructType(Array(
>   StructField("dteday", DateType, true),
>   StructField("hr", IntegerType, true),
>   StructField("holiday", IntegerType, true),
>   StructField("weekday", IntegerType, true),
>   StructField("workingday", DateType, true),
>   StructField("weathersit", IntegerType, true),
>   StructField("temp", IntegerType, true),
>   StructField("atemp", DoubleType, true),
>   StructField("hum", IntegerType, true),
>   StructField("windspeed", IntegerType, true)));
>
> Dataset ds = executionContext.getSparkSession().read().format( "csv" )
>   .option( "header", true )
>   .option( "schema", customSchema)
>   .option( "sep", "," )
>   .load( "input_data.csv" );
> {code}
> *Case 2: Schema not defined (inferSchema is used):* Incorrect data parsing
> is done, i.e. the entire data is read as column names.
> {code:java}
> Dataset ds = executionContext.getSparkSession().read().format( "csv" )
>   .option( "header", true )
>   .option( "inferSchema", true)
>   .option( "sep", "," )
>   .load( "input_data.csv" );
> {code}
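The reported failure mode can be illustrated without Spark: if records end with a tab but the reader splits on newlines, the whole file arrives as a single record, and header handling then turns that one giant record into the column names. A minimal sketch of record splitting under an explicit separator (illustrative only, not Spark's CSV reader):

```scala
// Split raw input into records on an arbitrary separator string.
// Pattern.quote makes the separator literal, so "\t", "|", etc. work as-is.
def splitRecords(data: String, recordSep: String): Seq[String] =
  data.split(java.util.regex.Pattern.quote(recordSep)).toSeq
```

With the separator honored, the header record and data records come apart correctly; with a hard-coded newline, a tab-separated file yields a single record, matching the "entire data is read as column names" symptom above.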
[jira] [Assigned] (SPARK-26262) Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false
[ https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26262:
------------------------------------
    Assignee: Apache Spark

> Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false
> -----------------------------------------------------------
>
>                 Key: SPARK-26262
>                 URL: https://issues.apache.org/jira/browse/SPARK-26262
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Takeshi Yamamuro
>            Assignee: Apache Spark
>            Priority: Minor
>
> For better test coverage, we need to set `WHOLESTAGE_CODEGEN_ENABLED` to
> `false` for interpreted execution tests when running `SQLQueryTestSuite`.
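The suite change amounts to running the same queries once per codegen setting so interpreted execution is also covered. A generic sketch of the run-under-config pattern; the mutable map stands in for `SQLConf`, and `withConf` is illustrative rather than the real test helper:

```scala
// Run `body` with `key` temporarily set to `value`, restoring the previous
// value (or absence) afterwards, so settings never leak across tests.
def withConf[T](conf: scala.collection.mutable.Map[String, String])(key: String, value: String)(body: => T): T = {
  val old = conf.get(key)
  conf(key) = value
  try body
  finally {
    old match {
      case Some(v) => conf(key) = v
      case None    => conf -= key
    }
  }
}
```

A suite would then wrap its query runner in `Seq("true", "false").foreach { v => withConf(conf)("spark.sql.codegen.wholeStage", v) { runQueries() } }` to exercise both the codegen and interpreted paths.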
[jira] [Commented] (SPARK-26262) Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false
[ https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708184#comment-16708184 ] Apache Spark commented on SPARK-26262: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/23213 > Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false > --- > > Key: SPARK-26262 > URL: https://issues.apache.org/jira/browse/SPARK-26262 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > For better test coverage, we need to set `false` at > `WHOLESTAGE_CODEGEN_ENABLED` for interpreter execution tests when running > `SQLQueryTestSuite`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26262) Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false
[ https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26262: Assignee: (was: Apache Spark) > Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false > --- > > Key: SPARK-26262 > URL: https://issues.apache.org/jira/browse/SPARK-26262 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > For better test coverage, we need to set `false` at > `WHOLESTAGE_CODEGEN_ENABLED` for interpreter execution tests when running > `SQLQueryTestSuite`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26262) Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false
Takeshi Yamamuro created SPARK-26262: Summary: Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false Key: SPARK-26262 URL: https://issues.apache.org/jira/browse/SPARK-26262 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Takeshi Yamamuro For better test coverage, we need to set `WHOLESTAGE_CODEGEN_ENABLED` to `false` for interpreter execution tests when running `SQLQueryTestSuite`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
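The coverage idea behind this ticket — run the same query suite on both the code-generated and the interpreted execution path and require identical results — can be sketched in miniature. This is a hypothetical illustration in plain Python, not Spark's SQLQueryTestSuite; the names (`run_suite`, `eval_codegen`, `eval_interpreted`) are invented for the example.

```python
import ast
import operator

QUERIES = ["1 + 2", "3 * 4 - 5", "(7 + 1) / 2"]

def eval_codegen(expr):
    # "whole-stage codegen" analogue: compile the expression once, then run it
    return eval(compile(expr, "<gen>", "eval"))

def eval_interpreted(expr):
    # "interpreter" analogue: walk the AST node by node
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError(node)
    return walk(ast.parse(expr, mode="eval"))

def run_suite():
    # Running every query on both paths catches bugs that live in only one path.
    results = {}
    for q in QUERIES:
        a, b = eval_codegen(q), eval_interpreted(q)
        assert a == b, f"paths disagree on {q}: {a} != {b}"
        results[q] = a
    return results
```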
[jira] [Comment Edited] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708161#comment-16708161 ] PoojaMurarka edited comment on SPARK-26259 at 12/4/18 4:14 AM: --- The fix for using custom record delimiters seems to be only available when schema is specified based on the examples. Please correct me if I am wrong. Rather I am looking for setting custom record delimiter while discovery schema i.e. only use *inferschema* as true rather than specifying schema. Let me know if above issue covers both scenarios. was (Author: pooja.murarka): The fix for using custom record delimiters seems to be only available when schema is specified based on the examples. Please correct me if I am wrong. Rather I am looking for setting custom record delimiter while discovery schema i.e. only use inferschema as true rather than specifying schema. Let me know if above issue covers both scenarios. > RecordSeparator other than newline discovers incorrect schema > - > > Key: SPARK-26259 > URL: https://issues.apache.org/jira/browse/SPARK-26259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: PoojaMurarka >Priority: Major > > Though JIRA: https://issues.apache.org/jira/browse/SPARK-21289 has been fixed > in SPARK 2.3 which allows record Separators other than new line but this > doesn't work when schema is not specified i.e. 
while inferring the schema > Let me try to explain this using below data and scenarios: > Input Data - (input_data.csv) as shown below: *+where recordSeparator is > "\t"+* > {noformat} > "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" > "2012-01-01","0","0","0","0","1","9","9.1","66","0" > "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat} > *Case 1: Schema Defined *: Below Spark code with defined *schema* reads data > correctly: > {code:java} > val customSchema = StructType(Array( > StructField("dteday", DateType, true), > StructField("hr", IntegerType, true), > StructField("holiday", IntegerType, true), > StructField("weekday", IntegerType, true), > StructField("workingday", DateType, true), > StructField("weathersit", IntegerType, true), > StructField("temp", IntegerType, true), > StructField("atemp", DoubleType, true), > StructField("hum", IntegerType, true), > StructField("windspeed", IntegerType, true))); > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "schema", customSchema) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} > *Case 2: Schema not defined (inferSchema is used):* Incorrect data parsing is > done i.e. entire data is read as column names. > {code:java} > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "inferSchema", true) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708161#comment-16708161 ] PoojaMurarka commented on SPARK-26259: -- The fix for using custom record delimiters seems to be only available when schema is specified based on the examples. Please correct me if I am wrong. Rather I am looking for setting custom record delimiter while discovery schema i.e. only use inferschema as true rather than specifying schema. Let me know if above issue covers both scenarios. > RecordSeparator other than newline discovers incorrect schema > - > > Key: SPARK-26259 > URL: https://issues.apache.org/jira/browse/SPARK-26259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: PoojaMurarka >Priority: Major > > Though JIRA: https://issues.apache.org/jira/browse/SPARK-21289 has been fixed > in SPARK 2.3 which allows record Separators other than new line but this > doesn't work when schema is not specified i.e. 
while inferring the schema > Let me try to explain this using below data and scenarios: > Input Data - (input_data.csv) as shown below: *+where recordSeparator is > "\t"+* > {noformat} > "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" > "2012-01-01","0","0","0","0","1","9","9.1","66","0" > "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat} > *Case 1: Schema Defined *: Below Spark code with defined *schema* reads data > correctly: > {code:java} > val customSchema = StructType(Array( > StructField("dteday", DateType, true), > StructField("hr", IntegerType, true), > StructField("holiday", IntegerType, true), > StructField("weekday", IntegerType, true), > StructField("workingday", DateType, true), > StructField("weathersit", IntegerType, true), > StructField("temp", IntegerType, true), > StructField("atemp", DoubleType, true), > StructField("hum", IntegerType, true), > StructField("windspeed", IntegerType, true))); > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "schema", customSchema) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} > *Case 2: Schema not defined (inferSchema is used):* Incorrect data parsing is > done i.e. entire data is read as column names. > {code:java} > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "inferSchema", true) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26261) Spark does not check completeness temporary file
[ https://issues.apache.org/jira/browse/SPARK-26261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708153#comment-16708153 ] Hyukjin Kwon commented on SPARK-26261: -- Mind if I ask what initial test you ran? > Spark does not check completeness temporary file > - > > Key: SPARK-26261 > URL: https://issues.apache.org/jira/browse/SPARK-26261 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Jialin LIu >Priority: Minor > > Spark does not check temporary files' completeness. When persisting to disk > is enabled on some RDDs, a bunch of temporary files will be created in the > blockmgr folder. The block manager is able to detect missing blocks, but it is > not able to detect file content being modified during execution. > Our initial test shows that if we truncate a block file before it is used > by executors, the program finishes without detecting any error, but the > result content is totally wrong. > We believe there should be a checksum on every RDD file block, and these > files should be protected by it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26251) isnan function not picking non-numeric values
[ https://issues.apache.org/jira/browse/SPARK-26251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708129#comment-16708129 ] Hyukjin Kwon commented on SPARK-26251: -- Why should it pick "po box 7896"? > isnan function not picking non-numeric values > - > > Key: SPARK-26251 > URL: https://issues.apache.org/jira/browse/SPARK-26251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kunal Rao >Priority: Minor > > import org.apache.spark.sql.functions._ > List("po box 7896", "8907", > "435435").toDF("rgid").filter(isnan(col("rgid"))).show > > should pick "po box 7896" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26250) Fail to run dataframe.R examples
[ https://issues.apache.org/jira/browse/SPARK-26250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708125#comment-16708125 ] Hyukjin Kwon commented on SPARK-26250: -- Please avoid setting a target version, which is usually reserved for committers. > Fail to run dataframe.R examples > > > Key: SPARK-26250 > URL: https://issues.apache.org/jira/browse/SPARK-26250 > Project: Spark > Issue Type: Test > Components: Examples >Affects Versions: 2.4.0 >Reporter: Jean Pierre PIN >Priority: Major > > I get an error=2 running spark-submit examples/src/main/r/dataframe.R > The script works in RStudio, but I've changed the library(SparkR) line > to this one: > library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) > I am in the top-level directory of the Spark installation and the PATH variable > includes /bin, so spark-submit is found. System: Windows > 7 Ultimate 64-bit. > Exception in thread "main" java.io.IOException: Cannot run program "Rscript": CreateProcess > error=2, The system cannot find the file specified > I think the issue has been known for a long time, but I can't find any post. > Thanks for any answer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
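The `CreateProcess error=2` above means spark-submit could not launch `Rscript`: the R script runner is not on PATH. A quick pre-flight check can confirm this before running the example. This is a plain-Python sketch (on Windows the executable is `Rscript.exe`, found the same way); the helper name is invented for illustration.

```python
import shutil

def rscript_available() -> bool:
    # shutil.which searches PATH the same way the OS process launcher does
    return shutil.which("Rscript") is not None

if not rscript_available():
    print("Add R's bin directory (containing Rscript) to PATH before running "
          "spark-submit examples/src/main/r/dataframe.R")
```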
[jira] [Resolved] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26259. -- Resolution: Duplicate > RecordSeparator other than newline discovers incorrect schema > - > > Key: SPARK-26259 > URL: https://issues.apache.org/jira/browse/SPARK-26259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: PoojaMurarka >Priority: Major > > Though JIRA: https://issues.apache.org/jira/browse/SPARK-21289 has been fixed > in SPARK 2.3 which allows record Separators other than new line but this > doesn't work when schema is not specified i.e. while inferring the schema > Let me try to explain this using below data and scenarios: > Input Data - (input_data.csv) as shown below: *+where recordSeparator is > "\t"+* > {noformat} > "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" > "2012-01-01","0","0","0","0","1","9","9.1","66","0" > "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat} > *Case 1: Schema Defined *: Below Spark code with defined *schema* reads data > correctly: > {code:java} > val customSchema = StructType(Array( > StructField("dteday", DateType, true), > StructField("hr", IntegerType, true), > StructField("holiday", IntegerType, true), > StructField("weekday", IntegerType, true), > StructField("workingday", DateType, true), > StructField("weathersit", IntegerType, true), > StructField("temp", IntegerType, true), > StructField("atemp", DoubleType, true), > StructField("hum", IntegerType, true), > StructField("windspeed", IntegerType, true))); > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "schema", customSchema) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} > *Case 2: Schema not defined (inferSchema is used):* Incorrect data parsing is > done i.e. entire data is read as column names. 
> {code:java} > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "inferSchema", true) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26259: - Component/s: (was: Spark Core) SQL > RecordSeparator other than newline discovers incorrect schema > - > > Key: SPARK-26259 > URL: https://issues.apache.org/jira/browse/SPARK-26259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: PoojaMurarka >Priority: Major > > Though JIRA: https://issues.apache.org/jira/browse/SPARK-21289 has been fixed > in SPARK 2.3 which allows record Separators other than new line but this > doesn't work when schema is not specified i.e. while inferring the schema > Let me try to explain this using below data and scenarios: > Input Data - (input_data.csv) as shown below: *+where recordSeparator is > "\t"+* > {noformat} > "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" > "2012-01-01","0","0","0","0","1","9","9.1","66","0" > "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat} > *Case 1: Schema Defined *: Below Spark code with defined *schema* reads data > correctly: > {code:java} > val customSchema = StructType(Array( > StructField("dteday", DateType, true), > StructField("hr", IntegerType, true), > StructField("holiday", IntegerType, true), > StructField("weekday", IntegerType, true), > StructField("workingday", DateType, true), > StructField("weathersit", IntegerType, true), > StructField("temp", IntegerType, true), > StructField("atemp", DoubleType, true), > StructField("hum", IntegerType, true), > StructField("windspeed", IntegerType, true))); > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "schema", customSchema) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} > *Case 2: Schema not defined (inferSchema is used):* Incorrect data parsing is > done i.e. entire data is read as column names. 
> {code:java} > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "inferSchema", true) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708136#comment-16708136 ] Hyukjin Kwon commented on SPARK-26259: -- This is fixed in SPARK-26108, which should be available in the upcoming Spark version. > RecordSeparator other than newline discovers incorrect schema > - > > Key: SPARK-26259 > URL: https://issues.apache.org/jira/browse/SPARK-26259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: PoojaMurarka >Priority: Major > > Though JIRA: https://issues.apache.org/jira/browse/SPARK-21289 has been fixed > in SPARK 2.3 which allows record Separators other than new line but this > doesn't work when schema is not specified i.e. while inferring the schema > Let me try to explain this using below data and scenarios: > Input Data - (input_data.csv) as shown below: *+where recordSeparator is > "\t"+* > {noformat} > "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" > "2012-01-01","0","0","0","0","1","9","9.1","66","0" > "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat} > *Case 1: Schema Defined *: Below Spark code with defined *schema* reads data > correctly: > {code:java} > val customSchema = StructType(Array( > StructField("dteday", DateType, true), > StructField("hr", IntegerType, true), > StructField("holiday", IntegerType, true), > StructField("weekday", IntegerType, true), > StructField("workingday", DateType, true), > StructField("weathersit", IntegerType, true), > StructField("temp", IntegerType, true), > StructField("atemp", DoubleType, true), > StructField("hum", IntegerType, true), > StructField("windspeed", IntegerType, true))); > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "schema", customSchema) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} > *Case 2: Schema not defined (inferSchema is used):* Incorrect data
parsing is > done i.e. entire data is read as column names. > {code:java} > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "inferSchema", true) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21289) Text based formats do not support custom end-of-line delimiters
[ https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-21289: - Priority: Major (was: Minor) > Text based formats do not support custom end-of-line delimiters > --- > > Key: SPARK-21289 > URL: https://issues.apache.org/jira/browse/SPARK-21289 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.1.1, 2.3.0 >Reporter: Yevgen Galchenko >Priority: Major > > Spark csv and text readers always use default CR, LF or CRLF line terminators > without an option to configure a custom delimiter. > Option "textinputformat.record.delimiter" is not being used to set delimiter > in HadoopFileLinesReader and can only be set for Hadoop RDD when textFile() > is used to read file. > Possible solution would be to change HadoopFileLinesReader and create > LineRecordReader with delimiters specified in configuration. LineRecordReader > already supports passing recordDelimiter in its constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
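The proposed fix above amounts to a record reader that splits a byte stream on an arbitrary delimiter instead of hard-coded CR/LF, while handling delimiters that straddle chunk boundaries. The following is a hypothetical plain-Python sketch of that idea, not Hadoop's actual LineRecordReader.

```python
import io

def read_records(stream, delimiter: bytes, chunk_size: int = 8192):
    """Yield records from a binary stream, split on an arbitrary delimiter."""
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        parts = buf.split(delimiter)
        buf = parts.pop()          # the tail may be an incomplete record
        yield from parts
    if buf:
        yield buf                  # final record without trailing delimiter

# Records separated by "|" rather than a newline:
recs = list(read_records(io.BytesIO(b"r1|r2|r3"), b"|"))
```

Keeping the unsplit tail in `buf` is what makes delimiters that span a chunk boundary come out whole.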
[jira] [Commented] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708134#comment-16708134 ] Hyukjin Kwon commented on SPARK-26259: -- Please avoid setting the fix version, which is usually set only after the issue is actually fixed. > RecordSeparator other than newline discovers incorrect schema > - > > Key: SPARK-26259 > URL: https://issues.apache.org/jira/browse/SPARK-26259 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: PoojaMurarka >Priority: Major > > Though JIRA: https://issues.apache.org/jira/browse/SPARK-21289 has been fixed > in SPARK 2.3 which allows record Separators other than new line but this > doesn't work when schema is not specified i.e. while inferring the schema > Let me try to explain this using below data and scenarios: > Input Data - (input_data.csv) as shown below: *+where recordSeparator is > "\t"+* > {noformat} > "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" > "2012-01-01","0","0","0","0","1","9","9.1","66","0" > "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat} > *Case 1: Schema Defined *: Below Spark code with defined *schema* reads data > correctly: > {code:java} > val customSchema = StructType(Array( > StructField("dteday", DateType, true), > StructField("hr", IntegerType, true), > StructField("holiday", IntegerType, true), > StructField("weekday", IntegerType, true), > StructField("workingday", DateType, true), > StructField("weathersit", IntegerType, true), > StructField("temp", IntegerType, true), > StructField("atemp", DoubleType, true), > StructField("hum", IntegerType, true), > StructField("windspeed", IntegerType, true))); > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "schema", customSchema) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} > *Case 2: Schema not defined (inferSchema is used):* Incorrect
data parsing is > done i.e. entire data is read as column names. > {code:java} > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "inferSchema", true) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26259: - Fix Version/s: (was: 2.4.1) > RecordSeparator other than newline discovers incorrect schema > - > > Key: SPARK-26259 > URL: https://issues.apache.org/jira/browse/SPARK-26259 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: PoojaMurarka >Priority: Major > > Though JIRA: https://issues.apache.org/jira/browse/SPARK-21289 has been fixed > in SPARK 2.3 which allows record Separators other than new line but this > doesn't work when schema is not specified i.e. while inferring the schema > Let me try to explain this using below data and scenarios: > Input Data - (input_data.csv) as shown below: *+where recordSeparator is > "\t"+* > {noformat} > "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" > "2012-01-01","0","0","0","0","1","9","9.1","66","0" > "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat} > *Case 1: Schema Defined *: Below Spark code with defined *schema* reads data > correctly: > {code:java} > val customSchema = StructType(Array( > StructField("dteday", DateType, true), > StructField("hr", IntegerType, true), > StructField("holiday", IntegerType, true), > StructField("weekday", IntegerType, true), > StructField("workingday", DateType, true), > StructField("weathersit", IntegerType, true), > StructField("temp", IntegerType, true), > StructField("atemp", DoubleType, true), > StructField("hum", IntegerType, true), > StructField("windspeed", IntegerType, true))); > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "schema", customSchema) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} > *Case 2: Schema not defined (inferSchema is used):* Incorrect data parsing is > done i.e. entire data is read as column names. 
> {code:java} > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "inferSchema", true) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26255) Custom error/exception is not thrown for the SQL tab when UI filters are added in spark-sql launch
[ https://issues.apache.org/jira/browse/SPARK-26255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708132#comment-16708132 ] Hyukjin Kwon commented on SPARK-26255: -- It would be great if some screenshots from the UI were uploaded here. > Custom error/exception is not thrown for the SQL tab when UI filters are > added in spark-sql launch > -- > > Key: SPARK-26255 > URL: https://issues.apache.org/jira/browse/SPARK-26255 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.3.2 > Environment: 【Test Environment】: > Server OS :-SUSE > No. of Cluster Node:-3 > Spark Version:- 2.3.2 > Hadoop Version:-3.1 >Reporter: Sushanta Sen >Priority: Major > > 【Detailed description】:Custom error is not thrown for the SQL tab when UI > filters are added in spark-sql launch > 【Precondition】: > 1.Cluster is up and running【Test step】: > 1. Launch spark sql as below: > [spark-sql --master yarn --conf > spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter > --conf > spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=simple"] > 2. Go to Yarn application list UI link > 3. Launch the application master for the Spark-SQL app ID > 4. It will display an error > 5. Append /executors, /stages, /jobs, /environment, /SQL > 【Expect Output】:An error should be displayed "An error has occurred. Please > check for all the TABS" > 【Actual Output】:The error message is displayed for all the tabs except the SQL > tab. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26255) Custom error/exception is not thrown for the SQL tab when UI filters are added in spark-sql launch
[ https://issues.apache.org/jira/browse/SPARK-26255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26255: - Component/s: (was: Spark Core) Web UI SQL > Custom error/exception is not thrown for the SQL tab when UI filters are > added in spark-sql launch > -- > > Key: SPARK-26255 > URL: https://issues.apache.org/jira/browse/SPARK-26255 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.3.2 > Environment: 【Test Environment】: > Server OS :-SUSE > No. of Cluster Node:-3 > Spark Version:- 2.3.2 > Hadoop Version:-3.1 >Reporter: Sushanta Sen >Priority: Major > > 【Detailed description】:Custom error is not thrown for the SQL tab when UI > filters are added in spark-sql launch > 【Precondition】: > 1.Cluster is up and running【Test step】: > 1. Launch spark sql as below: > [spark-sql --master yarn --conf > spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter > --conf > spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=simple"] > 2. Go to Yarn application list UI link > 3. Launch the application master for the Spark-SQL app ID > 4. It will display an error > 5. Append /executors, /stages, /jobs, /environment, /SQL > 【Expect Output】:An error should be displayed "An error has occurred. Please > check for all the TABS > 【Actual Output】:The error message is displayed for all the tabs except SQL > tab . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26149) Read UTF8String from Parquet/ORC may be incorrect
[ https://issues.apache.org/jira/browse/SPARK-26149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708119#comment-16708119 ] Yuming Wang commented on SPARK-26149: - This is not a Spark bug, but a Hive bug. !image-2018-12-04-10-55-49-369.png! > Read UTF8String from Parquet/ORC may be incorrect > - > > Key: SPARK-26149 > URL: https://issues.apache.org/jira/browse/SPARK-26149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: SPARK-26149.snappy.parquet, > image-2018-12-04-10-55-49-369.png > > > How to reproduce: > {code:bash} > scala> > spark.read.parquet("/Users/yumwang/SPARK-26149/SPARK-26149.snappy.parquet").selectExpr("s1 > = s2").show > +-+ > |(s1 = s2)| > +-+ > |false| > +-+ > scala> val first = > spark.read.parquet("/Users/yumwang/SPARK-26149/SPARK-26149.snappy.parquet").collect().head > first: org.apache.spark.sql.Row = > [a0750c1f13f0k5��F8j���b�Ro'4da96,a0750c1f13f0k5��F8j���b�Ro'4da96] > scala> println(first.getString(0).equals(first.getString(1))) > true > {code} > {code:sql} > hive> CREATE TABLE `tb1` (`s1` STRING, `s2` STRING) > > stored as parquet > > location "/Users/yumwang/SPARK-26149"; > OK > Time taken: 0.224 seconds > hive> select s1 = s2 from tb1; > OK > true > Time taken: 0.167 seconds, Fetched: 1 row(s) > {code} > As you can see, only UTF8String returns {{false}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
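The symptom above is consistent with comparing at different layers: Spark's UTF8String compares raw bytes, while Hive/Java compare java.lang.String after a lossy UTF-8 decode. A plain-Python sketch (illustrative, not Spark or Hive code) of how two *different* byte sequences can decode, with replacement, to the *same* string:

```python
# Two values that differ only in an invalid UTF-8 tail byte.
b1, b2 = b"abc\xff", b"abc\xfe"

assert b1 != b2                                   # byte-level: not equal
s1 = b1.decode("utf-8", errors="replace")
s2 = b2.decode("utf-8", errors="replace")
assert s1 == s2 == "abc\ufffd"                    # both tails became U+FFFD
```

So `s1 = s2` can report true after decoding even though the stored bytes, and hence the byte-wise comparison, differ.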
[jira] [Issue Comment Deleted] (SPARK-26251) isnan function not picking non-numeric values
[ https://issues.apache.org/jira/browse/SPARK-26251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26251: - Comment: was deleted (was: Why does it should pick "po box 7896"?) > isnan function not picking non-numeric values > - > > Key: SPARK-26251 > URL: https://issues.apache.org/jira/browse/SPARK-26251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kunal Rao >Priority: Minor > > import org.apache.spark.sql.functions._ > List("po box 7896", "8907", > "435435").toDF("rgid").filter(isnan(col("rgid"))).show > > should pick "po box 7896" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26251) isnan function not picking non-numeric values
[ https://issues.apache.org/jira/browse/SPARK-26251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26251: - Component/s: (was: Spark Core) SQL > isnan function not picking non-numeric values > - > > Key: SPARK-26251 > URL: https://issues.apache.org/jira/browse/SPARK-26251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kunal Rao >Priority: Minor > > import org.apache.spark.sql.functions._ > List("po box 7896", "8907", > "435435").toDF("rgid").filter(isnan(col("rgid"))).show > > should pick "po box 7896" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26251) isnan function not picking non-numeric values
[ https://issues.apache.org/jira/browse/SPARK-26251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708128#comment-16708128 ] Hyukjin Kwon commented on SPARK-26251: -- Why should it pick "po box 7896"? > isnan function not picking non-numeric values > - > > Key: SPARK-26251 > URL: https://issues.apache.org/jira/browse/SPARK-26251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kunal Rao >Priority: Minor > > import org.apache.spark.sql.functions._ > List("po box 7896", "8907", > "435435").toDF("rgid").filter(isnan(col("rgid"))).show > > should pick "po box 7896"
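The confusion in this report is that {{isnan}} tests for the floating-point NaN value, not for "not parseable as a number". A plain-Python sketch (the helper names are illustrative, not Spark APIs) contrasts the two checks on the reporter's data:

```python
import math

def is_nan(value):
    """True only for a floating-point NaN, mirroring SQL isnan semantics."""
    return isinstance(value, float) and math.isnan(value)

def is_non_numeric(text):
    """True when a string cannot be parsed as a number (what the report expected)."""
    try:
        float(text)
        return False
    except ValueError:
        return True

values = ["po box 7896", "8907", "435435"]
print([is_nan(v) for v in values])          # [False, False, False]: no value is NaN
print([is_non_numeric(v) for v in values])  # [True, False, False]: only "po box 7896"
```

In Spark itself, something closer to the reporter's intent would be filtering on `col("rgid").cast("double").isNull`, which marks rows whose value cannot be cast to a number.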
[jira] [Updated] (SPARK-26149) Read UTF8String from Parquet/ORC may be incorrect
[ https://issues.apache.org/jira/browse/SPARK-26149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-26149: Attachment: image-2018-12-04-10-55-49-369.png > Read UTF8String from Parquet/ORC may be incorrect > - > > Key: SPARK-26149 > URL: https://issues.apache.org/jira/browse/SPARK-26149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: SPARK-26149.snappy.parquet, > image-2018-12-04-10-55-49-369.png > > > How to reproduce: > {code:bash} > scala> > spark.read.parquet("/Users/yumwang/SPARK-26149/SPARK-26149.snappy.parquet").selectExpr("s1 > = s2").show > +-+ > |(s1 = s2)| > +-+ > |false| > +-+ > scala> val first = > spark.read.parquet("/Users/yumwang/SPARK-26149/SPARK-26149.snappy.parquet").collect().head > first: org.apache.spark.sql.Row = > [a0750c1f13f0k5��F8j���b�Ro'4da96,a0750c1f13f0k5��F8j���b�Ro'4da96] > scala> println(first.getString(0).equals(first.getString(1))) > true > {code} > {code:sql} > hive> CREATE TABLE `tb1` (`s1` STRING, `s2` STRING) > > stored as parquet > > location "/Users/yumwang/SPARK-26149"; > OK > Time taken: 0.224 seconds > hive> select s1 = s2 from tb1; > OK > true > Time taken: 0.167 seconds, Fetched: 1 row(s) > {code} > As you can see, only UTF8String returns {{false}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26250) Fail to run dataframe.R examples
[ https://issues.apache.org/jira/browse/SPARK-26250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708126#comment-16708126 ] Hyukjin Kwon commented on SPARK-26250: -- The error literally means {{Rscript}} is not installed on your computer. > Fail to run dataframe.R examples > > > Key: SPARK-26250 > URL: https://issues.apache.org/jira/browse/SPARK-26250 > Project: Spark > Issue Type: Test > Components: Examples >Affects Versions: 2.4.0 >Reporter: Jean Pierre PIN >Priority: Major > > I get an error=2 running spark-submit examples/src/main/r/dataframe.R > the script is working with Rstudio but i've changed the library(SparkR) line > with this one > library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) > i am at the top root directory of spark installation and the path variable > for /bin is specified in the environment so spark-submit is found. On system > window 7 Ultimate 64bits > read "main" java.io.IOException: Cannot run program "Rscript": CreateProcess > error=2, The system cannot find the file specified > I think the issue is known for a long but i don't find any post. > Thanks for answer.
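spark-submit runs R examples by spawning the external {{Rscript}} binary, so the {{CreateProcess error=2}} simply means that binary is not on PATH. A small, generic Python sketch of the fail-fast check being suggested (not Spark code, and the tool names are just examples):

```python
import shutil

def require_tool(name):
    """Fail fast with a clear message when an external tool is missing from PATH."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(
            f"'{name}' not found on PATH; install it or add its directory to PATH")
    return path

# On the reporter's Windows machine this would raise for "Rscript" until R is
# installed and its bin directory is added to the PATH environment variable.
print(require_tool("sh"))  # "sh" is used here only because it exists on most systems
```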
[jira] [Resolved] (SPARK-26250) Fail to run dataframe.R examples
[ https://issues.apache.org/jira/browse/SPARK-26250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26250. -- Resolution: Invalid > Fail to run dataframe.R examples > > > Key: SPARK-26250 > URL: https://issues.apache.org/jira/browse/SPARK-26250 > Project: Spark > Issue Type: Test > Components: Examples >Affects Versions: 2.4.0 >Reporter: Jean Pierre PIN >Priority: Major > > I get an error=2 running spark-submit examples/src/main/r/dataframe.R > the script is working with Rstudio but i've changed the library(SparkR) line > with this one > library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) > i am at the top root directory of spark installation and the path variable > for /bin is specified in the environment so spark-submit is found. On system > window 7 Ultimate 64bits > read "main" java.io.IOException: Cannot run program "Rscript": CreateProcess > error=2, The system cannot find the file specified > I think the issue is known for a long but i don't find any post. > Thanks for answer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26250) Fail to run dataframe.R examples
[ https://issues.apache.org/jira/browse/SPARK-26250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26250: - Target Version/s: (was: 2.4.0) > Fail to run dataframe.R examples > > > Key: SPARK-26250 > URL: https://issues.apache.org/jira/browse/SPARK-26250 > Project: Spark > Issue Type: Test > Components: Examples >Affects Versions: 2.4.0 >Reporter: Jean Pierre PIN >Priority: Major > > I get an error=2 running spark-submit examples/src/main/r/dataframe.R > the script is working with Rstudio but i've changed the library(SparkR) line > with this one > library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) > i am at the top root directory of spark installation and the path variable > for /bin is specified in the environment so spark-submit is found. On system > window 7 Ultimate 64bits > read "main" java.io.IOException: Cannot run program "Rscript": CreateProcess > error=2, The system cannot find the file specified > I think the issue is known for a long but i don't find any post. > Thanks for answer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26261) Spark does not check completeness temporary file
Jialin LIu created SPARK-26261: -- Summary: Spark does not check completeness temporary file Key: SPARK-26261 URL: https://issues.apache.org/jira/browse/SPARK-26261 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.2 Reporter: Jialin LIu Spark does not check temporary files' completeness. When persisting to disk is enabled on some RDDs, a number of temporary files are created in the blockmgr folder. The block manager can detect missing blocks, but it cannot detect file content being modified during execution. Our initial test shows that if we truncate a block file before it is used by executors, the program finishes without detecting any error, but the result content is totally wrong. We believe there should be a checksum on every RDD block file, and these files should be protected by it.
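The proposal can be illustrated with a toy block format that appends a SHA-256 digest to each block file; a truncated file then fails verification on read instead of silently producing wrong results. This is an illustrative sketch, not Spark's actual block layout:

```python
import hashlib
import os
import tempfile

def write_block(path, payload):
    """Store a block with a trailing 32-byte SHA-256 checksum (toy layout)."""
    digest = hashlib.sha256(payload).digest()
    with open(path, "wb") as f:
        f.write(payload)
        f.write(digest)

def read_block(path):
    """Re-verify the checksum on read; raise if the file was modified or truncated."""
    with open(path, "rb") as f:
        data = f.read()
    payload, digest = data[:-32], data[-32:]
    if hashlib.sha256(payload).digest() != digest:
        raise IOError("block file corrupted or truncated")
    return payload

path = os.path.join(tempfile.mkdtemp(), "block")
write_block(path, b"rdd partition bytes")
assert read_block(path) == b"rdd partition bytes"

# Simulate the reporter's experiment: truncate the block file mid-content.
with open(path, "r+b") as f:
    f.truncate(10)
try:
    read_block(path)
except IOError:
    print("corruption detected")  # with a checksum, the damage is caught
```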
[jira] [Resolved] (SPARK-26149) Read UTF8String from Parquet/ORC may be incorrect
[ https://issues.apache.org/jira/browse/SPARK-26149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-26149. - Resolution: Not A Problem > Read UTF8String from Parquet/ORC may be incorrect > - > > Key: SPARK-26149 > URL: https://issues.apache.org/jira/browse/SPARK-26149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: SPARK-26149.snappy.parquet, > image-2018-12-04-10-55-49-369.png > > > How to reproduce: > {code:bash} > scala> > spark.read.parquet("/Users/yumwang/SPARK-26149/SPARK-26149.snappy.parquet").selectExpr("s1 > = s2").show > +-+ > |(s1 = s2)| > +-+ > |false| > +-+ > scala> val first = > spark.read.parquet("/Users/yumwang/SPARK-26149/SPARK-26149.snappy.parquet").collect().head > first: org.apache.spark.sql.Row = > [a0750c1f13f0k5��F8j���b�Ro'4da96,a0750c1f13f0k5��F8j���b�Ro'4da96] > scala> println(first.getString(0).equals(first.getString(1))) > true > {code} > {code:sql} > hive> CREATE TABLE `tb1` (`s1` STRING, `s2` STRING) > > stored as parquet > > location "/Users/yumwang/SPARK-26149"; > OK > Time taken: 0.224 seconds > hive> select s1 = s2 from tb1; > OK > true > Time taken: 0.167 seconds, Fetched: 1 row(s) > {code} > As you can see, only UTF8String returns {{false}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26260) Summary Task Metrics for Stage Page: Efficient implimentation for SHS when using disk store.
[ https://issues.apache.org/jira/browse/SPARK-26260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid updated SPARK-26260: --- Description: Currently, tasks summary metrics is calculated based on all the tasks, instead of successful tasks. After the JIRA, https://issues.apache.org/jira/browse/SPARK-26119, when using InMemory store, it find task summary metrics for all the successful tasks metrics. But we need to find an efficient implementation for disk store case for SHS. The main bottle neck for disk store is deserialization time overhead. Hints: Need to rework on the way indexing works, so that we can index by specific metrics for successful and failed tasks differently (would be tricky). Also would require changing the disk store version (to invalidate old stores). OR any other efficient solutions. was: Currently, tasks summary metrics is calculated based on all the tasks, instead of successful tasks. After the JIRA, https://issues.apache.org/jira/browse/SPARK-26119, when using InMemory store, it find task summary metrics for all the successful tasks metrics. But we need to find an efficient implementation for disk store case for SHS. The main bottle neck for disk store is deserialization time overhead. Hints: Need to rework on the way indexing works, so that we can index by specific metrics for successful and failed tasks differently (would be tricky). Also would require changing the disk store version (to invalidate old stores). > Summary Task Metrics for Stage Page: Efficient implimentation for SHS when > using disk store. > > > Key: SPARK-26260 > URL: https://issues.apache.org/jira/browse/SPARK-26260 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: shahid >Priority: Major > > Currently, tasks summary metrics is calculated based on all the tasks, > instead of successful tasks. 
> After the JIRA, https://issues.apache.org/jira/browse/SPARK-26119, when using > InMemory store, it find task summary metrics for all the successful tasks > metrics. But we need to find an efficient implementation for disk store case > for SHS. The main bottle neck for disk store is deserialization time overhead. > Hints: Need to rework on the way indexing works, so that we can index by > specific metrics for successful and failed tasks differently (would be > tricky). Also would require changing the disk store version (to invalidate > old stores). > OR any other efficient solutions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25498) Fix SQLQueryTestSuite failures when the interpreter mode enabled
[ https://issues.apache.org/jira/browse/SPARK-25498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708027#comment-16708027 ] Apache Spark commented on SPARK-25498: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/23212 > Fix SQLQueryTestSuite failures when the interpreter mode enabled > > > Key: SPARK-25498 > URL: https://issues.apache.org/jira/browse/SPARK-25498 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26260) Summary Task Metrics for Stage Page: Efficient implimentation for SHS when using disk store.
shahid created SPARK-26260: -- Summary: Summary Task Metrics for Stage Page: Efficient implimentation for SHS when using disk store. Key: SPARK-26260 URL: https://issues.apache.org/jira/browse/SPARK-26260 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0, 3.0.0 Reporter: shahid Currently, the task summary metrics are calculated based on all tasks instead of only successful tasks. After SPARK-26119 (https://issues.apache.org/jira/browse/SPARK-26119), the InMemory store computes the task summary metrics over successful tasks only, but we still need an efficient implementation for the disk store case in the SHS. The main bottleneck for the disk store is deserialization overhead. Hints: rework the way indexing works so that we can index by specific metrics for successful and failed tasks separately (would be tricky). This would also require changing the disk store version (to invalidate old stores).
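As a toy illustration of the hinted approach (not the actual KVStore API), keeping a sorted per-metric index restricted to successful tasks lets a quantile lookup touch a single indexed entry instead of deserializing every task record:

```python
import bisect

# Hypothetical in-memory stand-in for a disk-store index: one sorted list of
# metric values, maintained only for successful tasks, so summary quantiles
# never scan or deserialize failed-task entries.
class MetricIndex:
    def __init__(self):
        self._values = []  # kept sorted, as an on-disk index would be

    def add(self, value):
        bisect.insort(self._values, value)

    def quantile(self, q):
        """Read a single indexed entry instead of deserializing every task."""
        idx = min(int(q * len(self._values)), len(self._values) - 1)
        return self._values[idx]

index = MetricIndex()
tasks = [(10, True), (250, False), (20, True), (30, True), (40, True)]
for duration, succeeded in tasks:
    if succeeded:  # index only successful tasks, per the hint above
        index.add(duration)

print(index.quantile(0.5))  # 30: summary over successful tasks only
```

The tricky parts the hint alludes to (one index per metric, separate success/failure indexing, and bumping the store version so stale indexes are rebuilt) are outside this sketch.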
[jira] [Assigned] (SPARK-26083) Pyspark command is not working properly with default Docker Image build
[ https://issues.apache.org/jira/browse/SPARK-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-26083: -- Assignee: Qi Shao > Pyspark command is not working properly with default Docker Image build > --- > > Key: SPARK-26083 > URL: https://issues.apache.org/jira/browse/SPARK-26083 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Qi Shao >Assignee: Qi Shao >Priority: Minor > Labels: easyfix, newbie, patch, pull-request-available > Fix For: 3.0.0 > > > When I try to run > {code:java} > ./bin/pyspark{code} > in a pod in Kubernetes(image built without change from pyspark Dockerfile), > I'm getting an error: > {code:java} > $SPARK_HOME/bin/pyspark --deploy-mode client --master > k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... > Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type > "help", "copyright", "credits" or "license" for more information. > Could not open PYTHONSTARTUP > IOError: [Errno 2] No such file or directory: > '/opt/spark/python/pyspark/shell.py'{code} > This is because {{pyspark}} folder doesn't exist under {{/opt/spark/python/}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707997#comment-16707997 ] Stavros Kontopoulos edited comment on SPARK-26247 at 12/3/18 11:59 PM: --- Hi [~aholler], Yes I agree there are representation issues, on the other hand a spark specific model format locks you in. I dont know if PMML is a lost cause but at least you can get it running also on limited devices that dont run Spark. Also I understand that if the focus is on improving the internal format then makes sense. was (Author: skonto): Hi [~aholler], Yes I agree there are representation issues, on the other hand a spark specific model format locks you in. I dont know if PMML is a lost cause but at least you can get it running also on limited devices that dont run Spark. > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > Attachments: SPIPMlModelExtensionForOnlineServing.pdf > > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707997#comment-16707997 ] Stavros Kontopoulos commented on SPARK-26247: - Hi [~aholler], Yes I agree there are representation issues, on the other hand a spark specific model format locks you in. I dont know if PMML is a lost cause but at least you can get it running also on limited devices that dont run Spark. > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > Attachments: SPIPMlModelExtensionForOnlineServing.pdf > > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26083) Pyspark command is not working properly with default Docker Image build
[ https://issues.apache.org/jira/browse/SPARK-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26083. Resolution: Fixed Fix Version/s: (was: 2.4.1) 3.0.0 Issue resolved by pull request 23037 [https://github.com/apache/spark/pull/23037] > Pyspark command is not working properly with default Docker Image build > --- > > Key: SPARK-26083 > URL: https://issues.apache.org/jira/browse/SPARK-26083 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Qi Shao >Priority: Minor > Labels: easyfix, newbie, patch, pull-request-available > Fix For: 3.0.0 > > > When I try to run > {code:java} > ./bin/pyspark{code} > in a pod in Kubernetes(image built without change from pyspark Dockerfile), > I'm getting an error: > {code:java} > $SPARK_HOME/bin/pyspark --deploy-mode client --master > k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... > Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type > "help", "copyright", "credits" or "license" for more information. > Could not open PYTHONSTARTUP > IOError: [Errno 2] No such file or directory: > '/opt/spark/python/pyspark/shell.py'{code} > This is because {{pyspark}} folder doesn't exist under {{/opt/spark/python/}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26219) Executor summary is not getting updated for failure jobs in history server UI
[ https://issues.apache.org/jira/browse/SPARK-26219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-26219: --- Fix Version/s: 2.4.1 > Executor summary is not getting updated for failure jobs in history server UI > - > > Key: SPARK-26219 > URL: https://issues.apache.org/jira/browse/SPARK-26219 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: shahid >Assignee: shahid >Priority: Major > Fix For: 2.4.1, 3.0.0 > > Attachments: Screenshot from 2018-11-29 22-13-34.png, Screenshot from > 2018-11-29 22-13-44.png > > > Test step to reproduce: > {code:java} > bin/spark-shell --master yarn --conf spark.executor.instances=3 > sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad > executor")}.collect() > {code} > 1)Open the application from History UI > 2) Go to the executor tab > From History UI: > !Screenshot from 2018-11-29 22-13-34.png! > From Live UI: > !Screenshot from 2018-11-29 22-13-44.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26239) Add configurable auth secret source in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707962#comment-16707962 ] Matt Cheah commented on SPARK-26239: It could work in client mode but is less useful there overall because the user has to determine how to get ahold of that secret file. Nevertheless for cluster mode users that have secret file mounting systems for the driver and executors, it would be a great start. I can start building the code for this. > Add configurable auth secret source in k8s backend > -- > > Key: SPARK-26239 > URL: https://issues.apache.org/jira/browse/SPARK-26239 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Major > > This is a follow up to SPARK-26194, which aims to add auto-generated secrets > similar to the YARN backend. > There's a desire to support different ways to generate and propagate these > auth secrets (e.g. using things like Vault). Need to investigate: > - exposing configuration to support that > - changing SecurityManager so that it can delegate some of the > secret-handling logic to custom implementations > - figuring out whether this can also be used in client-mode, where the driver > is not created by the k8s backend in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26256) Add proper labels when deleting pods
[ https://issues.apache.org/jira/browse/SPARK-26256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-26256: --- Fix Version/s: 2.4.1 > Add proper labels when deleting pods > > > Key: SPARK-26256 > URL: https://issues.apache.org/jira/browse/SPARK-26256 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Major > Fix For: 2.4.1, 3.0.0 > > > As discussed here: > [https://github.com/apache/spark/pull/23136#discussion_r236463330] > we need to add proper labels to avoid killing executors belonging to other > jobs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26256) Add proper labels when deleting pods
[ https://issues.apache.org/jira/browse/SPARK-26256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-26256: -- Assignee: Stavros Kontopoulos > Add proper labels when deleting pods > > > Key: SPARK-26256 > URL: https://issues.apache.org/jira/browse/SPARK-26256 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Major > Fix For: 3.0.0 > > > As discussed here: > [https://github.com/apache/spark/pull/23136#discussion_r236463330] > we need to add proper labels to avoid killing executors belonging to other > jobs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26256) Add proper labels when deleting pods
[ https://issues.apache.org/jira/browse/SPARK-26256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26256. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23209 [https://github.com/apache/spark/pull/23209] > Add proper labels when deleting pods > > > Key: SPARK-26256 > URL: https://issues.apache.org/jira/browse/SPARK-26256 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Major > Fix For: 3.0.0 > > > As discussed here: > [https://github.com/apache/spark/pull/23136#discussion_r236463330] > we need to add proper labels to avoid killing executors belonging to other > jobs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707907#comment-16707907 ] Anne Holler commented on SPARK-26247: - Hi, [~skonto], My basic take on model representation is that any representation that is not the same format that the spark mllib code produces for training and consumes for serving basically introduces additional maintenance toil and potential risk of model serving mismatch. In that sense, spark mllib format is a de facto standard. Unless PMML were to completely replace spark mllib representation as the first class citizen model representation in spark (which doesn't seem to have clear switchover ROI), the team I am on would not choose to move to it, because we do not want to take the risk that the model trained and evaluated wrt spark mllib native representation has some difference when served in batch or online mode from PMML representation. Best regards, Anne > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > Attachments: SPIPMlModelExtensionForOnlineServing.pdf > > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan
[ https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707901#comment-16707901 ] Apache Spark commented on SPARK-19712: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/23211 > EXISTS and Left Semi join do not produce the same plan > -- > > Key: SPARK-19712 > URL: https://issues.apache.org/jira/browse/SPARK-19712 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nattavut Sutyanyong >Priority: Major > > This problem was found during the development of SPARK-18874. > The EXISTS form in the following query: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 > from t3 where t1.t1b=t3.t3b)")}} > gives the optimized plan below: > {code} > == Optimized Logical Plan == > Join Inner, (t1a#7 = t2a#25) > :- Join LeftSemi, (t1b#8 = t3b#58) > : :- Filter isnotnull(t1a#7) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Project [1 AS 1#271, t3b#58] > : +- Relation[t3a#57,t3b#58,t3c#59] parquet > +- Filter isnotnull(t2a#25) >+- Relation[t2a#25,t2b#26,t2c#27] parquet > {code} > whereas a semantically equivalent Left Semi join query below: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on > t1.t1b=t3.t3b")}} > gives the following optimized plan: > {code} > == Optimized Logical Plan == > Join LeftSemi, (t1b#8 = t3b#58) > :- Join Inner, (t1a#7 = t2a#25) > : :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7)) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Filter isnotnull(t2a#25) > : +- Relation[t2a#25,t2b#26,t2c#27] parquet > +- Project [t3b#58] >+- Relation[t3a#57,t3b#58,t3c#59] parquet > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
PoojaMurarka created SPARK-26259: Summary: RecordSeparator other than newline discovers incorrect schema Key: SPARK-26259 URL: https://issues.apache.org/jira/browse/SPARK-26259 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: PoojaMurarka Fix For: 2.4.1 JIRA https://issues.apache.org/jira/browse/SPARK-21289, fixed in Spark 2.3, allows record separators other than newline, but this doesn't work when the schema is not specified, i.e. while inferring the schema. Let me try to explain this using the data and scenarios below. Input data (input_data.csv) as shown below, *+where the record separator is "\t"+*: {noformat} "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" "2012-01-01","0","0","0","0","1","9","9.1","66","0" "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat} *Case 1: Schema defined:* The Spark code below with a defined *schema* reads the data correctly: {code:java} val customSchema = StructType(Array( StructField("dteday", DateType, true), StructField("hr", IntegerType, true), StructField("holiday", IntegerType, true), StructField("weekday", IntegerType, true), StructField("workingday", DateType, true), StructField("weathersit", IntegerType, true), StructField("temp", IntegerType, true), StructField("atemp", DoubleType, true), StructField("hum", IntegerType, true), StructField("windspeed", IntegerType, true))); Dataset<Row> ds = executionContext.getSparkSession().read().format( "csv" ) .option( "header", true ) .option( "schema", customSchema) .option( "sep", "," ) .load( "input_data.csv" ); {code} *Case 2: Schema not defined (inferSchema is used):* Incorrect data parsing is done, i.e. the entire data is read as column names. 
{code:java} Dataset<Row> ds = executionContext.getSparkSession().read().format( "csv" ) .option( "header", true ) .option( "inferSchema", true) .option( "sep", "," ) .load( "input_data.csv" ); {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
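A minimal non-Spark sketch (plain Python, illustrative data only) of why inference breaks here: a parser that is told the record separator is "\t" recovers the header and the rows, while a parser that assumes newline-separated records sees the whole file as a single line, which is effectively what schema inference does in Case 2.

```python
import csv
import io

# Two-column excerpt of the report's data, with "\t" as the record separator.
raw = '"dteday","hr"\t"2012-01-01","0"\t"2012-01-01","1"'

# Splitting on the declared record separator first parses correctly:
records = [next(csv.reader(io.StringIO(chunk))) for chunk in raw.split("\t")]
header, rows = records[0], records[1:]

# A newline-assuming parser sees one "row" spanning the entire file,
# so every value ends up inside the supposed header.
naive = next(csv.reader(io.StringIO(raw)))
```

With the separator honored, `header` is `["dteday", "hr"]` and `rows` holds the two data records; the naive parse instead yields a single four-field row mixing header and data.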
[jira] [Commented] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707856#comment-16707856 ] Stavros Kontopoulos commented on SPARK-26247: - Some pipelines export models in PMML (I know there is a long history about it, licensing, etc.), but wouldn't it be better to standardize on something? This would minimize a few of the issues mentioned in this quote: "Recreating and reimplementing model reading outside of the Spark project is costly and error prone, and over time is hard to keep in-sync with Spark changes over various Spark releases" > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > Attachments: SPIPMlModelExtensionForOnlineServing.pdf > > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707815#comment-16707815 ] Stavros Kontopoulos commented on SPARK-26247: - Thanks a lot! It would be better if you made it accessible, so we can add comments. > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > Attachments: SPIPMlModelExtensionForOnlineServing.pdf > > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707812#comment-16707812 ] Anne Holler commented on SPARK-26247: - Hi, [~skonto], I just attached a pdf of the document to this issue. [I am a google doc doofus, apparently, not sure why my attempt to publish the google doc for public access wasn't successful.] > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > Attachments: SPIPMlModelExtensionForOnlineServing.pdf > > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anne Holler updated SPARK-26247: Attachment: SPIPMlModelExtensionForOnlineServing.pdf > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > Attachments: SPIPMlModelExtensionForOnlineServing.pdf > > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707798#comment-16707798 ] Stavros Kontopoulos commented on SPARK-26247: - [~aholler] could you provide public access to the doc? > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707738#comment-16707738 ] Dongjoon Hyun commented on SPARK-26233: --- Thanks. I raised the priority because this is a correctness issue. > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Blocker > Labels: correctness > > Decimal values from Java beans are incorrectly scaled when used with > functions like first/last/max... > This problem came because Encoders.bean always set Decimal values as > _DecimalType(this.MAX_PRECISION(), 18)._ > Usually it's not a problem if you use numeric functions like *sum* but for > functions like *first*/*last*/*max*... it is a problem. > How to reproduce this error: > Using this class as an example: > {code:java} > public class Foo implements Serializable { > private String group; > private BigDecimal var; > public BigDecimal getVar() { > return var; > } > public void setVar(BigDecimal var) { > this.var = var; > } > public String getGroup() { > return group; > } > public void setGroup(String group) { > this.group = group; > } > } > {code} > > And a dummy code to create some objects: > {code:java} > Dataset ds = spark.range(5) > .map(l -> { > Foo foo = new Foo(); > foo.setGroup("" + l); > foo.setVar(BigDecimal.valueOf(l + 0.)); > return foo; > }, Encoders.bean(Foo.class)); > ds.printSchema(); > ds.show(); > +-+--+ > |group| var| > +-+--+ > | 0|0.| > | 1|1.| > | 2|2.| > | 3|3.| > | 4|4.| > +-+--+ > {code} > We can see that the DecimalType is precision 38 and 18 scale and all values > are show correctly. 
> But if we use a first function, they are scaled incorrectly: > {code:java} > ds.groupBy(col("group")) > .agg( > first("var") > ) > .show(); > +-+-+ > |group|first(var, false)| > +-+-+ > | 3| 3.E-14| > | 0| 1.111E-15| > | 1| 1.E-14| > | 4| 4.E-14| > | 2| 2.E-14| > +-+-+ > {code} > This incorrect behavior cannot be reproduced if we use "numerical "functions > like sum or if the column is cast a new Decimal Type. > {code:java} > ds.groupBy(col("group")) > .agg( > sum("var") > ) > .show(); > +-++ > |group| sum(var)| > +-++ > | 3|3.00| > | 0|0.00| > | 1|1.00| > | 4|4.00| > | 2|2.00| > +-++ > ds.groupBy(col("group")) > .agg( > first(col("var").cast(new DecimalType(38, 8))) > ) > .show(); > +-++ > |group|first(CAST(var AS DECIMAL(38,8)), false)| > +-++ > | 3| 3.| > | 0| 0.| > | 1| 1.| > | 4| 4.| > | 2| 2.| > +-++ > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
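The shape of the corruption above is consistent with a BigDecimal's unscaled integer being reinterpreted at the fixed scale of 18 that Encoders.bean assigns. A minimal pure-Python sketch (not Spark code; the exact bean values are truncated in this archive, so 0.1111 is assumed for illustration):

```python
from decimal import Decimal

def from_unscaled(unscaled: int, scale: int) -> Decimal:
    # A BigDecimal is a pair (unscaled integer, scale): value = unscaled * 10**-scale.
    return Decimal(unscaled).scaleb(-scale)

# An assumed bean value of 0.1111 has unscaled value 1111 at scale 4.
unscaled = 1111
correct = from_unscaled(unscaled, 4)   # 0.1111

# Reinterpreting the same unscaled integer at the fixed scale 18 shifts it
# fourteen orders of magnitude, matching the 1.111E-15-style results that
# first()/last()/max() produce in the report.
wrong = from_unscaled(unscaled, 18)    # 1.111E-15
```

This also suggests why sum() and an explicit cast look fine: both go through paths that respect (or rewrite) the column's declared scale instead of trusting the mismatched one.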
[jira] [Updated] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26233: -- Component/s: (was: Java API) SQL > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Blocker > Labels: correctness > > Decimal values from Java beans are incorrectly scaled when used with > functions like first/last/max... > This problem came because Encoders.bean always set Decimal values as > _DecimalType(this.MAX_PRECISION(), 18)._ > Usually it's not a problem if you use numeric functions like *sum* but for > functions like *first*/*last*/*max*... it is a problem. > How to reproduce this error: > Using this class as an example: > {code:java} > public class Foo implements Serializable { > private String group; > private BigDecimal var; > public BigDecimal getVar() { > return var; > } > public void setVar(BigDecimal var) { > this.var = var; > } > public String getGroup() { > return group; > } > public void setGroup(String group) { > this.group = group; > } > } > {code} > > And a dummy code to create some objects: > {code:java} > Dataset ds = spark.range(5) > .map(l -> { > Foo foo = new Foo(); > foo.setGroup("" + l); > foo.setVar(BigDecimal.valueOf(l + 0.)); > return foo; > }, Encoders.bean(Foo.class)); > ds.printSchema(); > ds.show(); > +-+--+ > |group| var| > +-+--+ > | 0|0.| > | 1|1.| > | 2|2.| > | 3|3.| > | 4|4.| > +-+--+ > {code} > We can see that the DecimalType is precision 38 and 18 scale and all values > are show correctly. 
> But if we use a first function, they are scaled incorrectly: > {code:java} > ds.groupBy(col("group")) > .agg( > first("var") > ) > .show(); > +-+-+ > |group|first(var, false)| > +-+-+ > | 3| 3.E-14| > | 0| 1.111E-15| > | 1| 1.E-14| > | 4| 4.E-14| > | 2| 2.E-14| > +-+-+ > {code} > This incorrect behavior cannot be reproduced if we use "numerical "functions > like sum or if the column is cast a new Decimal Type. > {code:java} > ds.groupBy(col("group")) > .agg( > sum("var") > ) > .show(); > +-++ > |group| sum(var)| > +-++ > | 3|3.00| > | 0|0.00| > | 1|1.00| > | 4|4.00| > | 2|2.00| > +-++ > ds.groupBy(col("group")) > .agg( > first(col("var").cast(new DecimalType(38, 8))) > ) > .show(); > +-++ > |group|first(CAST(var AS DECIMAL(38,8)), false)| > +-++ > | 3| 3.| > | 0| 0.| > | 1| 1.| > | 4| 4.| > | 2| 2.| > +-++ > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707729#comment-16707729 ] Marco Gaido commented on SPARK-26233: - [~dongjoon] I think so. SPARK-24957 was a long standing issue too, and the problem is analogous... > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Minor > > Decimal values from Java beans are incorrectly scaled when used with > functions like first/last/max... > This problem came because Encoders.bean always set Decimal values as > _DecimalType(this.MAX_PRECISION(), 18)._ > Usually it's not a problem if you use numeric functions like *sum* but for > functions like *first*/*last*/*max*... it is a problem. > How to reproduce this error: > Using this class as an example: > {code:java} > public class Foo implements Serializable { > private String group; > private BigDecimal var; > public BigDecimal getVar() { > return var; > } > public void setVar(BigDecimal var) { > this.var = var; > } > public String getGroup() { > return group; > } > public void setGroup(String group) { > this.group = group; > } > } > {code} > > And a dummy code to create some objects: > {code:java} > Dataset ds = spark.range(5) > .map(l -> { > Foo foo = new Foo(); > foo.setGroup("" + l); > foo.setVar(BigDecimal.valueOf(l + 0.)); > return foo; > }, Encoders.bean(Foo.class)); > ds.printSchema(); > ds.show(); > +-+--+ > |group| var| > +-+--+ > | 0|0.| > | 1|1.| > | 2|2.| > | 3|3.| > | 4|4.| > +-+--+ > {code} > We can see that the DecimalType is precision 38 and 18 scale and all values > are show correctly. 
> But if we use a first function, they are scaled incorrectly: > {code:java} > ds.groupBy(col("group")) > .agg( > first("var") > ) > .show(); > +-+-+ > |group|first(var, false)| > +-+-+ > | 3| 3.E-14| > | 0| 1.111E-15| > | 1| 1.E-14| > | 4| 4.E-14| > | 2| 2.E-14| > +-+-+ > {code} > This incorrect behavior cannot be reproduced if we use "numerical "functions > like sum or if the column is cast a new Decimal Type. > {code:java} > ds.groupBy(col("group")) > .agg( > sum("var") > ) > .show(); > +-++ > |group| sum(var)| > +-++ > | 3|3.00| > | 0|0.00| > | 1|1.00| > | 4|4.00| > | 2|2.00| > +-++ > ds.groupBy(col("group")) > .agg( > first(col("var").cast(new DecimalType(38, 8))) > ) > .show(); > +-++ > |group|first(CAST(var AS DECIMAL(38,8)), false)| > +-++ > | 3| 3.| > | 0| 0.| > | 1| 1.| > | 4| 4.| > | 2| 2.| > +-++ > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26258) Include JobGroupId in the callerContext
[ https://issues.apache.org/jira/browse/SPARK-26258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aihua Xu updated SPARK-26258: - Description: SPARK-15857 adds the support of callerContext for HDFS and Yarn. Currently, Spark callerContext prints Job Id, Stage Id, Task Id in the callerContext. It would be useful to include JobGroup Id as well since such JobGroup Id could be meaningful from the Spark Client point of view to group the jobs by JobGroup. callerContext=SPARK_CLIENT_application_1541467098185_0008 callerContext=SPARK_TASK_application_1541702365678_0005_1_JId_1_SId_3_0_TId_3_0 was: Currently, Spark callerContext prints Job Id, Stage Id, Task Id in the callerContext. It would be useful to include JobGroup Id as well since such JobGroup Id could be meaningful from the Spark Client point of view to group the jobs by JobGroup. callerContext=SPARK_CLIENT_application_1541467098185_0008 callerContext=SPARK_TASK_application_1541702365678_0005_1_JId_1_SId_3_0_TId_3_0 > Include JobGroupId in the callerContext > > > Key: SPARK-26258 > URL: https://issues.apache.org/jira/browse/SPARK-26258 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Aihua Xu >Priority: Major > > SPARK-15857 adds the support of callerContext for HDFS and Yarn. Currently, > Spark callerContext prints Job Id, Stage Id, Task Id in the callerContext. It > would be useful to include JobGroup Id as well since such JobGroup Id could > be meaningful from the Spark Client point of view to group the jobs by > JobGroup. > callerContext=SPARK_CLIENT_application_1541467098185_0008 > callerContext=SPARK_TASK_application_1541702365678_0005_1_JId_1_SId_3_0_TId_3_0 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
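A sketch of how an extended callerContext string carrying a JobGroup id might be composed, following the `SPARK_TASK_..._JId_..._SId_..._TId_...` layout of the examples above. The `JGId` segment and the function itself are hypothetical illustrations, not the actual Spark API:

```python
from typing import Optional

def build_caller_context(app_id: str, attempt: int, job_id: int,
                         job_group: Optional[str], stage_id: int,
                         stage_attempt: int, task_id: int,
                         task_attempt: int) -> str:
    # Mirrors the layout shown in the description; a hypothetical
    # _JGId_ segment carries the job group when one is set.
    parts = [f"SPARK_TASK_{app_id}_{attempt}"]
    if job_group is not None:
        parts.append(f"JGId_{job_group}")
    parts.append(f"JId_{job_id}")
    parts.append(f"SId_{stage_id}_{stage_attempt}")
    parts.append(f"TId_{task_id}_{task_attempt}")
    return "_".join(parts)
```

Without a job group this reproduces the existing format exactly; with one, consumers such as HDFS audit logs could group entries by the client-assigned JobGroup.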
[jira] [Updated] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26233: -- Priority: Blocker (was: Minor) > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Blocker > Labels: correctness > > Decimal values from Java beans are incorrectly scaled when used with > functions like first/last/max... > This problem came because Encoders.bean always set Decimal values as > _DecimalType(this.MAX_PRECISION(), 18)._ > Usually it's not a problem if you use numeric functions like *sum* but for > functions like *first*/*last*/*max*... it is a problem. > How to reproduce this error: > Using this class as an example: > {code:java} > public class Foo implements Serializable { > private String group; > private BigDecimal var; > public BigDecimal getVar() { > return var; > } > public void setVar(BigDecimal var) { > this.var = var; > } > public String getGroup() { > return group; > } > public void setGroup(String group) { > this.group = group; > } > } > {code} > > And a dummy code to create some objects: > {code:java} > Dataset ds = spark.range(5) > .map(l -> { > Foo foo = new Foo(); > foo.setGroup("" + l); > foo.setVar(BigDecimal.valueOf(l + 0.)); > return foo; > }, Encoders.bean(Foo.class)); > ds.printSchema(); > ds.show(); > +-+--+ > |group| var| > +-+--+ > | 0|0.| > | 1|1.| > | 2|2.| > | 3|3.| > | 4|4.| > +-+--+ > {code} > We can see that the DecimalType is precision 38 and 18 scale and all values > are show correctly. 
> But if we use a first function, they are scaled incorrectly: > {code:java} > ds.groupBy(col("group")) > .agg( > first("var") > ) > .show(); > +-+-+ > |group|first(var, false)| > +-+-+ > | 3| 3.E-14| > | 0| 1.111E-15| > | 1| 1.E-14| > | 4| 4.E-14| > | 2| 2.E-14| > +-+-+ > {code} > This incorrect behavior cannot be reproduced if we use "numerical "functions > like sum or if the column is cast a new Decimal Type. > {code:java} > ds.groupBy(col("group")) > .agg( > sum("var") > ) > .show(); > +-++ > |group| sum(var)| > +-++ > | 3|3.00| > | 0|0.00| > | 1|1.00| > | 4|4.00| > | 2|2.00| > +-++ > ds.groupBy(col("group")) > .agg( > first(col("var").cast(new DecimalType(38, 8))) > ) > .show(); > +-++ > |group|first(CAST(var AS DECIMAL(38,8)), false)| > +-++ > | 3| 3.| > | 0| 0.| > | 1| 1.| > | 4| 4.| > | 2| 2.| > +-++ > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26233: -- Labels: correctness (was: ) > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Minor > Labels: correctness > > Decimal values from Java beans are incorrectly scaled when used with > functions like first/last/max... > This problem came because Encoders.bean always set Decimal values as > _DecimalType(this.MAX_PRECISION(), 18)._ > Usually it's not a problem if you use numeric functions like *sum* but for > functions like *first*/*last*/*max*... it is a problem. > How to reproduce this error: > Using this class as an example: > {code:java} > public class Foo implements Serializable { > private String group; > private BigDecimal var; > public BigDecimal getVar() { > return var; > } > public void setVar(BigDecimal var) { > this.var = var; > } > public String getGroup() { > return group; > } > public void setGroup(String group) { > this.group = group; > } > } > {code} > > And a dummy code to create some objects: > {code:java} > Dataset ds = spark.range(5) > .map(l -> { > Foo foo = new Foo(); > foo.setGroup("" + l); > foo.setVar(BigDecimal.valueOf(l + 0.)); > return foo; > }, Encoders.bean(Foo.class)); > ds.printSchema(); > ds.show(); > +-+--+ > |group| var| > +-+--+ > | 0|0.| > | 1|1.| > | 2|2.| > | 3|3.| > | 4|4.| > +-+--+ > {code} > We can see that the DecimalType is precision 38 and 18 scale and all values > are show correctly. 
> But if we use a first function, they are scaled incorrectly: > {code:java} > ds.groupBy(col("group")) > .agg( > first("var") > ) > .show(); > +-+-+ > |group|first(var, false)| > +-+-+ > | 3| 3.E-14| > | 0| 1.111E-15| > | 1| 1.E-14| > | 4| 4.E-14| > | 2| 2.E-14| > +-+-+ > {code} > This incorrect behavior cannot be reproduced if we use "numerical "functions > like sum or if the column is cast a new Decimal Type. > {code:java} > ds.groupBy(col("group")) > .agg( > sum("var") > ) > .show(); > +-++ > |group| sum(var)| > +-++ > | 3|3.00| > | 0|0.00| > | 1|1.00| > | 4|4.00| > | 2|2.00| > +-++ > ds.groupBy(col("group")) > .agg( > first(col("var").cast(new DecimalType(38, 8))) > ) > .show(); > +-++ > |group|first(CAST(var AS DECIMAL(38,8)), false)| > +-++ > | 3| 3.| > | 0| 0.| > | 1| 1.| > | 4| 4.| > | 2| 2.| > +-++ > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707717#comment-16707717 ] Dongjoon Hyun commented on SPARK-26233: --- Hi, [~mcanes] and [~mgaido]. This seems to exist for a long time, right? > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Minor > > Decimal values from Java beans are incorrectly scaled when used with > functions like first/last/max... > This problem came because Encoders.bean always set Decimal values as > _DecimalType(this.MAX_PRECISION(), 18)._ > Usually it's not a problem if you use numeric functions like *sum* but for > functions like *first*/*last*/*max*... it is a problem. > How to reproduce this error: > Using this class as an example: > {code:java} > public class Foo implements Serializable { > private String group; > private BigDecimal var; > public BigDecimal getVar() { > return var; > } > public void setVar(BigDecimal var) { > this.var = var; > } > public String getGroup() { > return group; > } > public void setGroup(String group) { > this.group = group; > } > } > {code} > > And a dummy code to create some objects: > {code:java} > Dataset ds = spark.range(5) > .map(l -> { > Foo foo = new Foo(); > foo.setGroup("" + l); > foo.setVar(BigDecimal.valueOf(l + 0.)); > return foo; > }, Encoders.bean(Foo.class)); > ds.printSchema(); > ds.show(); > +-+--+ > |group| var| > +-+--+ > | 0|0.| > | 1|1.| > | 2|2.| > | 3|3.| > | 4|4.| > +-+--+ > {code} > We can see that the DecimalType is precision 38 and 18 scale and all values > are show correctly. 
> But if we use a first function, they are scaled incorrectly: > {code:java} > ds.groupBy(col("group")) > .agg( > first("var") > ) > .show(); > +-+-+ > |group|first(var, false)| > +-+-+ > | 3| 3.E-14| > | 0| 1.111E-15| > | 1| 1.E-14| > | 4| 4.E-14| > | 2| 2.E-14| > +-+-+ > {code} > This incorrect behavior cannot be reproduced if we use "numerical "functions > like sum or if the column is cast a new Decimal Type. > {code:java} > ds.groupBy(col("group")) > .agg( > sum("var") > ) > .show(); > +-++ > |group| sum(var)| > +-++ > | 3|3.00| > | 0|0.00| > | 1|1.00| > | 4|4.00| > | 2|2.00| > +-++ > ds.groupBy(col("group")) > .agg( > first(col("var").cast(new DecimalType(38, 8))) > ) > .show(); > +-++ > |group|first(CAST(var AS DECIMAL(38,8)), false)| > +-++ > | 3| 3.| > | 0| 0.| > | 1| 1.| > | 4| 4.| > | 2| 2.| > +-++ > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26233: -- Affects Version/s: 2.0.2 > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Minor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26233: -- Affects Version/s: 2.1.3 > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Minor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26233: -- Affects Version/s: 2.2.2 > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Minor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26258) Include JobGroupId in the callerContext
Aihua Xu created SPARK-26258: Summary: Include JobGroupId in the callerContext Key: SPARK-26258 URL: https://issues.apache.org/jira/browse/SPARK-26258 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.0 Reporter: Aihua Xu Currently, Spark prints the Job Id, Stage Id, and Task Id in the callerContext. It would be useful to include the JobGroup Id as well, since it is meaningful from the Spark client's point of view for grouping jobs by JobGroup. callerContext=SPARK_CLIENT_application_1541467098185_0008 callerContext=SPARK_TASK_application_1541702365678_0005_1_JId_1_SId_3_0_TId_3_0 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
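A sketch of what the extended task context string might look like. The job-group field name ("JGId") and its position are assumptions; only the existing layout is taken from the example strings above:

```java
public class CallerContextSketch {
    // Builds a task callerContext; passing a non-null jobGroup appends the
    // proposed (hypothetical) JGId field, otherwise the current format is kept.
    static String taskContext(String appId, int attempt, String jobGroup,
                              int jobId, int stageId, int stageAttempt,
                              int taskId, int taskAttempt) {
        StringBuilder sb = new StringBuilder("SPARK_TASK_")
            .append(appId).append('_').append(attempt);
        if (jobGroup != null) {
            sb.append("_JGId_").append(jobGroup);   // proposed new field (name assumed)
        }
        sb.append("_JId_").append(jobId)
          .append("_SId_").append(stageId).append('_').append(stageAttempt)
          .append("_TId_").append(taskId).append('_').append(taskAttempt);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Without a job group this reproduces the example string from the issue:
        System.out.println(taskContext("application_1541702365678_0005", 1, null, 1, 3, 0, 3, 0));
        // prints SPARK_TASK_application_1541702365678_0005_1_JId_1_SId_3_0_TId_3_0
    }
}
```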
[jira] [Created] (SPARK-26257) SPIP: Interop Support for Spark Language Extensions
Tyson Condie created SPARK-26257: Summary: SPIP: Interop Support for Spark Language Extensions Key: SPARK-26257 URL: https://issues.apache.org/jira/browse/SPARK-26257 Project: Spark Issue Type: Improvement Components: PySpark, R, Spark Core Affects Versions: 2.4.0 Reporter: Tyson Condie h2. ** Background and Motivation: There is a desire for third party language extensions for Apache Spark. Some notable examples include: * C#/F# from project Mobius [https://github.com/Microsoft/Mobius] * Haskell from project sparkle [https://github.com/tweag/sparkle] * Julia from project Spark.jl [https://github.com/dfdx/Spark.jl] Presently, Apache Spark supports Python and R via a tightly integrated interop layer. It would seem that much of that existing interop layer could be refactored into a clean surface for general (third party) language bindings, such as the above mentioned. More specifically, could we generalize the following modules: * Deploy runners (e.g., PythonRunner and RRunner) * DataFrame Executors * RDD operations? The last being questionable: integrating third party language extensions at the RDD level may be too heavy-weight and unnecessary given the preference towards the Dataframe abstraction. The main goals of this effort would be: * Provide a clean abstraction for third party language extensions making it easier to maintain (the language extension) with the evolution of Apache Spark * Provide guidance to third party language authors on how a language extension should be implemented * Provide general reusable libraries that are not specific to any language extension * Open the door to developers that prefer alternative languages * Identify and clean up common code shared between Python and R interops h2. Target Personas: Data Scientists, Data Engineers, Library Developers h2. Goals: Data scientists and engineers will have the opportunity to work with Spark in languages other than what’s natively supported. 
Library developers will be able to create language extensions for Spark in a clean way. The interop layer should also provide guidance for developing language extensions. h2. Non-Goals: The proposal does not aim to create an actual language extension. Rather, it aims to provide a stable interop layer for third party language extensions to dock. h2. Proposed API Changes: Much of the work will involve generalizing existing interop APIs for PySpark and R, specifically for the Dataframe API. For instance, it would be good to have a general deploy.Runner (similar to PythonRunner) for language extension efforts. In Spark SQL, it would be good to have a general InteropUDF and evaluator (similar to BatchEvalPythonExec). Low-level RDD operations should not be needed in this initial offering; depending on the success of the interop layer and with proper demand, RDD interop could be added later. However, one open question is supporting a subset of low-level functions that are core to ETL e.g., transform. h2. Optional Design Sketch: The work would be broken down into two top-level phases: Phase 1: Introduce general interop API for deploying a driver/application, running an interop UDF along with any other low-level transformations that aid with ETL. Phase 2: Port existing Python and R language extensions to the new interop layer. This port should be contained solely to the Spark core side, and all protocols specific to Python and R should not change e.g., Python should continue to use py4j as the protocol between the Python process and core Spark. The port itself should be contained to a handful of files e.g., some examples for Python: PythonRunner, BatchEvalPythonExec, +PythonUDFRunner+, PythonRDD (possibly), and will mostly involve refactoring common logic into abstract implementations and utilities. h2.
Optional Rejected Designs: The clear alternative is the status quo; developers who want to provide a third-party language extension to Spark do so directly, often by extending existing Python classes and overriding the portions that are relevant to the new extension. Not only is this not sound code (e.g., a JuliaRDD is not a PythonRDD, which contains a lot of reusable code), but it runs the great risk of future revisions making the subclass implementation obsolete. It would be hard to imagine that any third-party language extension would be successful if there was not something in place to guarantee its long-term maintainability. Another alternative is that third-party languages should only interact with Spark via pure SQL, possibly via REST. However, this does not enable UDFs written in the third-party language; a key desideratum in this effort, which most notably takes the form of legacy code/UDFs that would need to be ported to a supported language e.g., Scala. This exercise is extremely cumbersome and not
[jira] [Commented] (SPARK-26254) Move delegation token providers into a separate project
[ https://issues.apache.org/jira/browse/SPARK-26254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707679#comment-16707679 ] Steve Loughran commented on SPARK-26254: +HBase I don't have any opinions on the best place; people who know the Spark packaging are the ones there. And people deploying to other infras than YARN will have their opinions too. Token loading can be fairly brittle to classpath problems (HADOOP-15808); it's good not to trust everything to be well-configured. > Move delegation token providers into a separate project > --- > > Key: SPARK-26254 > URL: https://issues.apache.org/jira/browse/SPARK-26254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > There was a discussion in > [PR#22598|https://github.com/apache/spark/pull/22598] that there are several > provided dependencies inside core project which shouldn't be there (for ex. > hive and kafka). This jira is to solve this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25515) Add a config property for disabling auto deletion of PODS for debugging.
[ https://issues.apache.org/jira/browse/SPARK-25515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yinan Li resolved SPARK-25515. -- Resolution: Fixed Fix Version/s: 3.0.0 > Add a config property for disabling auto deletion of PODS for debugging. > > > Key: SPARK-25515 > URL: https://issues.apache.org/jira/browse/SPARK-25515 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Prashant Sharma >Priority: Major > Fix For: 3.0.0 > > > Currently, if a pod fails to start due to some failure, it gets removed and > a new one is attempted. This sequence of events goes on until the app is > killed. Given the speed of creation and deletion, it becomes difficult to > debug the reason for failure. > So adding a configuration parameter to disable auto-deletion of pods will be > helpful for debugging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
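For readers hitting this today, a sketch of how such a switch would be used at submit time. The property name below is taken from the Spark 3.0-era change for this issue and should be treated as an assumption on other builds:

```shell
# Sketch: keep terminated/failed executor pods around for post-mortem debugging.
# Property name assumed from the SPARK-25515 fix (Spark 3.0).
spark-submit \
  --master k8s://https://<api-server-host>:6443 \
  --conf spark.kubernetes.executor.deleteOnTermination=false \
  ...
```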
[jira] [Comment Edited] (SPARK-26213) Custom Receiver for Structured streaming
[ https://issues.apache.org/jira/browse/SPARK-26213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705886#comment-16705886 ] Gabor Somogyi edited comment on SPARK-26213 at 12/3/18 4:27 PM: On the other hand looks like this is not a feature suggestion but a question to the community. In such cases you can write a mail to [d...@spark.apache.org|http://apache-spark-developers-list.1001551.n3.nabble.com/]. [~aarthipa] If you agree please close the jira. was (Author: gsomogyi): On the other hand looks like this is not a feature suggestion but a question to the community. In such cases you can write a mail to [d...@spark.apache.org|http://apache-spark-developers-list.1001551.n3.nabble.com/]. If you agree please close the jira. > Custom Receiver for Structured streaming > > > Key: SPARK-26213 > URL: https://issues.apache.org/jira/browse/SPARK-26213 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Aarthi >Priority: Major > > Hi, > I have implemented a Custom Receiver for a https/json data source by > implementing the Receiver abstract class as provided in the documentation > here [https://spark.apache.org/docs/latest//streaming-custom-receivers.html] > This approach works on Spark streaming context, where the custom receiver > class is passed to receiverStream. However, I would like to implement the > same for Structured streaming, as each of the DStreams has a complex > structure and needs to be joined with the others based on complex rules. > ([https://stackoverflow.com/questions/53449599/join-two-spark-dstreams-with-complex-nested-structure]) > Structured streaming uses the Spark Session object that takes in a > DataStreamReader, which is a final class. Please advise on how to implement > the custom receiver for Structured Streaming. 
> Thanks, -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26233: Assignee: Apache Spark > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.1, 2.4.0 >Reporter: Miquel >Assignee: Apache Spark >Priority: Minor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707465#comment-16707465 ] Apache Spark commented on SPARK-26233: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/23210 > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Minor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26233: Assignee: (was: Apache Spark) > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Minor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25498) Fix SQLQueryTestSuite failures when the interpreter mode enabled
[ https://issues.apache.org/jira/browse/SPARK-25498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25498. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22512 [https://github.com/apache/spark/pull/22512] > Fix SQLQueryTestSuite failures when the interpreter mode enabled > > > Key: SPARK-25498 > URL: https://issues.apache.org/jira/browse/SPARK-25498 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25498) Fix SQLQueryTestSuite failures when the interpreter mode enabled
[ https://issues.apache.org/jira/browse/SPARK-25498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25498: --- Assignee: Takeshi Yamamuro > Fix SQLQueryTestSuite failures when the interpreter mode enabled > > > Key: SPARK-25498 > URL: https://issues.apache.org/jira/browse/SPARK-25498 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26235) Change log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit to Error
[ https://issues.apache.org/jira/browse/SPARK-26235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26235. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23189 [https://github.com/apache/spark/pull/23189] > Change log level for ClassNotFoundException/NoClassDefFoundError in > SparkSubmit to Error > > > Key: SPARK-26235 > URL: https://issues.apache.org/jira/browse/SPARK-26235 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Trivial > Fix For: 3.0.0 > > > In my local setup, I set the log4j root category to ERROR > (https://stackoverflow.com/questions/27781187/how-to-stop-info-messages-displaying-on-spark-console > , the first item that shows up if we google "set spark log level".) > When I run a command such as > ``` > spark-submit --class foo bar.jar > ``` > nothing shows up, and the script exits. > After quick investigation, I think the log level for > ClassNotFoundException/NoClassDefFoundError in SparkSubmit should be ERROR > instead of WARN. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26235) Change log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit to Error
[ https://issues.apache.org/jira/browse/SPARK-26235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26235: - Assignee: Gengliang Wang > Change log level for ClassNotFoundException/NoClassDefFoundError in > SparkSubmit to Error > > > Key: SPARK-26235 > URL: https://issues.apache.org/jira/browse/SPARK-26235 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Trivial > Fix For: 3.0.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26181) the `hasMinMaxStats` method of `ColumnStatsMap` is not correct
[ https://issues.apache.org/jira/browse/SPARK-26181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26181: --- Assignee: Adrian Wang > the `hasMinMaxStats` method of `ColumnStatsMap` is not correct > -- > > Key: SPARK-26181 > URL: https://issues.apache.org/jira/browse/SPARK-26181 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Adrian Wang >Assignee: Adrian Wang >Priority: Major > Fix For: 2.4.1, 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26181) the `hasMinMaxStats` method of `ColumnStatsMap` is not correct
[ https://issues.apache.org/jira/browse/SPARK-26181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26181. - Resolution: Fixed Fix Version/s: 2.4.1 3.0.0 Issue resolved by pull request 23152 [https://github.com/apache/spark/pull/23152] > the `hasMinMaxStats` method of `ColumnStatsMap` is not correct > -- > > Key: SPARK-26181 > URL: https://issues.apache.org/jira/browse/SPARK-26181 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Adrian Wang >Assignee: Adrian Wang >Priority: Major > Fix For: 3.0.0, 2.4.1 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26173) Prior regularization for Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-26173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Facundo Bellosi updated SPARK-26173: Description: This feature enables Maximum A Posteriori (MAP) optimization for Logistic Regression based on a Gaussian prior. In practice, this is just implementing a more general form of L2 regularization parameterized by a (multivariate) mean and precisions (inverse of variance) vectors. Prior regularization is calculated through the following formula: !Prior regularization.png! where: * λ: regularization parameter ({{regParam}}) * K: number of coefficients (weights vector length) * w~i~ with prior Normal(μ~i~, β~i~^2^) _Reference: Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning (section 4.5). Berlin, Heidelberg: Springer-Verlag._ h3. Existing implementations * Python: [bayes_logistic|https://pypi.org/project/bayes_logistic/] h2. Implementation * 2 new parameters added to {{LogisticRegression}}: {{priorMean}} and {{priorPrecisions}}. * 1 new class ({{PriorRegularization}}) implements the calculations of the value and gradient of the prior regularization term. * Prior regularization is enabled when both vectors are provided and {{regParam}} > 0 and {{elasticNetParam}} < 1. h2. 
Tests * {{DifferentiableRegularizationSuite}} ** {{Prior regularization}} * {{LogisticRegressionSuite}} ** {{prior precisions should be required when prior mean is set}} ** {{prior mean should be required when prior precisions is set}} ** {{`regParam` should be positive when using prior regularization}} ** {{`elasticNetParam` should be less than 1.0 when using prior regularization}} ** {{prior mean and precisions should have equal length}} ** {{priors' length should match number of features}} ** {{binary logistic regression with prior regularization equivalent to L2}} ** {{binary logistic regression with prior regularization equivalent to L2 (bis)}} ** {{binary logistic regression with prior regularization}} was: This feature enables Maximum A Posteriori (MAP) optimization for Logistic Regression based on a Gaussian prior. In practice, this is just implementing a more general form of L2 regularization parameterized by a (multivariate) mean and precisions (inverse of variance) vectors. Prior regularization is calculated through the following formula: !Prior regularization.png! where: * λ: regularization parameter ({{regParam}}) * K: number of coefficients (weights vector length) * w~i~ with prior Normal(μ~i~, β~i~^2^) _Reference: Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning (section 4.5). Berlin, Heidelberg: Springer-Verlag._ h2. Implementation * 2 new parameters added to {{LogisticRegression}}: {{priorMean}} and {{priorPrecisions}}. * 1 new class ({{PriorRegularization}}) implements the calculations of the value and gradient of the prior regularization term. * Prior regularization is enabled when both vectors are provided and {{regParam}} > 0 and {{elasticNetParam}} < 1. h2. 
Tests * {{DifferentiableRegularizationSuite}} ** {{Prior regularization}} * {{LogisticRegressionSuite}} ** {{prior precisions should be required when prior mean is set}} ** {{prior mean should be required when prior precisions is set}} ** {{`regParam` should be positive when using prior regularization}} ** {{`elasticNetParam` should be less than 1.0 when using prior regularization}} ** {{prior mean and precisions should have equal length}} ** {{priors' length should match number of features}} ** {{binary logistic regression with prior regularization equivalent to L2}} ** {{binary logistic regression with prior regularization equivalent to L2 (bis)}} ** {{binary logistic regression with prior regularization}} > Prior regularization for Logistic Regression > > > Key: SPARK-26173 > URL: https://issues.apache.org/jira/browse/SPARK-26173 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.4.0 >Reporter: Facundo Bellosi >Priority: Minor > Attachments: Prior regularization.png > > > This feature enables Maximum A Posteriori (MAP) optimization for Logistic > Regression based on a Gaussian prior. In practice, this is just implementing > a more general form of L2 regularization parameterized by a (multivariate) > mean and precisions (inverse of variance) vectors. > Prior regularization is calculated through the following formula: > !Prior regularization.png! > where: > * λ: regularization parameter ({{regParam}}) > * K: number of coefficients (weights vector length) > * w~i~ with prior Normal(μ~i~, β~i~^2^) > _Reference: Bishop, Christopher M. (2006). Pattern Recognition and Machine > Learning (section 4.5). Berlin, Heidelberg: Springer-Verlag._ > h3. Existing
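The prior-regularization term described above (a generalized L2 penalty parameterized by a prior mean and precisions vector) can be sketched as follows. This is a minimal illustration, assuming the penalty takes the form (regParam / 2) · Σᵢ pᵢ (wᵢ − μᵢ)², with gradient regParam · pᵢ (wᵢ − μᵢ); the function name and exact scaling are assumptions, not the actual {{PriorRegularization}} implementation:

```python
import numpy as np

def prior_regularization(w, prior_mean, prior_precisions, reg_param):
    """Value and gradient of a Gaussian-prior (MAP) penalty.

    Assumed form: (reg_param / 2) * sum_i p_i * (w_i - mu_i)^2.
    With mu = 0 and unit precisions this reduces to plain L2
    regularization, (reg_param / 2) * ||w||^2, which is what the
    "equivalent to L2" tests above would exercise.
    """
    diff = w - prior_mean
    value = 0.5 * reg_param * np.sum(prior_precisions * diff ** 2)
    gradient = reg_param * prior_precisions * diff
    return value, gradient

# Zero mean and unit precisions: the penalty matches standard L2.
w = np.array([1.0, -2.0, 3.0])
value, grad = prior_regularization(w, np.zeros(3), np.ones(3), 0.1)
```

A non-zero prior mean pulls the coefficients toward μ instead of toward zero, and per-coefficient precisions weight how strongly each coefficient is pulled.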
[jira] [Assigned] (SPARK-26256) Add proper labels when deleting pods
[ https://issues.apache.org/jira/browse/SPARK-26256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26256: Assignee: Apache Spark > Add proper labels when deleting pods > > > Key: SPARK-26256 > URL: https://issues.apache.org/jira/browse/SPARK-26256 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Assignee: Apache Spark >Priority: Major > > As discussed here: > [https://github.com/apache/spark/pull/23136#discussion_r236463330] > we need to add proper labels to avoid killing executors belonging to other > jobs.
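The fix described above amounts to scoping pod deletion by labels so that only this application's executors are removed. A minimal sketch of that selection logic, assuming pods carry the usual Spark-on-Kubernetes labels (`spark-app-selector`, `spark-role`) — the function and its shape are illustrative, not the actual patch:

```python
def pods_to_delete(pods, app_id):
    """Select only executor pods belonging to the given application.

    `pods` is a list of (name, labels) pairs, where labels is a dict.
    Filtering on both the app id and the executor role prevents
    deleting executors (or drivers) that belong to other jobs.
    """
    return [
        name
        for name, labels in pods
        if labels.get("spark-app-selector") == app_id
        and labels.get("spark-role") == "executor"
    ]

# Pods from two different applications plus a driver pod:
pods = [
    ("exec-1", {"spark-app-selector": "app-a", "spark-role": "executor"}),
    ("exec-2", {"spark-app-selector": "app-b", "spark-role": "executor"}),
    ("driver-a", {"spark-app-selector": "app-a", "spark-role": "driver"}),
]
selected = pods_to_delete(pods, "app-a")
```

In a real cluster the same effect is achieved with a label selector on the list/delete call itself rather than client-side filtering.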
[jira] [Commented] (SPARK-25530) data source v2 API refactor (batch write)
[ https://issues.apache.org/jira/browse/SPARK-25530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707365#comment-16707365 ] Apache Spark commented on SPARK-25530: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/23208 > data source v2 API refactor (batch write) > - > > Key: SPARK-25530 > URL: https://issues.apache.org/jira/browse/SPARK-25530 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Adjust the batch write API to match the read API after refactor
[jira] [Assigned] (SPARK-26256) Add proper labels when deleting pods
[ https://issues.apache.org/jira/browse/SPARK-26256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26256: Assignee: (was: Apache Spark) > Add proper labels when deleting pods > > > Key: SPARK-26256 > URL: https://issues.apache.org/jira/browse/SPARK-26256 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Priority: Major > > As discussed here: > [https://github.com/apache/spark/pull/23136#discussion_r236463330] > we need to add proper labels to avoid killing executors belonging to other > jobs.
[jira] [Commented] (SPARK-26256) Add proper labels when deleting pods
[ https://issues.apache.org/jira/browse/SPARK-26256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707370#comment-16707370 ] Apache Spark commented on SPARK-26256: -- User 'skonto' has created a pull request for this issue: https://github.com/apache/spark/pull/23209 > Add proper labels when deleting pods > > > Key: SPARK-26256 > URL: https://issues.apache.org/jira/browse/SPARK-26256 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Priority: Major > > As discussed here: > [https://github.com/apache/spark/pull/23136#discussion_r236463330] > we need to add proper labels to avoid killing executors belonging to other > jobs.
[jira] [Commented] (SPARK-25530) data source v2 API refactor (batch write)
[ https://issues.apache.org/jira/browse/SPARK-25530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707361#comment-16707361 ] Apache Spark commented on SPARK-25530: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/23208 > data source v2 API refactor (batch write) > - > > Key: SPARK-25530 > URL: https://issues.apache.org/jira/browse/SPARK-25530 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Adjust the batch write API to match the read API after refactor
[jira] [Assigned] (SPARK-25530) data source v2 API refactor (batch write)
[ https://issues.apache.org/jira/browse/SPARK-25530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25530: Assignee: (was: Apache Spark) > data source v2 API refactor (batch write) > - > > Key: SPARK-25530 > URL: https://issues.apache.org/jira/browse/SPARK-25530 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Adjust the batch write API to match the read API after refactor
[jira] [Assigned] (SPARK-25530) data source v2 API refactor (batch write)
[ https://issues.apache.org/jira/browse/SPARK-25530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25530: Assignee: Apache Spark > data source v2 API refactor (batch write) > - > > Key: SPARK-25530 > URL: https://issues.apache.org/jira/browse/SPARK-25530 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > > Adjust the batch write API to match the read API after refactor
[jira] [Updated] (SPARK-25530) data source v2 API refactor (batch write)
[ https://issues.apache.org/jira/browse/SPARK-25530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-25530: Summary: data source v2 API refactor (batch write) (was: data source v2 write side API refactoring) > data source v2 API refactor (batch write) > - > > Key: SPARK-25530 > URL: https://issues.apache.org/jira/browse/SPARK-25530 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > refactor the write side API according to this abstraction > {code} > batch: catalog -> table -> write > streaming: catalog -> table -> stream -> write > {code}
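The `catalog -> table -> write` layering quoted above can be sketched as a chain of builders, where streaming inserts one extra hop between the table and the write. This is a hypothetical illustration of the abstraction only; the class and method names below are made up and do not correspond to Spark's actual DataSource V2 interfaces:

```python
class Catalog:
    """Resolves table names to Table objects (illustrative only)."""
    def __init__(self, tables):
        self._tables = tables

    def load_table(self, name):
        return self._tables[name]


class Table:
    def __init__(self, name):
        self.name = name

    def new_write_builder(self):
        # batch path: catalog -> table -> write
        return WriteBuilder(self)

    def new_stream(self):
        # streaming path adds one hop: catalog -> table -> stream -> write
        return Stream(self)


class Stream:
    def __init__(self, table):
        self.table = table

    def new_write_builder(self):
        return WriteBuilder(self.table)


class WriteBuilder:
    def __init__(self, table):
        self.table = table

    def build(self):
        return f"write({self.table.name})"


catalog = Catalog({"t": Table("t")})
batch_write = catalog.load_table("t").new_write_builder().build()
stream_write = catalog.load_table("t").new_stream().new_write_builder().build()
```

The point of the refactor is that both paths end at the same write abstraction, reached through the same table object.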
[jira] [Updated] (SPARK-25530) data source v2 API refactor (batch write)
[ https://issues.apache.org/jira/browse/SPARK-25530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-25530: Description: Adjust the batch write API to match the read API after refactor > data source v2 API refactor (batch write) > - > > Key: SPARK-25530 > URL: https://issues.apache.org/jira/browse/SPARK-25530 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Adjust the batch write API to match the read API after refactor
[jira] [Updated] (SPARK-25530) data source v2 API refactor (batch write)
[ https://issues.apache.org/jira/browse/SPARK-25530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-25530: Description: (was: refactor the write side API according to this abstraction {code} batch: catalog -> table -> write streaming: catalog -> table -> stream -> write {code}) > data source v2 API refactor (batch write) > - > > Key: SPARK-25530 > URL: https://issues.apache.org/jira/browse/SPARK-25530 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major >
[jira] [Resolved] (SPARK-26253) Task Summary Metrics Table on Stage Page shows empty table when no data is present
[ https://issues.apache.org/jira/browse/SPARK-26253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26253. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23205 [https://github.com/apache/spark/pull/23205] > Task Summary Metrics Table on Stage Page shows empty table when no data is > present > -- > > Key: SPARK-26253 > URL: https://issues.apache.org/jira/browse/SPARK-26253 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Parth Gandhi >Assignee: Parth Gandhi >Priority: Minor > Fix For: 3.0.0 > > > Task Summary Metrics Table on Stage Page shows empty table when no data is > present instead of showing a message.
[jira] [Assigned] (SPARK-26253) Task Summary Metrics Table on Stage Page shows empty table when no data is present
[ https://issues.apache.org/jira/browse/SPARK-26253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26253: - Assignee: Parth Gandhi This should just be a follow-up to SPARK-21089, not a new issue. > Task Summary Metrics Table on Stage Page shows empty table when no data is > present > -- > > Key: SPARK-26253 > URL: https://issues.apache.org/jira/browse/SPARK-26253 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Parth Gandhi >Assignee: Parth Gandhi >Priority: Minor > > Task Summary Metrics Table on Stage Page shows empty table when no data is > present instead of showing a message.