[jira] [Updated] (SPARK-26260) Task Summary Metrics for Stage Page: Efficient implementation for SHS when using disk store.
[ https://issues.apache.org/jira/browse/SPARK-26260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shahid updated SPARK-26260:
---------------------------
    Summary: Task Summary Metrics for Stage Page: Efficient implementation for SHS when using disk store.  (was: Summary Task Metrics for Stage Page: Efficient implementation for SHS when using disk store.)

> Task Summary Metrics for Stage Page: Efficient implementation for SHS when
> using disk store.
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-26260
>                 URL: https://issues.apache.org/jira/browse/SPARK-26260
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: shahid
>            Priority: Major
>
> Currently, task summary metrics are calculated over all tasks instead of
> only the successful ones.
> Since SPARK-26119 (https://issues.apache.org/jira/browse/SPARK-26119), the
> in-memory store computes task summary metrics over only the successful
> tasks. We still need an efficient implementation for the disk store case in
> the SHS; the main bottleneck for the disk store is deserialization overhead.
> Hints: Rework the way indexing works so that we can index successful and
> failed tasks by specific metrics separately (would be tricky). This would
> also require bumping the disk store version (to invalidate old stores).
> Or any other efficient solution.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
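The efficiency problem above is concrete: summary quantiles are taken over successful tasks only, so a store that can serve pre-sorted metric values per task status avoids deserializing every task just to filter by status. A minimal sketch of the quantile step; `TaskData` and its fields are illustrative stand-ins, not Spark's real classes:

```scala
// Illustrative model of one task row (NOT Spark's TaskData).
case class TaskData(taskId: Long, status: String, duration: Long)

// Summary quantiles over successful tasks only. With a disk store, the
// filter + sort below is the expensive part (every task must be read and
// deserialized), which is why an index keyed by (status, metric) helps:
// the store could hand back the sorted successful-task values directly.
def summaryQuantiles(tasks: Seq[TaskData], quantiles: Seq[Double]): Seq[Long] = {
  val sorted = tasks.filter(_.status == "SUCCESS").map(_.duration).sorted
  quantiles.map { q =>
    // clamp the index so q = 1.0 maps to the last element
    val idx = math.min((q * sorted.length).toInt, sorted.length - 1)
    sorted(idx)
  }
}
```

With a per-(status, metric) index, the `filter`/`sorted` work would already be done by the store, leaving only index reads at quantile positions.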
[jira] [Updated] (SPARK-26260) Summary Task Metrics for Stage Page: Efficient implementation for SHS when using disk store.
[ https://issues.apache.org/jira/browse/SPARK-26260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shahid updated SPARK-26260:
---------------------------
    Summary: Summary Task Metrics for Stage Page: Efficient implementation for SHS when using disk store.  (was: Summary Task Metrics for Stage Page: Efficient implimentation for SHS when using disk store.)

> Summary Task Metrics for Stage Page: Efficient implementation for SHS when
> using disk store.
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-26260
>                 URL: https://issues.apache.org/jira/browse/SPARK-26260
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: shahid
>            Priority: Major
>
> Currently, task summary metrics are calculated over all tasks instead of
> only the successful ones.
> Since SPARK-26119 (https://issues.apache.org/jira/browse/SPARK-26119), the
> in-memory store computes task summary metrics over only the successful
> tasks. We still need an efficient implementation for the disk store case in
> the SHS; the main bottleneck for the disk store is deserialization overhead.
> Hints: Rework the way indexing works so that we can index successful and
> failed tasks by specific metrics separately (would be tricky). This would
> also require bumping the disk store version (to invalidate old stores).
> Or any other efficient solution.
[jira] [Resolved] (SPARK-25573) Combine resolveExpression and resolve in the rule ResolveReferences
[ https://issues.apache.org/jira/browse/SPARK-25573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-25573.
-----------------------------
    Resolution: Fixed
      Assignee: Dilip Biswal
 Fix Version/s: 3.0.0

> Combine resolveExpression and resolve in the rule ResolveReferences
> -------------------------------------------------------------------
>
>                 Key: SPARK-25573
>                 URL: https://issues.apache.org/jira/browse/SPARK-25573
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Xiao Li
>            Assignee: Dilip Biswal
>            Priority: Major
>             Fix For: 3.0.0
>
> In the rule ResolveReferences, the two private functions `resolve` and
> `resolveExpression` should be combined.
[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708243#comment-16708243 ]

Apache Spark commented on SPARK-26155:
--------------------------------------

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/23214

> Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS
> in 3TB scale
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-26155
>                 URL: https://issues.apache.org/jira/browse/SPARK-26155
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>            Reporter: Ke Jia
>            Priority: Major
>       Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis in Spark2.3 without L486&487.pdf, q19.sql
>
> In our test environment, we found a serious performance degradation in
> Spark 2.3 when running TPC-DS on SKX 8180; several queries regress badly.
> For example, TPC-DS Q19 needs 126 seconds with Spark 2.3 on 3TB data, while
> it needs only 29 seconds with Spark 2.1. We investigated this problem and
> found the root cause in community patch SPARK-21052, which adds metrics to
> the hash join process. The impacted code is
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds
> with them.
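The two cited lines follow a common pattern: a metric counter is bumped once per probed row inside the join's hot loop, so the bookkeeping cost is paid on every lookup. A hedged sketch of that shape; `LookupMetrics` and `probe` are illustrative names, not Spark's `HashedRelation` API:

```scala
// Illustrative metrics holder, mirroring the two counters SPARK-21052
// added (key lookups and hash collisions). Not Spark's real class.
final class LookupMetrics {
  var keyLookups: Long = 0L
  var hashCollisions: Long = 0L // second counter from the patch; unused here
}

// The increment runs once per probed row. In a join that probes billions
// of rows, even this single add in the hot path is measurable, which is
// the shape of the regression reported above.
def probe(table: Map[Int, String], key: Int, m: LookupMetrics): Option[String] = {
  m.keyLookups += 1
  table.get(key)
}
```

A typical mitigation (and the direction of the follow-up patch discussed below, as I understand it) is to accumulate such counts in a local variable and flush them to the metric once per batch rather than once per row.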
[jira] [Commented] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode
[ https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708245#comment-16708245 ]

indraneel r commented on SPARK-26206:
-------------------------------------

[~kabhwan] Here are some of my observations:
- The error only occurs when you get data for the second batch. The first
batch goes through fine. Sometimes you may not see the output (not sure why),
but you can see it once you start pumping new data into the kafka topic, and
that is when it throws the error.
- The same query works well with spark 2.3.0.

Here is what some sample data looks like:

{code:java}
[
  {
    "timestamp": 1541043341540,
    "cid": "333-333-333",
    "uid": "11-111-111",
    "sessionId": "11-111-111",
    "merchantId": "",
    "event": "-222-222",
    "ip": "1.1.1.1",
    "refUrl": "",
    "referrer": "",
    "section": "lorem",
    "tag": "lorem,ipsum",
    "eventType": "Random_event_1",
    "sid": "qwwewew"
  },
  {
    "timestamp": 1541043341540,
    "cid": "333-444-444",
    "uid": "11-555-111",
    "sessionId": "11-111-111",
    "merchantId": "3331",
    "event": "-222-333",
    "ip": "1.1.2.1",
    "refUrl": "",
    "referrer": "",
    "section": "ipsum",
    "tag": "lorem,ipsum2",
    "eventType": "Random_event_2",
    "sid": "xxxdfffwewe"
  }
]
{code}

> Spark structured streaming with kafka integration fails in update mode
> ----------------------------------------------------------------------
>
>                 Key: SPARK-26206
>                 URL: https://issues.apache.org/jira/browse/SPARK-26206
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.0
>         Environment: Operating system: MacOS Mojave
> spark version: 2.4.0
> spark-sql-kafka-0-10: 2.4.0
> kafka version: 1.1.1
> scala version: 2.12.7
>            Reporter: indraneel r
>            Priority: Major
>
> Spark structured streaming with kafka integration fails in update mode with
> a compilation exception in code generation.
> Here's the code that was executed:
> {code:java}
> override def main(args: Array[String]): Unit = {
>   val spark = SparkSession
>     .builder
>     .master("local[*]")
>     .appName("SparkStreamingTest")
>     .getOrCreate()
>
>   val kafkaParams = Map[String, String](
>     "kafka.bootstrap.servers" -> "localhost:9092",
>     "startingOffsets" -> "earliest",
>     "subscribe" -> "test_events")
>
>   val schema = Encoders.product[UserEvent].schema
>   val query = spark.readStream.format("kafka")
>     .options(kafkaParams)
>     .load()
>     .selectExpr("CAST(value AS STRING) as message")
>     .select(from_json(col("message"), schema).as("json"))
>     .select("json.*")
>     .groupBy(window(col("event_time"), "10 minutes"))
>     .count()
>     .writeStream
>     .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
>       println(s"batch : ${batchId}")
>       batch.show(false)
>     }
>     .outputMode("update")
>     .start()
>
>   query.awaitTermination()
> }
> {code}
> It succeeds for batch 0 but fails for batch 1 with the following exception
> when more data arrives in the stream.
> {code:java}
> 18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, Column 18: A method named "putLong" is not declared in any enclosing class nor any supertype, nor through a static import
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 25, Column 18: A method named "putLong" is not declared in any enclosing class nor any supertype, nor through a static import
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
>     at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
>     at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
>     at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
>     at org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
>     at org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
>     at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
>     at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
>     at org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
>     at org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
>     at
[jira] [Assigned] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26155:
------------------------------------
    Assignee:  (was: Apache Spark)

> Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS
> in 3TB scale
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-26155
>                 URL: https://issues.apache.org/jira/browse/SPARK-26155
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>            Reporter: Ke Jia
>            Priority: Major
>       Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis in Spark2.3 without L486&487.pdf, q19.sql
>
> In our test environment, we found a serious performance degradation in
> Spark 2.3 when running TPC-DS on SKX 8180; several queries regress badly.
> For example, TPC-DS Q19 needs 126 seconds with Spark 2.3 on 3TB data, while
> it needs only 29 seconds with Spark 2.1. We investigated this problem and
> found the root cause in community patch SPARK-21052, which adds metrics to
> the hash join process. The impacted code is
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds
> with them.
[jira] [Assigned] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26155:
------------------------------------
    Assignee: Apache Spark

> Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS
> in 3TB scale
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-26155
>                 URL: https://issues.apache.org/jira/browse/SPARK-26155
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>            Reporter: Ke Jia
>            Assignee: Apache Spark
>            Priority: Major
>       Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis in Spark2.3 without L486&487.pdf, q19.sql
>
> In our test environment, we found a serious performance degradation in
> Spark 2.3 when running TPC-DS on SKX 8180; several queries regress badly.
> For example, TPC-DS Q19 needs 126 seconds with Spark 2.3 on 3TB data, while
> it needs only 29 seconds with Spark 2.1. We investigated this problem and
> found the root cause in community patch SPARK-21052, which adds metrics to
> the hash join process. The impacted code is
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds
> with them.
[jira] [Commented] (SPARK-26051) Can't create table with column name '22222d'
[ https://issues.apache.org/jira/browse/SPARK-26051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708233#comment-16708233 ]

Dilip Biswal commented on SPARK-26051:
--------------------------------------

[~xiejuntao1...@163.com] Hello, I took a quick look at this. `2d` is parsed
as a DOUBLE_LITERAL; that's the reason it's not allowed as a column name.
Can you check other systems? I checked hive and db2, and both of these
systems do not allow numeric literals as column names.
{quote}
db2 => create table t1(2d int)
DB21034E  The command was processed as an SQL statement because it was not a
valid Command Line Processor command.  During SQL processing it returned:
SQL0103N  The numeric literal "2d" is not valid.  SQLSTATE=42604
{quote}

> Can't create table with column name '2d'
> ----------------------------------------
>
>                 Key: SPARK-26051
>                 URL: https://issues.apache.org/jira/browse/SPARK-26051
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Xie Juntao
>            Priority: Minor
>
> I can't create a table in which the column name is '2d' when I use
> spark-sql. It seems to be a SQL parser bug, because creating a table with
> the column name '2m' works.
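The lexing behaviour Dilip describes can be illustrated without Spark: a token made of digits followed by `d`/`D` matches the shape of a double literal, so the lexer emits a literal token before identifier rules are ever considered, while `2m` matches no literal rule and can still lex as an identifier. The regex below is an illustrative stand-in, not Spark's actual grammar; backtick-quoting the name (e.g. `` `2d` ``) is, as far as I know, the usual Spark SQL workaround for such names:

```scala
// Stand-in for the DOUBLE_LITERAL token shape: digits followed by d/D.
// Spark's real grammar (SqlBase.g4) is richer; this only illustrates why
// "2d" is claimed by a literal rule while "2m" is not.
val doubleLiteralShape = """\d+[dD]""".r

def lexesAsDoubleLiteral(token: String): Boolean =
  doubleLiteralShape.pattern.matcher(token).matches()
```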
> {code:java}
> spark-sql> create table t1(2d int);
> Error in query:
> no viable alternative at input 'create table t1(2d'(line 1, pos 16)
>
> == SQL ==
> create table t1(2d int)
>                 ^^^
>
> spark-sql> create table t1(2m int);
> 18/11/14 09:13:53 INFO HiveMetaStore: 0: get_database: global_temp
> 18/11/14 09:13:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: global_temp
> 18/11/14 09:13:53 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
> 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_database: default
> 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
> 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_database: default
> 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
> 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_table : db=default tbl=t1
> 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=t1
> 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_database: default
> 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
> 18/11/14 09:13:55 INFO HiveMetaStore: 0: create_table: Table(tableName:t1, dbName:default, owner:root, createTime:1542158033, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:2m, type:int, comment:null)], location:file:/opt/UQuery/spark_/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t1, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{spark.sql.sources.schema.part.0={"type":"struct","fields":[{"name":"2m","type":"integer","nullable":true,"metadata":{}}]}, spark.sql.sources.schema.numParts=1, spark.sql.create.version=2.3.1}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, rolePrivileges:null))
> 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=create_table: Table(tableName:t1, dbName:default, owner:root, createTime:1542158033, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:2m, type:int, comment:null)], location:file:/opt/UQuery/spark_/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t1, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{})), partitionKeys:[], parameters:{spark.sql.sources.schema.part.0={"type":"struct","fields":[{"name":"2m","type":"integer","nullable":true,"metadata":{}}]}, spark.sql.sources.schema.numParts=1, spark.sql.create.version=2.3.1}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, rolePrivileges:null))
> 18/11/14 09:13:55 WARN HiveMetaStore: Location:
[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708225#comment-16708225 ]

Wenchen Fan commented on SPARK-26155:
-------------------------------------

Great to hear that! Looking forward to your patch :)

> Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS
> in 3TB scale
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-26155
>                 URL: https://issues.apache.org/jira/browse/SPARK-26155
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>            Reporter: Ke Jia
>            Priority: Major
>       Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis in Spark2.3 without L486&487.pdf, q19.sql
>
> In our test environment, we found a serious performance degradation in
> Spark 2.3 when running TPC-DS on SKX 8180; several queries regress badly.
> For example, TPC-DS Q19 needs 126 seconds with Spark 2.3 on 3TB data, while
> it needs only 29 seconds with Spark 2.1. We investigated this problem and
> found the root cause in community patch SPARK-21052, which adds metrics to
> the hash join process. The impacted code is
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds
> with them.
[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708223#comment-16708223 ]

Ke Jia commented on SPARK-26155:
--------------------------------

[~cloud_fan] [~viirya] Spark 2.3 with the optimized patch achieves the same
performance as Spark 2.1:

||spark2.1||spark2.3 with patch||
|49s|47s|

> Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS
> in 3TB scale
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-26155
>                 URL: https://issues.apache.org/jira/browse/SPARK-26155
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>            Reporter: Ke Jia
>            Priority: Major
>       Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis in Spark2.3 without L486&487.pdf, q19.sql
>
> In our test environment, we found a serious performance degradation in
> Spark 2.3 when running TPC-DS on SKX 8180; several queries regress badly.
> For example, TPC-DS Q19 needs 126 seconds with Spark 2.3 on 3TB data, while
> it needs only 29 seconds with Spark 2.1. We investigated this problem and
> found the root cause in community patch SPARK-21052, which adds metrics to
> the hash join process. The impacted code is
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds
> with them.
[jira] [Commented] (SPARK-26244) Do not use case class as public API
[ https://issues.apache.org/jira/browse/SPARK-26244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708220#comment-16708220 ]

Wenchen Fan commented on SPARK-26244:
-------------------------------------

I took a quick look, and it seems most of the public case class APIs are OK,
like the `SparkListenerEvent` implementations. Feel free to add more subtasks
if you find something to fix. cc [~srowen] [~hyukjin.kwon]

> Do not use case class as public API
> -----------------------------------
>
>                 Key: SPARK-26244
>                 URL: https://issues.apache.org/jira/browse/SPARK-26244
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 3.0.0
>            Reporter: Wenchen Fan
>            Priority: Major
>              Labels: release-notes
>
> It's a bad idea to use a case class as a public API, as it has a very wide
> surface: the copy method, its fields, the companion object, etc. We don't
> want to expose so much to end users; usually we only want to expose a few
> methods.
> We should use a pure trait as the public API, and a case class as the
> implementation, which should be private and hidden from end users.
> Changing a class to an interface is not binary compatible (but it is source
> compatible), so 3.0 is a good chance to do it.
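The proposal in the description can be sketched in a few lines: expose a pure trait, keep the case class private inside the companion object, and hand out instances through a factory method. The names below (`StageInfo` and its fields) are illustrative, not an actual Spark API:

```scala
// Public surface: just the trait. End users cannot see copy(), unapply(),
// productArity, or any other member synthesized for the case class.
trait StageInfo {
  def stageId: Int
  def name: String
}

object StageInfo {
  // Private implementation: the case class's wide auto-generated API is
  // hidden; only the factory below is visible outside this object.
  private case class StageInfoImpl(stageId: Int, name: String) extends StageInfo

  def apply(stageId: Int, name: String): StageInfo = StageInfoImpl(stageId, name)
}
```

A side effect of this design is that new fields can later be added to the implementation without breaking binary compatibility of the public trait, which is exactly the flexibility the ticket is after.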
[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708219#comment-16708219 ]

Yang Jie commented on SPARK-26155:
----------------------------------

[~cloud_fan] [~viirya] Maybe we don't need to revert this patch. After
offline discussion and testing with [~Jk_Self], we will provide a patch to
optimize the performance.

> Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS
> in 3TB scale
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-26155
>                 URL: https://issues.apache.org/jira/browse/SPARK-26155
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>            Reporter: Ke Jia
>            Priority: Major
>       Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis in Spark2.3 without L486&487.pdf, q19.sql
>
> In our test environment, we found a serious performance degradation in
> Spark 2.3 when running TPC-DS on SKX 8180; several queries regress badly.
> For example, TPC-DS Q19 needs 126 seconds with Spark 2.3 on 3TB data, while
> it needs only 29 seconds with Spark 2.1. We investigated this problem and
> found the root cause in community patch SPARK-21052, which adds metrics to
> the hash join process. The impacted code is
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487].
> Q19 costs about 30 seconds without these two lines of code and 126 seconds
> with them.
[jira] [Commented] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708218#comment-16708218 ]

Hyukjin Kwon commented on SPARK-26259:
--------------------------------------

Can you try the examples against the current master branch of Spark, and
describe the expected input and output if there's an issue?

> RecordSeparator other than newline discovers incorrect schema
> -------------------------------------------------------------
>
>                 Key: SPARK-26259
>                 URL: https://issues.apache.org/jira/browse/SPARK-26259
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: PoojaMurarka
>            Priority: Major
>
> Although SPARK-21289 (https://issues.apache.org/jira/browse/SPARK-21289),
> fixed in Spark 2.3, allows record separators other than newline, this does
> not work when the schema is not specified, i.e. while inferring the schema.
> Let me try to explain this using the data and scenarios below.
> Input data (input_data.csv), *+where recordSeparator is "\t"+*:
> {noformat}
> "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed"
> "2012-01-01","0","0","0","0","1","9","9.1","66","0"
> "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat}
> *Case 1: Schema defined:* The Spark code below with a defined *schema* reads
> the data correctly:
> {code:java}
> val customSchema = StructType(Array(
>   StructField("dteday", DateType, true),
>   StructField("hr", IntegerType, true),
>   StructField("holiday", IntegerType, true),
>   StructField("weekday", IntegerType, true),
>   StructField("workingday", DateType, true),
>   StructField("weathersit", IntegerType, true),
>   StructField("temp", IntegerType, true),
>   StructField("atemp", DoubleType, true),
>   StructField("hum", IntegerType, true),
>   StructField("windspeed", IntegerType, true)));
>
> Dataset ds = executionContext.getSparkSession().read().format( "csv" )
>   .option( "header", true )
>   .option( "schema", customSchema)
>   .option( "sep", "," )
>   .load( "input_data.csv" );
> {code}
> *Case 2: Schema not defined (inferSchema is used):* Incorrect data parsing
> is done, i.e. the entire data is read as column names.
> {code:java}
> Dataset ds = executionContext.getSparkSession().read().format( "csv" )
>   .option( "header", true )
>   .option( "inferSchema", true)
>   .option( "sep", "," )
>   .load( "input_data.csv" );
> {code}
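The reported failure mode can be illustrated without Spark: if records end with a tab but the reader splits on newlines, the whole file arrives as a single record, and header handling then turns that one giant record into the column names. A minimal sketch of record splitting under an explicit separator (illustrative only, not Spark's CSV reader):

```scala
// Split raw input into records on an arbitrary separator string.
// Pattern.quote makes the separator literal, so "\t", "|", etc. work as-is.
def splitRecords(data: String, recordSep: String): Seq[String] =
  data.split(java.util.regex.Pattern.quote(recordSep)).toSeq
```

With the separator honored, the header record and data records come apart correctly; with a hard-coded newline, a tab-separated file yields a single record, matching the "entire data is read as column names" symptom above.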
[jira] [Assigned] (SPARK-26262) Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false
[ https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26262:
------------------------------------
    Assignee: Apache Spark

> Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false
> -----------------------------------------------------------
>
>                 Key: SPARK-26262
>                 URL: https://issues.apache.org/jira/browse/SPARK-26262
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Takeshi Yamamuro
>            Assignee: Apache Spark
>            Priority: Minor
>
> For better test coverage, we need to set `WHOLESTAGE_CODEGEN_ENABLED` to
> `false` for interpreted execution tests when running `SQLQueryTestSuite`.
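The suite change amounts to running the same queries once per codegen setting so interpreted execution is also covered. A generic sketch of the run-under-config pattern; the mutable map stands in for `SQLConf`, and `withConf` is illustrative rather than the real test helper:

```scala
// Run `body` with `key` temporarily set to `value`, restoring the previous
// value (or absence) afterwards, so settings never leak across tests.
def withConf[T](conf: scala.collection.mutable.Map[String, String])(key: String, value: String)(body: => T): T = {
  val old = conf.get(key)
  conf(key) = value
  try body
  finally {
    old match {
      case Some(v) => conf(key) = v
      case None    => conf -= key
    }
  }
}
```

A suite would then wrap its query runner in `Seq("true", "false").foreach { v => withConf(conf)("spark.sql.codegen.wholeStage", v) { runQueries() } }` to exercise both the codegen and interpreted paths.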
[jira] [Commented] (SPARK-26262) Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false
[ https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708184#comment-16708184 ] Apache Spark commented on SPARK-26262: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/23213 > Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false > --- > > Key: SPARK-26262 > URL: https://issues.apache.org/jira/browse/SPARK-26262 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > For better test coverage, we need to set `false` at > `WHOLESTAGE_CODEGEN_ENABLED` for interpreter execution tests when running > `SQLQueryTestSuite`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26262) Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false
[ https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26262: Assignee: (was: Apache Spark) > Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false > --- > > Key: SPARK-26262 > URL: https://issues.apache.org/jira/browse/SPARK-26262 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > For better test coverage, we need to set `false` at > `WHOLESTAGE_CODEGEN_ENABLED` for interpreter execution tests when running > `SQLQueryTestSuite`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26262) Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false
Takeshi Yamamuro created SPARK-26262: Summary: Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false Key: SPARK-26262 URL: https://issues.apache.org/jira/browse/SPARK-26262 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Takeshi Yamamuro For better test coverage, we need to set `WHOLESTAGE_CODEGEN_ENABLED` to `false` for interpreter execution tests when running `SQLQueryTestSuite`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
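The coverage idea behind this ticket — run the same query suite on both the code-generated and the interpreted execution path and require identical results — can be sketched in miniature. This is a hypothetical illustration in plain Python, not Spark's SQLQueryTestSuite; the names (`run_suite`, `eval_codegen`, `eval_interpreted`) are invented for the example.

```python
import ast
import operator

QUERIES = ["1 + 2", "3 * 4 - 5", "(7 + 1) / 2"]

def eval_codegen(expr):
    # "whole-stage codegen" analogue: compile the expression once, then run it
    return eval(compile(expr, "<gen>", "eval"))

def eval_interpreted(expr):
    # "interpreter" analogue: walk the AST node by node
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError(node)
    return walk(ast.parse(expr, mode="eval"))

def run_suite():
    # Running every query on both paths catches bugs that live in only one path.
    results = {}
    for q in QUERIES:
        a, b = eval_codegen(q), eval_interpreted(q)
        assert a == b, f"paths disagree on {q}: {a} != {b}"
        results[q] = a
    return results
```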
[jira] [Comment Edited] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708161#comment-16708161 ] PoojaMurarka edited comment on SPARK-26259 at 12/4/18 4:14 AM: --- The fix for using custom record delimiters seems to be only available when schema is specified based on the examples. Please correct me if I am wrong. Rather I am looking for setting custom record delimiter while discovery schema i.e. only use *inferschema* as true rather than specifying schema. Let me know if above issue covers both scenarios. was (Author: pooja.murarka): The fix for using custom record delimiters seems to be only available when schema is specified based on the examples. Please correct me if I am wrong. Rather I am looking for setting custom record delimiter while discovery schema i.e. only use inferschema as true rather than specifying schema. Let me know if above issue covers both scenarios. > RecordSeparator other than newline discovers incorrect schema > - > > Key: SPARK-26259 > URL: https://issues.apache.org/jira/browse/SPARK-26259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: PoojaMurarka >Priority: Major > > Though JIRA: https://issues.apache.org/jira/browse/SPARK-21289 has been fixed > in SPARK 2.3 which allows record Separators other than new line but this > doesn't work when schema is not specified i.e. 
while inferring the schema > Let me try to explain this using below data and scenarios: > Input Data - (input_data.csv) as shown below: *+where recordSeparator is > "\t"+* > {noformat} > "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" > "2012-01-01","0","0","0","0","1","9","9.1","66","0" > "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat} > *Case 1: Schema Defined *: Below Spark code with defined *schema* reads data > correctly: > {code:java} > val customSchema = StructType(Array( > StructField("dteday", DateType, true), > StructField("hr", IntegerType, true), > StructField("holiday", IntegerType, true), > StructField("weekday", IntegerType, true), > StructField("workingday", DateType, true), > StructField("weathersit", IntegerType, true), > StructField("temp", IntegerType, true), > StructField("atemp", DoubleType, true), > StructField("hum", IntegerType, true), > StructField("windspeed", IntegerType, true))); > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "schema", customSchema) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} > *Case 2: Schema not defined (inferSchema is used):* Incorrect data parsing is > done i.e. entire data is read as column names. > {code:java} > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "inferSchema", true) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708161#comment-16708161 ] PoojaMurarka commented on SPARK-26259: -- The fix for using custom record delimiters seems to be only available when schema is specified based on the examples. Please correct me if I am wrong. Rather I am looking for setting custom record delimiter while discovery schema i.e. only use inferschema as true rather than specifying schema. Let me know if above issue covers both scenarios. > RecordSeparator other than newline discovers incorrect schema > - > > Key: SPARK-26259 > URL: https://issues.apache.org/jira/browse/SPARK-26259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: PoojaMurarka >Priority: Major > > Though JIRA: https://issues.apache.org/jira/browse/SPARK-21289 has been fixed > in SPARK 2.3 which allows record Separators other than new line but this > doesn't work when schema is not specified i.e. 
while inferring the schema > Let me try to explain this using below data and scenarios: > Input Data - (input_data.csv) as shown below: *+where recordSeparator is > "\t"+* > {noformat} > "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" > "2012-01-01","0","0","0","0","1","9","9.1","66","0" > "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat} > *Case 1: Schema Defined *: Below Spark code with defined *schema* reads data > correctly: > {code:java} > val customSchema = StructType(Array( > StructField("dteday", DateType, true), > StructField("hr", IntegerType, true), > StructField("holiday", IntegerType, true), > StructField("weekday", IntegerType, true), > StructField("workingday", DateType, true), > StructField("weathersit", IntegerType, true), > StructField("temp", IntegerType, true), > StructField("atemp", DoubleType, true), > StructField("hum", IntegerType, true), > StructField("windspeed", IntegerType, true))); > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "schema", customSchema) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} > *Case 2: Schema not defined (inferSchema is used):* Incorrect data parsing is > done i.e. entire data is read as column names. > {code:java} > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "inferSchema", true) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26261) Spark does not check completeness temporary file
[ https://issues.apache.org/jira/browse/SPARK-26261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708153#comment-16708153 ] Hyukjin Kwon commented on SPARK-26261: -- Mind if I ask what initial test you ran? > Spark does not check completeness temporary file > - > > Key: SPARK-26261 > URL: https://issues.apache.org/jira/browse/SPARK-26261 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Jialin LIu >Priority: Minor > > Spark does not check temporary files' completeness. When persisting to disk > is enabled on some RDDs, a bunch of temporary files will be created in the > blockmgr folder. The block manager is able to detect missing blocks, but it is > not able to detect file content being modified during execution. > Our initial test shows that if we truncate a block file before it is used > by executors, the program finishes without detecting any error, but the > result content is totally wrong. > We believe there should be a checksum on every RDD file block, and these > files should be protected by it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26251) isnan function not picking non-numeric values
[ https://issues.apache.org/jira/browse/SPARK-26251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708129#comment-16708129 ] Hyukjin Kwon commented on SPARK-26251: -- Why should it pick "po box 7896"? > isnan function not picking non-numeric values > - > > Key: SPARK-26251 > URL: https://issues.apache.org/jira/browse/SPARK-26251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kunal Rao >Priority: Minor > > import org.apache.spark.sql.functions._ > List("po box 7896", "8907", > "435435").toDF("rgid").filter(isnan(col("rgid"))).show > > should pick "po box 7896" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26250) Fail to run dataframe.R examples
[ https://issues.apache.org/jira/browse/SPARK-26250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708125#comment-16708125 ] Hyukjin Kwon commented on SPARK-26250: -- Please avoid setting a target version, which is usually reserved for committers. > Fail to run dataframe.R examples > > > Key: SPARK-26250 > URL: https://issues.apache.org/jira/browse/SPARK-26250 > Project: Spark > Issue Type: Test > Components: Examples >Affects Versions: 2.4.0 >Reporter: Jean Pierre PIN >Priority: Major > > I get an error=2 running spark-submit examples/src/main/r/dataframe.R > The script works in RStudio, but I've changed the library(SparkR) line > to this one: > library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) > I am in the top-level directory of the Spark installation and the PATH variable > includes /bin, so spark-submit is found. System: Windows > 7 Ultimate 64-bit. > Exception in thread "main" java.io.IOException: Cannot run program "Rscript": CreateProcess > error=2, The system cannot find the file specified > I think the issue has been known for a long time, but I can't find any post. > Thanks for any answer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
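The `CreateProcess error=2` above means spark-submit could not launch `Rscript`: the R script runner is not on PATH. A quick pre-flight check can confirm this before running the example. This is a plain-Python sketch (on Windows the executable is `Rscript.exe`, found the same way); the helper name is invented for illustration.

```python
import shutil

def rscript_available() -> bool:
    # shutil.which searches PATH the same way the OS process launcher does
    return shutil.which("Rscript") is not None

if not rscript_available():
    print("Add R's bin directory (containing Rscript) to PATH before running "
          "spark-submit examples/src/main/r/dataframe.R")
```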
[jira] [Resolved] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26259. -- Resolution: Duplicate > RecordSeparator other than newline discovers incorrect schema > - > > Key: SPARK-26259 > URL: https://issues.apache.org/jira/browse/SPARK-26259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: PoojaMurarka >Priority: Major > > Though JIRA: https://issues.apache.org/jira/browse/SPARK-21289 has been fixed > in SPARK 2.3 which allows record Separators other than new line but this > doesn't work when schema is not specified i.e. while inferring the schema > Let me try to explain this using below data and scenarios: > Input Data - (input_data.csv) as shown below: *+where recordSeparator is > "\t"+* > {noformat} > "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" > "2012-01-01","0","0","0","0","1","9","9.1","66","0" > "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat} > *Case 1: Schema Defined *: Below Spark code with defined *schema* reads data > correctly: > {code:java} > val customSchema = StructType(Array( > StructField("dteday", DateType, true), > StructField("hr", IntegerType, true), > StructField("holiday", IntegerType, true), > StructField("weekday", IntegerType, true), > StructField("workingday", DateType, true), > StructField("weathersit", IntegerType, true), > StructField("temp", IntegerType, true), > StructField("atemp", DoubleType, true), > StructField("hum", IntegerType, true), > StructField("windspeed", IntegerType, true))); > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "schema", customSchema) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} > *Case 2: Schema not defined (inferSchema is used):* Incorrect data parsing is > done i.e. entire data is read as column names. 
> {code:java} > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "inferSchema", true) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26259: - Component/s: (was: Spark Core) SQL > RecordSeparator other than newline discovers incorrect schema > - > > Key: SPARK-26259 > URL: https://issues.apache.org/jira/browse/SPARK-26259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: PoojaMurarka >Priority: Major > > Though JIRA: https://issues.apache.org/jira/browse/SPARK-21289 has been fixed > in SPARK 2.3 which allows record Separators other than new line but this > doesn't work when schema is not specified i.e. while inferring the schema > Let me try to explain this using below data and scenarios: > Input Data - (input_data.csv) as shown below: *+where recordSeparator is > "\t"+* > {noformat} > "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" > "2012-01-01","0","0","0","0","1","9","9.1","66","0" > "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat} > *Case 1: Schema Defined *: Below Spark code with defined *schema* reads data > correctly: > {code:java} > val customSchema = StructType(Array( > StructField("dteday", DateType, true), > StructField("hr", IntegerType, true), > StructField("holiday", IntegerType, true), > StructField("weekday", IntegerType, true), > StructField("workingday", DateType, true), > StructField("weathersit", IntegerType, true), > StructField("temp", IntegerType, true), > StructField("atemp", DoubleType, true), > StructField("hum", IntegerType, true), > StructField("windspeed", IntegerType, true))); > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "schema", customSchema) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} > *Case 2: Schema not defined (inferSchema is used):* Incorrect data parsing is > done i.e. entire data is read as column names. 
> {code:java} > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "inferSchema", true) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708136#comment-16708136 ] Hyukjin Kwon commented on SPARK-26259: -- This is fixed in SPARK-26108, which should be available in the upcoming Spark version. > RecordSeparator other than newline discovers incorrect schema > - > > Key: SPARK-26259 > URL: https://issues.apache.org/jira/browse/SPARK-26259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: PoojaMurarka >Priority: Major > > Though JIRA: https://issues.apache.org/jira/browse/SPARK-21289 has been fixed > in SPARK 2.3 which allows record Separators other than new line but this > doesn't work when schema is not specified i.e. while inferring the schema > Let me try to explain this using below data and scenarios: > Input Data - (input_data.csv) as shown below: *+where recordSeparator is > "\t"+* > {noformat} > "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" > "2012-01-01","0","0","0","0","1","9","9.1","66","0" > "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat} > *Case 1: Schema Defined *: Below Spark code with defined *schema* reads data > correctly: > {code:java} > val customSchema = StructType(Array( > StructField("dteday", DateType, true), > StructField("hr", IntegerType, true), > StructField("holiday", IntegerType, true), > StructField("weekday", IntegerType, true), > StructField("workingday", DateType, true), > StructField("weathersit", IntegerType, true), > StructField("temp", IntegerType, true), > StructField("atemp", DoubleType, true), > StructField("hum", IntegerType, true), > StructField("windspeed", IntegerType, true))); > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "schema", customSchema) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} > *Case 2: Schema not defined (inferSchema is used):* Incorrect data
parsing is > done i.e. entire data is read as column names. > {code:java} > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "inferSchema", true) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21289) Text based formats do not support custom end-of-line delimiters
[ https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-21289: - Priority: Major (was: Minor) > Text based formats do not support custom end-of-line delimiters > --- > > Key: SPARK-21289 > URL: https://issues.apache.org/jira/browse/SPARK-21289 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.1.1, 2.3.0 >Reporter: Yevgen Galchenko >Priority: Major > > Spark csv and text readers always use default CR, LF or CRLF line terminators > without an option to configure a custom delimiter. > Option "textinputformat.record.delimiter" is not being used to set delimiter > in HadoopFileLinesReader and can only be set for Hadoop RDD when textFile() > is used to read file. > Possible solution would be to change HadoopFileLinesReader and create > LineRecordReader with delimiters specified in configuration. LineRecordReader > already supports passing recordDelimiter in its constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
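The proposed fix above amounts to a record reader that splits a byte stream on an arbitrary delimiter instead of hard-coded CR/LF, while handling delimiters that straddle chunk boundaries. The following is a hypothetical plain-Python sketch of that idea, not Hadoop's actual LineRecordReader.

```python
import io

def read_records(stream, delimiter: bytes, chunk_size: int = 8192):
    """Yield records from a binary stream, split on an arbitrary delimiter."""
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        parts = buf.split(delimiter)
        buf = parts.pop()          # the tail may be an incomplete record
        yield from parts
    if buf:
        yield buf                  # final record without trailing delimiter

# Records separated by "|" rather than a newline:
recs = list(read_records(io.BytesIO(b"r1|r2|r3"), b"|"))
```

Keeping the unsplit tail in `buf` is what makes delimiters that span a chunk boundary come out whole.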
[jira] [Commented] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708134#comment-16708134 ] Hyukjin Kwon commented on SPARK-26259: -- Please avoid setting the fix version, which is usually set only after the issue is actually fixed. > RecordSeparator other than newline discovers incorrect schema > - > > Key: SPARK-26259 > URL: https://issues.apache.org/jira/browse/SPARK-26259 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: PoojaMurarka >Priority: Major > > Though JIRA: https://issues.apache.org/jira/browse/SPARK-21289 has been fixed > in SPARK 2.3 which allows record Separators other than new line but this > doesn't work when schema is not specified i.e. while inferring the schema > Let me try to explain this using below data and scenarios: > Input Data - (input_data.csv) as shown below: *+where recordSeparator is > "\t"+* > {noformat} > "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" > "2012-01-01","0","0","0","0","1","9","9.1","66","0" > "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat} > *Case 1: Schema Defined *: Below Spark code with defined *schema* reads data > correctly: > {code:java} > val customSchema = StructType(Array( > StructField("dteday", DateType, true), > StructField("hr", IntegerType, true), > StructField("holiday", IntegerType, true), > StructField("weekday", IntegerType, true), > StructField("workingday", DateType, true), > StructField("weathersit", IntegerType, true), > StructField("temp", IntegerType, true), > StructField("atemp", DoubleType, true), > StructField("hum", IntegerType, true), > StructField("windspeed", IntegerType, true))); > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "schema", customSchema) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} > *Case 2: Schema not defined (inferSchema is used):* Incorrect
data parsing is > done i.e. entire data is read as column names. > {code:java} > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "inferSchema", true) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26259: - Fix Version/s: (was: 2.4.1) > RecordSeparator other than newline discovers incorrect schema > - > > Key: SPARK-26259 > URL: https://issues.apache.org/jira/browse/SPARK-26259 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: PoojaMurarka >Priority: Major > > Though JIRA: https://issues.apache.org/jira/browse/SPARK-21289 has been fixed > in SPARK 2.3 which allows record Separators other than new line but this > doesn't work when schema is not specified i.e. while inferring the schema > Let me try to explain this using below data and scenarios: > Input Data - (input_data.csv) as shown below: *+where recordSeparator is > "\t"+* > {noformat} > "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" > "2012-01-01","0","0","0","0","1","9","9.1","66","0" > "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat} > *Case 1: Schema Defined *: Below Spark code with defined *schema* reads data > correctly: > {code:java} > val customSchema = StructType(Array( > StructField("dteday", DateType, true), > StructField("hr", IntegerType, true), > StructField("holiday", IntegerType, true), > StructField("weekday", IntegerType, true), > StructField("workingday", DateType, true), > StructField("weathersit", IntegerType, true), > StructField("temp", IntegerType, true), > StructField("atemp", DoubleType, true), > StructField("hum", IntegerType, true), > StructField("windspeed", IntegerType, true))); > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "schema", customSchema) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} > *Case 2: Schema not defined (inferSchema is used):* Incorrect data parsing is > done i.e. entire data is read as column names. 
> {code:java} > Dataset ds = executionContext.getSparkSession().read().format( "csv" ) > .option( "header", true ) > .option( "inferSchema", true) > .option( "sep", "," ) > .load( "input_data.csv" ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26255) Custom error/exception is not thrown for the SQL tab when UI filters are added in spark-sql launch
[ https://issues.apache.org/jira/browse/SPARK-26255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708132#comment-16708132 ] Hyukjin Kwon commented on SPARK-26255: -- It would be great if some screenshots from the UI were uploaded here. > Custom error/exception is not thrown for the SQL tab when UI filters are > added in spark-sql launch > -- > > Key: SPARK-26255 > URL: https://issues.apache.org/jira/browse/SPARK-26255 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.3.2 > Environment: 【Test Environment】: > Server OS :-SUSE > No. of Cluster Node:-3 > Spark Version:- 2.3.2 > Hadoop Version:-3.1 >Reporter: Sushanta Sen >Priority: Major > > 【Detailed description】:Custom error is not thrown for the SQL tab when UI > filters are added in spark-sql launch > 【Precondition】: > 1.Cluster is up and running【Test step】: > 1. Launch spark sql as below: > [spark-sql --master yarn --conf > spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter > --conf > spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=simple"] > 2. Go to Yarn application list UI link > 3. Launch the application master for the Spark-SQL app ID > 4. It will display an error > 5. Append /executors, /stages, /jobs, /environment, /SQL > 【Expect Output】:An error should be displayed "An error has occurred. Please > check for all the TABS" > 【Actual Output】:The error message is displayed for all the tabs except the SQL > tab. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26255) Custom error/exception is not thrown for the SQL tab when UI filters are added in spark-sql launch
[ https://issues.apache.org/jira/browse/SPARK-26255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26255: - Component/s: (was: Spark Core) Web UI SQL > Custom error/exception is not thrown for the SQL tab when UI filters are > added in spark-sql launch > -- > > Key: SPARK-26255 > URL: https://issues.apache.org/jira/browse/SPARK-26255 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.3.2 > Environment: 【Test Environment】: > Server OS :-SUSE > No. of Cluster Node:-3 > Spark Version:- 2.3.2 > Hadoop Version:-3.1 >Reporter: Sushanta Sen >Priority: Major > > 【Detailed description】:Custom error is not thrown for the SQL tab when UI > filters are added in spark-sql launch > 【Precondition】: > 1.Cluster is up and running【Test step】: > 1. Launch spark sql as below: > [spark-sql --master yarn --conf > spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter > --conf > spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=simple"] > 2. Go to Yarn application list UI link > 3. Launch the application master for the Spark-SQL app ID > 4. It will display an error > 5. Append /executors, /stages, /jobs, /environment, /SQL > 【Expect Output】:An error should be displayed "An error has occurred. Please > check for all the TABS > 【Actual Output】:The error message is displayed for all the tabs except SQL > tab . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26149) Read UTF8String from Parquet/ORC may be incorrect
[ https://issues.apache.org/jira/browse/SPARK-26149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708119#comment-16708119 ] Yuming Wang commented on SPARK-26149: - This is not a Spark bug, but a Hive bug. !image-2018-12-04-10-55-49-369.png! > Read UTF8String from Parquet/ORC may be incorrect > - > > Key: SPARK-26149 > URL: https://issues.apache.org/jira/browse/SPARK-26149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: SPARK-26149.snappy.parquet, > image-2018-12-04-10-55-49-369.png > > > How to reproduce: > {code:bash} > scala> > spark.read.parquet("/Users/yumwang/SPARK-26149/SPARK-26149.snappy.parquet").selectExpr("s1 > = s2").show > +-+ > |(s1 = s2)| > +-+ > |false| > +-+ > scala> val first = > spark.read.parquet("/Users/yumwang/SPARK-26149/SPARK-26149.snappy.parquet").collect().head > first: org.apache.spark.sql.Row = > [a0750c1f13f0k5��F8j���b�Ro'4da96,a0750c1f13f0k5��F8j���b�Ro'4da96] > scala> println(first.getString(0).equals(first.getString(1))) > true > {code} > {code:sql} > hive> CREATE TABLE `tb1` (`s1` STRING, `s2` STRING) > > stored as parquet > > location "/Users/yumwang/SPARK-26149"; > OK > Time taken: 0.224 seconds > hive> select s1 = s2 from tb1; > OK > true > Time taken: 0.167 seconds, Fetched: 1 row(s) > {code} > As you can see, only UTF8String returns {{false}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
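The symptom above is consistent with comparing at different layers: Spark's UTF8String compares raw bytes, while Hive/Java compare java.lang.String after a lossy UTF-8 decode. A plain-Python sketch (illustrative, not Spark or Hive code) of how two *different* byte sequences can decode, with replacement, to the *same* string:

```python
# Two values that differ only in an invalid UTF-8 tail byte.
b1, b2 = b"abc\xff", b"abc\xfe"

assert b1 != b2                                   # byte-level: not equal
s1 = b1.decode("utf-8", errors="replace")
s2 = b2.decode("utf-8", errors="replace")
assert s1 == s2 == "abc\ufffd"                    # both tails became U+FFFD
```

So `s1 = s2` can report true after decoding even though the stored bytes, and hence the byte-wise comparison, differ.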
[jira] [Issue Comment Deleted] (SPARK-26251) isnan function not picking non-numeric values
[ https://issues.apache.org/jira/browse/SPARK-26251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26251: - Comment: was deleted (was: Why does it should pick "po box 7896"?) > isnan function not picking non-numeric values > - > > Key: SPARK-26251 > URL: https://issues.apache.org/jira/browse/SPARK-26251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kunal Rao >Priority: Minor > > import org.apache.spark.sql.functions._ > List("po box 7896", "8907", > "435435").toDF("rgid").filter(isnan(col("rgid"))).show > > should pick "po box 7896" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26251) isnan function not picking non-numeric values
[ https://issues.apache.org/jira/browse/SPARK-26251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26251: - Component/s: (was: Spark Core) SQL > isnan function not picking non-numeric values > - > > Key: SPARK-26251 > URL: https://issues.apache.org/jira/browse/SPARK-26251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kunal Rao >Priority: Minor > > import org.apache.spark.sql.functions._ > List("po box 7896", "8907", > "435435").toDF("rgid").filter(isnan(col("rgid"))).show > > should pick "po box 7896" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26251) isnan function not picking non-numeric values
[ https://issues.apache.org/jira/browse/SPARK-26251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708128#comment-16708128 ] Hyukjin Kwon commented on SPARK-26251: -- Why should it pick "po box 7896"? > isnan function not picking non-numeric values > - > > Key: SPARK-26251 > URL: https://issues.apache.org/jira/browse/SPARK-26251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kunal Rao >Priority: Minor > > import org.apache.spark.sql.functions._ > List("po box 7896", "8907", > "435435").toDF("rgid").filter(isnan(col("rgid"))).show > > should pick "po box 7896"
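The confusion in this report is that {{isnan}} tests for the floating-point NaN value, not for "not parseable as a number". A plain-Python sketch (the helper names are illustrative, not Spark APIs) contrasts the two checks on the reporter's data:

```python
import math

def is_nan(value):
    """True only for a floating-point NaN, mirroring SQL isnan semantics."""
    return isinstance(value, float) and math.isnan(value)

def is_non_numeric(text):
    """True when a string cannot be parsed as a number (what the report expected)."""
    try:
        float(text)
        return False
    except ValueError:
        return True

values = ["po box 7896", "8907", "435435"]
print([is_nan(v) for v in values])          # [False, False, False]: no value is NaN
print([is_non_numeric(v) for v in values])  # [True, False, False]: only "po box 7896"
```

In Spark itself, something closer to the reporter's intent would be filtering on `col("rgid").cast("double").isNull`, which marks rows whose value cannot be cast to a number.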
[jira] [Updated] (SPARK-26149) Read UTF8String from Parquet/ORC may be incorrect
[ https://issues.apache.org/jira/browse/SPARK-26149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-26149: Attachment: image-2018-12-04-10-55-49-369.png > Read UTF8String from Parquet/ORC may be incorrect > - > > Key: SPARK-26149 > URL: https://issues.apache.org/jira/browse/SPARK-26149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: SPARK-26149.snappy.parquet, > image-2018-12-04-10-55-49-369.png > > > How to reproduce: > {code:bash} > scala> > spark.read.parquet("/Users/yumwang/SPARK-26149/SPARK-26149.snappy.parquet").selectExpr("s1 > = s2").show > +-+ > |(s1 = s2)| > +-+ > |false| > +-+ > scala> val first = > spark.read.parquet("/Users/yumwang/SPARK-26149/SPARK-26149.snappy.parquet").collect().head > first: org.apache.spark.sql.Row = > [a0750c1f13f0k5��F8j���b�Ro'4da96,a0750c1f13f0k5��F8j���b�Ro'4da96] > scala> println(first.getString(0).equals(first.getString(1))) > true > {code} > {code:sql} > hive> CREATE TABLE `tb1` (`s1` STRING, `s2` STRING) > > stored as parquet > > location "/Users/yumwang/SPARK-26149"; > OK > Time taken: 0.224 seconds > hive> select s1 = s2 from tb1; > OK > true > Time taken: 0.167 seconds, Fetched: 1 row(s) > {code} > As you can see, only UTF8String returns {{false}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26250) Fail to run dataframe.R examples
[ https://issues.apache.org/jira/browse/SPARK-26250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708126#comment-16708126 ] Hyukjin Kwon commented on SPARK-26250: -- The error literally means {{Rscript}} is not installed on your computer. > Fail to run dataframe.R examples > > > Key: SPARK-26250 > URL: https://issues.apache.org/jira/browse/SPARK-26250 > Project: Spark > Issue Type: Test > Components: Examples >Affects Versions: 2.4.0 >Reporter: Jean Pierre PIN >Priority: Major > > I get an error=2 running spark-submit examples/src/main/r/dataframe.R > the script is working with Rstudio but i've changed the library(SparkR) line > with this one > library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) > i am at the top root directory of spark installation and the path variable > for /bin is specified in the environment so spark-submit is found. On system > window 7 Ultimate 64bits > read "main" java.io.IOException: Cannot run program "Rscript": CreateProcess > error=2, The system cannot find the file specified > I think the issue is known for a long but i don't find any post. > Thanks for answer.
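spark-submit runs R examples by spawning the external {{Rscript}} binary, so the {{CreateProcess error=2}} simply means that binary is not on PATH. A small, generic Python sketch of the fail-fast check being suggested (not Spark code, and the tool names are just examples):

```python
import shutil

def require_tool(name):
    """Fail fast with a clear message when an external tool is missing from PATH."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(
            f"'{name}' not found on PATH; install it or add its directory to PATH")
    return path

# On the reporter's Windows machine this would raise for "Rscript" until R is
# installed and its bin directory is added to the PATH environment variable.
print(require_tool("sh"))  # "sh" is used here only because it exists on most systems
```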
[jira] [Resolved] (SPARK-26250) Fail to run dataframe.R examples
[ https://issues.apache.org/jira/browse/SPARK-26250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26250. -- Resolution: Invalid > Fail to run dataframe.R examples > > > Key: SPARK-26250 > URL: https://issues.apache.org/jira/browse/SPARK-26250 > Project: Spark > Issue Type: Test > Components: Examples >Affects Versions: 2.4.0 >Reporter: Jean Pierre PIN >Priority: Major > > I get an error=2 running spark-submit examples/src/main/r/dataframe.R > the script is working with Rstudio but i've changed the library(SparkR) line > with this one > library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) > i am at the top root directory of spark installation and the path variable > for /bin is specified in the environment so spark-submit is found. On system > window 7 Ultimate 64bits > read "main" java.io.IOException: Cannot run program "Rscript": CreateProcess > error=2, The system cannot find the file specified > I think the issue is known for a long but i don't find any post. > Thanks for answer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26250) Fail to run dataframe.R examples
[ https://issues.apache.org/jira/browse/SPARK-26250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26250: - Target Version/s: (was: 2.4.0) > Fail to run dataframe.R examples > > > Key: SPARK-26250 > URL: https://issues.apache.org/jira/browse/SPARK-26250 > Project: Spark > Issue Type: Test > Components: Examples >Affects Versions: 2.4.0 >Reporter: Jean Pierre PIN >Priority: Major > > I get an error=2 running spark-submit examples/src/main/r/dataframe.R > the script is working with Rstudio but i've changed the library(SparkR) line > with this one > library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) > i am at the top root directory of spark installation and the path variable > for /bin is specified in the environment so spark-submit is found. On system > window 7 Ultimate 64bits > read "main" java.io.IOException: Cannot run program "Rscript": CreateProcess > error=2, The system cannot find the file specified > I think the issue is known for a long but i don't find any post. > Thanks for answer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26261) Spark does not check completeness temporary file
Jialin LIu created SPARK-26261: -- Summary: Spark does not check completeness temporary file Key: SPARK-26261 URL: https://issues.apache.org/jira/browse/SPARK-26261 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.2 Reporter: Jialin LIu Spark does not check temporary files' completeness. When persisting to disk is enabled on some RDDs, a number of temporary files are created in the blockmgr folder. The block manager can detect missing blocks, but it cannot detect file content being modified during execution. Our initial test shows that if we truncate a block file before it is used by executors, the program finishes without detecting any error, but the result content is totally wrong. We believe there should be a checksum on every RDD block file, and these files should be protected by it.
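The proposal can be illustrated with a toy block format that appends a SHA-256 digest to each block file; a truncated file then fails verification on read instead of silently producing wrong results. This is an illustrative sketch, not Spark's actual block layout:

```python
import hashlib
import os
import tempfile

def write_block(path, payload):
    """Store a block with a trailing 32-byte SHA-256 checksum (toy layout)."""
    digest = hashlib.sha256(payload).digest()
    with open(path, "wb") as f:
        f.write(payload)
        f.write(digest)

def read_block(path):
    """Re-verify the checksum on read; raise if the file was modified or truncated."""
    with open(path, "rb") as f:
        data = f.read()
    payload, digest = data[:-32], data[-32:]
    if hashlib.sha256(payload).digest() != digest:
        raise IOError("block file corrupted or truncated")
    return payload

path = os.path.join(tempfile.mkdtemp(), "block")
write_block(path, b"rdd partition bytes")
assert read_block(path) == b"rdd partition bytes"

# Simulate the reporter's experiment: truncate the block file mid-content.
with open(path, "r+b") as f:
    f.truncate(10)
try:
    read_block(path)
except IOError:
    print("corruption detected")  # with a checksum, the damage is caught
```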
[jira] [Resolved] (SPARK-26149) Read UTF8String from Parquet/ORC may be incorrect
[ https://issues.apache.org/jira/browse/SPARK-26149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-26149. - Resolution: Not A Problem > Read UTF8String from Parquet/ORC may be incorrect > - > > Key: SPARK-26149 > URL: https://issues.apache.org/jira/browse/SPARK-26149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: SPARK-26149.snappy.parquet, > image-2018-12-04-10-55-49-369.png > > > How to reproduce: > {code:bash} > scala> > spark.read.parquet("/Users/yumwang/SPARK-26149/SPARK-26149.snappy.parquet").selectExpr("s1 > = s2").show > +-+ > |(s1 = s2)| > +-+ > |false| > +-+ > scala> val first = > spark.read.parquet("/Users/yumwang/SPARK-26149/SPARK-26149.snappy.parquet").collect().head > first: org.apache.spark.sql.Row = > [a0750c1f13f0k5��F8j���b�Ro'4da96,a0750c1f13f0k5��F8j���b�Ro'4da96] > scala> println(first.getString(0).equals(first.getString(1))) > true > {code} > {code:sql} > hive> CREATE TABLE `tb1` (`s1` STRING, `s2` STRING) > > stored as parquet > > location "/Users/yumwang/SPARK-26149"; > OK > Time taken: 0.224 seconds > hive> select s1 = s2 from tb1; > OK > true > Time taken: 0.167 seconds, Fetched: 1 row(s) > {code} > As you can see, only UTF8String returns {{false}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26260) Summary Task Metrics for Stage Page: Efficient implimentation for SHS when using disk store.
[ https://issues.apache.org/jira/browse/SPARK-26260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid updated SPARK-26260: --- Description: Currently, tasks summary metrics is calculated based on all the tasks, instead of successful tasks. After the JIRA, https://issues.apache.org/jira/browse/SPARK-26119, when using InMemory store, it find task summary metrics for all the successful tasks metrics. But we need to find an efficient implementation for disk store case for SHS. The main bottle neck for disk store is deserialization time overhead. Hints: Need to rework on the way indexing works, so that we can index by specific metrics for successful and failed tasks differently (would be tricky). Also would require changing the disk store version (to invalidate old stores). OR any other efficient solutions. was: Currently, tasks summary metrics is calculated based on all the tasks, instead of successful tasks. After the JIRA, https://issues.apache.org/jira/browse/SPARK-26119, when using InMemory store, it find task summary metrics for all the successful tasks metrics. But we need to find an efficient implementation for disk store case for SHS. The main bottle neck for disk store is deserialization time overhead. Hints: Need to rework on the way indexing works, so that we can index by specific metrics for successful and failed tasks differently (would be tricky). Also would require changing the disk store version (to invalidate old stores). > Summary Task Metrics for Stage Page: Efficient implimentation for SHS when > using disk store. > > > Key: SPARK-26260 > URL: https://issues.apache.org/jira/browse/SPARK-26260 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: shahid >Priority: Major > > Currently, tasks summary metrics is calculated based on all the tasks, > instead of successful tasks. 
> After the JIRA, https://issues.apache.org/jira/browse/SPARK-26119, when using > InMemory store, it find task summary metrics for all the successful tasks > metrics. But we need to find an efficient implementation for disk store case > for SHS. The main bottle neck for disk store is deserialization time overhead. > Hints: Need to rework on the way indexing works, so that we can index by > specific metrics for successful and failed tasks differently (would be > tricky). Also would require changing the disk store version (to invalidate > old stores). > OR any other efficient solutions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25498) Fix SQLQueryTestSuite failures when the interpreter mode enabled
[ https://issues.apache.org/jira/browse/SPARK-25498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708027#comment-16708027 ] Apache Spark commented on SPARK-25498: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/23212 > Fix SQLQueryTestSuite failures when the interpreter mode enabled > > > Key: SPARK-25498 > URL: https://issues.apache.org/jira/browse/SPARK-25498 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26260) Summary Task Metrics for Stage Page: Efficient implimentation for SHS when using disk store.
shahid created SPARK-26260: -- Summary: Summary Task Metrics for Stage Page: Efficient implimentation for SHS when using disk store. Key: SPARK-26260 URL: https://issues.apache.org/jira/browse/SPARK-26260 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0, 3.0.0 Reporter: shahid Currently, the task summary metrics are calculated based on all tasks instead of only successful tasks. After SPARK-26119 (https://issues.apache.org/jira/browse/SPARK-26119), the InMemory store computes the task summary metrics over successful tasks only, but we still need an efficient implementation for the disk store case in the SHS. The main bottleneck for the disk store is deserialization overhead. Hints: rework the way indexing works so that we can index by specific metrics for successful and failed tasks separately (would be tricky). This would also require changing the disk store version (to invalidate old stores).
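As a toy illustration of the hinted approach (not the actual KVStore API), keeping a sorted per-metric index restricted to successful tasks lets a quantile lookup touch a single indexed entry instead of deserializing every task record:

```python
import bisect

# Hypothetical in-memory stand-in for a disk-store index: one sorted list of
# metric values, maintained only for successful tasks, so summary quantiles
# never scan or deserialize failed-task entries.
class MetricIndex:
    def __init__(self):
        self._values = []  # kept sorted, as an on-disk index would be

    def add(self, value):
        bisect.insort(self._values, value)

    def quantile(self, q):
        """Read a single indexed entry instead of deserializing every task."""
        idx = min(int(q * len(self._values)), len(self._values) - 1)
        return self._values[idx]

index = MetricIndex()
tasks = [(10, True), (250, False), (20, True), (30, True), (40, True)]
for duration, succeeded in tasks:
    if succeeded:  # index only successful tasks, per the hint above
        index.add(duration)

print(index.quantile(0.5))  # 30: summary over successful tasks only
```

The tricky parts the hint alludes to (one index per metric, separate success/failure indexing, and bumping the store version so stale indexes are rebuilt) are outside this sketch.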
[jira] [Assigned] (SPARK-26083) Pyspark command is not working properly with default Docker Image build
[ https://issues.apache.org/jira/browse/SPARK-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-26083: -- Assignee: Qi Shao > Pyspark command is not working properly with default Docker Image build > --- > > Key: SPARK-26083 > URL: https://issues.apache.org/jira/browse/SPARK-26083 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Qi Shao >Assignee: Qi Shao >Priority: Minor > Labels: easyfix, newbie, patch, pull-request-available > Fix For: 3.0.0 > > > When I try to run > {code:java} > ./bin/pyspark{code} > in a pod in Kubernetes(image built without change from pyspark Dockerfile), > I'm getting an error: > {code:java} > $SPARK_HOME/bin/pyspark --deploy-mode client --master > k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... > Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type > "help", "copyright", "credits" or "license" for more information. > Could not open PYTHONSTARTUP > IOError: [Errno 2] No such file or directory: > '/opt/spark/python/pyspark/shell.py'{code} > This is because {{pyspark}} folder doesn't exist under {{/opt/spark/python/}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707997#comment-16707997 ] Stavros Kontopoulos edited comment on SPARK-26247 at 12/3/18 11:59 PM: --- Hi [~aholler], Yes I agree there are representation issues, on the other hand a spark specific model format locks you in. I dont know if PMML is a lost cause but at least you can get it running also on limited devices that dont run Spark. Also I understand that if the focus is on improving the internal format then makes sense. was (Author: skonto): Hi [~aholler], Yes I agree there are representation issues, on the other hand a spark specific model format locks you in. I dont know if PMML is a lost cause but at least you can get it running also on limited devices that dont run Spark. > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > Attachments: SPIPMlModelExtensionForOnlineServing.pdf > > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707997#comment-16707997 ] Stavros Kontopoulos commented on SPARK-26247: - Hi [~aholler], Yes I agree there are representation issues, on the other hand a spark specific model format locks you in. I dont know if PMML is a lost cause but at least you can get it running also on limited devices that dont run Spark. > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > Attachments: SPIPMlModelExtensionForOnlineServing.pdf > > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26083) Pyspark command is not working properly with default Docker Image build
[ https://issues.apache.org/jira/browse/SPARK-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26083. Resolution: Fixed Fix Version/s: (was: 2.4.1) 3.0.0 Issue resolved by pull request 23037 [https://github.com/apache/spark/pull/23037] > Pyspark command is not working properly with default Docker Image build > --- > > Key: SPARK-26083 > URL: https://issues.apache.org/jira/browse/SPARK-26083 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Qi Shao >Priority: Minor > Labels: easyfix, newbie, patch, pull-request-available > Fix For: 3.0.0 > > > When I try to run > {code:java} > ./bin/pyspark{code} > in a pod in Kubernetes(image built without change from pyspark Dockerfile), > I'm getting an error: > {code:java} > $SPARK_HOME/bin/pyspark --deploy-mode client --master > k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... > Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type > "help", "copyright", "credits" or "license" for more information. > Could not open PYTHONSTARTUP > IOError: [Errno 2] No such file or directory: > '/opt/spark/python/pyspark/shell.py'{code} > This is because {{pyspark}} folder doesn't exist under {{/opt/spark/python/}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26219) Executor summary is not getting updated for failure jobs in history server UI
[ https://issues.apache.org/jira/browse/SPARK-26219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-26219: --- Fix Version/s: 2.4.1 > Executor summary is not getting updated for failure jobs in history server UI > - > > Key: SPARK-26219 > URL: https://issues.apache.org/jira/browse/SPARK-26219 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: shahid >Assignee: shahid >Priority: Major > Fix For: 2.4.1, 3.0.0 > > Attachments: Screenshot from 2018-11-29 22-13-34.png, Screenshot from > 2018-11-29 22-13-44.png > > > Test step to reproduce: > {code:java} > bin/spark-shell --master yarn --conf spark.executor.instances=3 > sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad > executor")}.collect() > {code} > 1)Open the application from History UI > 2) Go to the executor tab > From History UI: > !Screenshot from 2018-11-29 22-13-34.png! > From Live UI: > !Screenshot from 2018-11-29 22-13-44.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26239) Add configurable auth secret source in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707962#comment-16707962 ] Matt Cheah commented on SPARK-26239: It could work in client mode but is less useful there overall because the user has to determine how to get ahold of that secret file. Nevertheless for cluster mode users that have secret file mounting systems for the driver and executors, it would be a great start. I can start building the code for this. > Add configurable auth secret source in k8s backend > -- > > Key: SPARK-26239 > URL: https://issues.apache.org/jira/browse/SPARK-26239 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Major > > This is a follow up to SPARK-26194, which aims to add auto-generated secrets > similar to the YARN backend. > There's a desire to support different ways to generate and propagate these > auth secrets (e.g. using things like Vault). Need to investigate: > - exposing configuration to support that > - changing SecurityManager so that it can delegate some of the > secret-handling logic to custom implementations > - figuring out whether this can also be used in client-mode, where the driver > is not created by the k8s backend in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26256) Add proper labels when deleting pods
[ https://issues.apache.org/jira/browse/SPARK-26256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-26256: --- Fix Version/s: 2.4.1 > Add proper labels when deleting pods > > > Key: SPARK-26256 > URL: https://issues.apache.org/jira/browse/SPARK-26256 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Major > Fix For: 2.4.1, 3.0.0 > > > As discussed here: > [https://github.com/apache/spark/pull/23136#discussion_r236463330] > we need to add proper labels to avoid killing executors belonging to other > jobs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26256) Add proper labels when deleting pods
[ https://issues.apache.org/jira/browse/SPARK-26256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-26256: -- Assignee: Stavros Kontopoulos > Add proper labels when deleting pods > > > Key: SPARK-26256 > URL: https://issues.apache.org/jira/browse/SPARK-26256 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Major > Fix For: 3.0.0 > > > As discussed here: > [https://github.com/apache/spark/pull/23136#discussion_r236463330] > we need to add proper labels to avoid killing executors belonging to other > jobs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26256) Add proper labels when deleting pods
[ https://issues.apache.org/jira/browse/SPARK-26256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26256. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23209 [https://github.com/apache/spark/pull/23209] > Add proper labels when deleting pods > > > Key: SPARK-26256 > URL: https://issues.apache.org/jira/browse/SPARK-26256 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Major > Fix For: 3.0.0 > > > As discussed here: > [https://github.com/apache/spark/pull/23136#discussion_r236463330] > we need to add proper labels to avoid killing executors belonging to other > jobs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707907#comment-16707907 ] Anne Holler commented on SPARK-26247: - Hi, [~skonto], My basic take on model representation is that any representation that is not the same format that the spark mllib code produces for training and consumes for serving basically introduces additional maintenance toil and potential risk of model serving mismatch. In that sense, spark mllib format is a de facto standard. Unless PMML were to completely replace spark mllib representation as the first class citizen model representation in spark (which doesn't seem to have clear switchover ROI), the team I am on would not choose to move to it, because we do not want to take the risk that the model trained and evaluated wrt spark mllib native representation has some difference when served in batch or online mode from PMML representation. Best regards, Anne > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > Attachments: SPIPMlModelExtensionForOnlineServing.pdf > > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan
[ https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707901#comment-16707901 ] Apache Spark commented on SPARK-19712: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/23211 > EXISTS and Left Semi join do not produce the same plan > -- > > Key: SPARK-19712 > URL: https://issues.apache.org/jira/browse/SPARK-19712 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nattavut Sutyanyong >Priority: Major > > This problem was found during the development of SPARK-18874. > The EXISTS form in the following query: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 > from t3 where t1.t1b=t3.t3b)")}} > gives the optimized plan below: > {code} > == Optimized Logical Plan == > Join Inner, (t1a#7 = t2a#25) > :- Join LeftSemi, (t1b#8 = t3b#58) > : :- Filter isnotnull(t1a#7) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Project [1 AS 1#271, t3b#58] > : +- Relation[t3a#57,t3b#58,t3c#59] parquet > +- Filter isnotnull(t2a#25) >+- Relation[t2a#25,t2b#26,t2c#27] parquet > {code} > whereas a semantically equivalent Left Semi join query below: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on > t1.t1b=t3.t3b")}} > gives the following optimized plan: > {code} > == Optimized Logical Plan == > Join LeftSemi, (t1b#8 = t3b#58) > :- Join Inner, (t1a#7 = t2a#25) > : :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7)) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Filter isnotnull(t2a#25) > : +- Relation[t2a#25,t2b#26,t2c#27] parquet > +- Project [t3b#58] >+- Relation[t3a#57,t3b#58,t3c#59] parquet > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema
PoojaMurarka created SPARK-26259: Summary: RecordSeparator other than newline discovers incorrect schema Key: SPARK-26259 URL: https://issues.apache.org/jira/browse/SPARK-26259 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: PoojaMurarka Fix For: 2.4.1 JIRA https://issues.apache.org/jira/browse/SPARK-21289, fixed in Spark 2.3, allows record separators other than newline, but this doesn't work when the schema is not specified, i.e. while inferring the schema. Let me try to explain this using the data and scenarios below. Input data (input_data.csv) as shown below, *+where the record separator is "\t"+*: {noformat} "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" "2012-01-01","0","0","0","0","1","9","9.1","66","0" "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat} *Case 1: Schema defined:* The Spark code below with a defined *schema* reads the data correctly: {code:java} val customSchema = StructType(Array( StructField("dteday", DateType, true), StructField("hr", IntegerType, true), StructField("holiday", IntegerType, true), StructField("weekday", IntegerType, true), StructField("workingday", DateType, true), StructField("weathersit", IntegerType, true), StructField("temp", IntegerType, true), StructField("atemp", DoubleType, true), StructField("hum", IntegerType, true), StructField("windspeed", IntegerType, true))); Dataset<Row> ds = executionContext.getSparkSession().read().format( "csv" ) .option( "header", true ) .option( "schema", customSchema) .option( "sep", "," ) .load( "input_data.csv" ); {code} *Case 2: Schema not defined (inferSchema is used):* Incorrect data parsing is done, i.e. the entire data is read as column names. 
{code:java} Dataset<Row> ds = executionContext.getSparkSession().read().format( "csv" ) .option( "header", true ) .option( "inferSchema", true) .option( "sep", "," ) .load( "input_data.csv" ); {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
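A minimal non-Spark sketch (plain Python, illustrative data only) of why inference breaks here: a parser that is told the record separator is "\t" recovers the header and the rows, while a parser that assumes newline-separated records sees the whole file as a single line, which is effectively what schema inference does in Case 2.

```python
import csv
import io

# Two-column excerpt of the report's data, with "\t" as the record separator.
raw = '"dteday","hr"\t"2012-01-01","0"\t"2012-01-01","1"'

# Splitting on the declared record separator first parses correctly:
records = [next(csv.reader(io.StringIO(chunk))) for chunk in raw.split("\t")]
header, rows = records[0], records[1:]

# A newline-assuming parser sees one "row" spanning the entire file,
# so every value ends up inside the supposed header.
naive = next(csv.reader(io.StringIO(raw)))
```

With the separator honored, `header` is `["dteday", "hr"]` and `rows` holds the two data records; the naive parse instead yields a single four-field row mixing header and data.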
[jira] [Commented] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707856#comment-16707856 ] Stavros Kontopoulos commented on SPARK-26247: - Some pipelines export models in PMML (I know there is a long history about it, licensing, etc.), but wouldn't it be better to standardize on something? This would minimize a few of the issues mentioned in this quote: "Recreating and reimplementing model reading outside of the Spark project is costly and error prone, and over time is hard to keep in-sync with Spark changes over various Spark releases" > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > Attachments: SPIPMlModelExtensionForOnlineServing.pdf > > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707815#comment-16707815 ] Stavros Kontopoulos commented on SPARK-26247: - Thanks a lot! It would be better if you made it accessible, so we can add comments. > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > Attachments: SPIPMlModelExtensionForOnlineServing.pdf > > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707812#comment-16707812 ] Anne Holler commented on SPARK-26247: - Hi, [~skonto], I just attached a pdf of the document to this issue. [I am a google doc doofus, apparently, not sure why my attempt to publish the google doc for public access wasn't successful.] > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > Attachments: SPIPMlModelExtensionForOnlineServing.pdf > > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anne Holler updated SPARK-26247: Attachment: SPIPMlModelExtensionForOnlineServing.pdf > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > Attachments: SPIPMlModelExtensionForOnlineServing.pdf > > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707798#comment-16707798 ] Stavros Kontopoulos commented on SPARK-26247: - [~aholler] could you provide public access to the doc? > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707738#comment-16707738 ] Dongjoon Hyun commented on SPARK-26233: --- Thanks. I raised the priority because this is a correctness issue. > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Blocker > Labels: correctness > > Decimal values from Java beans are incorrectly scaled when used with > functions like first/last/max... > This problem came because Encoders.bean always set Decimal values as > _DecimalType(this.MAX_PRECISION(), 18)._ > Usually it's not a problem if you use numeric functions like *sum* but for > functions like *first*/*last*/*max*... it is a problem. > How to reproduce this error: > Using this class as an example: > {code:java} > public class Foo implements Serializable { > private String group; > private BigDecimal var; > public BigDecimal getVar() { > return var; > } > public void setVar(BigDecimal var) { > this.var = var; > } > public String getGroup() { > return group; > } > public void setGroup(String group) { > this.group = group; > } > } > {code} > > And a dummy code to create some objects: > {code:java} > Dataset ds = spark.range(5) > .map(l -> { > Foo foo = new Foo(); > foo.setGroup("" + l); > foo.setVar(BigDecimal.valueOf(l + 0.)); > return foo; > }, Encoders.bean(Foo.class)); > ds.printSchema(); > ds.show(); > +-+--+ > |group| var| > +-+--+ > | 0|0.| > | 1|1.| > | 2|2.| > | 3|3.| > | 4|4.| > +-+--+ > {code} > We can see that the DecimalType is precision 38 and 18 scale and all values > are show correctly. 
> But if we use a first function, they are scaled incorrectly: > {code:java} > ds.groupBy(col("group")) > .agg( > first("var") > ) > .show(); > +-+-+ > |group|first(var, false)| > +-+-+ > | 3| 3.E-14| > | 0| 1.111E-15| > | 1| 1.E-14| > | 4| 4.E-14| > | 2| 2.E-14| > +-+-+ > {code} > This incorrect behavior cannot be reproduced if we use "numerical "functions > like sum or if the column is cast a new Decimal Type. > {code:java} > ds.groupBy(col("group")) > .agg( > sum("var") > ) > .show(); > +-++ > |group| sum(var)| > +-++ > | 3|3.00| > | 0|0.00| > | 1|1.00| > | 4|4.00| > | 2|2.00| > +-++ > ds.groupBy(col("group")) > .agg( > first(col("var").cast(new DecimalType(38, 8))) > ) > .show(); > +-++ > |group|first(CAST(var AS DECIMAL(38,8)), false)| > +-++ > | 3| 3.| > | 0| 0.| > | 1| 1.| > | 4| 4.| > | 2| 2.| > +-++ > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
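The shape of the corruption above is consistent with a BigDecimal's unscaled integer being reinterpreted at the fixed scale of 18 that Encoders.bean assigns. A minimal pure-Python sketch (not Spark code; the exact bean values are truncated in this archive, so 0.1111 is assumed for illustration):

```python
from decimal import Decimal

def from_unscaled(unscaled: int, scale: int) -> Decimal:
    # A BigDecimal is a pair (unscaled integer, scale): value = unscaled * 10**-scale.
    return Decimal(unscaled).scaleb(-scale)

# An assumed bean value of 0.1111 has unscaled value 1111 at scale 4.
unscaled = 1111
correct = from_unscaled(unscaled, 4)   # 0.1111

# Reinterpreting the same unscaled integer at the fixed scale 18 shifts it
# fourteen orders of magnitude, matching the 1.111E-15-style results that
# first()/last()/max() produce in the report.
wrong = from_unscaled(unscaled, 18)    # 1.111E-15
```

This also suggests why sum() and an explicit cast look fine: both go through paths that respect (or rewrite) the column's declared scale instead of trusting the mismatched one.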
[jira] [Updated] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26233: -- Component/s: (was: Java API) SQL > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Blocker > Labels: correctness > > Decimal values from Java beans are incorrectly scaled when used with > functions like first/last/max... > This problem came because Encoders.bean always set Decimal values as > _DecimalType(this.MAX_PRECISION(), 18)._ > Usually it's not a problem if you use numeric functions like *sum* but for > functions like *first*/*last*/*max*... it is a problem. > How to reproduce this error: > Using this class as an example: > {code:java} > public class Foo implements Serializable { > private String group; > private BigDecimal var; > public BigDecimal getVar() { > return var; > } > public void setVar(BigDecimal var) { > this.var = var; > } > public String getGroup() { > return group; > } > public void setGroup(String group) { > this.group = group; > } > } > {code} > > And a dummy code to create some objects: > {code:java} > Dataset ds = spark.range(5) > .map(l -> { > Foo foo = new Foo(); > foo.setGroup("" + l); > foo.setVar(BigDecimal.valueOf(l + 0.)); > return foo; > }, Encoders.bean(Foo.class)); > ds.printSchema(); > ds.show(); > +-+--+ > |group| var| > +-+--+ > | 0|0.| > | 1|1.| > | 2|2.| > | 3|3.| > | 4|4.| > +-+--+ > {code} > We can see that the DecimalType is precision 38 and 18 scale and all values > are show correctly. 
> But if we use a first function, they are scaled incorrectly: > {code:java} > ds.groupBy(col("group")) > .agg( > first("var") > ) > .show(); > +-+-+ > |group|first(var, false)| > +-+-+ > | 3| 3.E-14| > | 0| 1.111E-15| > | 1| 1.E-14| > | 4| 4.E-14| > | 2| 2.E-14| > +-+-+ > {code} > This incorrect behavior cannot be reproduced if we use "numerical "functions > like sum or if the column is cast a new Decimal Type. > {code:java} > ds.groupBy(col("group")) > .agg( > sum("var") > ) > .show(); > +-++ > |group| sum(var)| > +-++ > | 3|3.00| > | 0|0.00| > | 1|1.00| > | 4|4.00| > | 2|2.00| > +-++ > ds.groupBy(col("group")) > .agg( > first(col("var").cast(new DecimalType(38, 8))) > ) > .show(); > +-++ > |group|first(CAST(var AS DECIMAL(38,8)), false)| > +-++ > | 3| 3.| > | 0| 0.| > | 1| 1.| > | 4| 4.| > | 2| 2.| > +-++ > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707729#comment-16707729 ] Marco Gaido commented on SPARK-26233: - [~dongjoon] I think so. SPARK-24957 was a long standing issue too, and the problem is analogous... > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Minor > > Decimal values from Java beans are incorrectly scaled when used with > functions like first/last/max... > This problem came because Encoders.bean always set Decimal values as > _DecimalType(this.MAX_PRECISION(), 18)._ > Usually it's not a problem if you use numeric functions like *sum* but for > functions like *first*/*last*/*max*... it is a problem. > How to reproduce this error: > Using this class as an example: > {code:java} > public class Foo implements Serializable { > private String group; > private BigDecimal var; > public BigDecimal getVar() { > return var; > } > public void setVar(BigDecimal var) { > this.var = var; > } > public String getGroup() { > return group; > } > public void setGroup(String group) { > this.group = group; > } > } > {code} > > And a dummy code to create some objects: > {code:java} > Dataset ds = spark.range(5) > .map(l -> { > Foo foo = new Foo(); > foo.setGroup("" + l); > foo.setVar(BigDecimal.valueOf(l + 0.)); > return foo; > }, Encoders.bean(Foo.class)); > ds.printSchema(); > ds.show(); > +-+--+ > |group| var| > +-+--+ > | 0|0.| > | 1|1.| > | 2|2.| > | 3|3.| > | 4|4.| > +-+--+ > {code} > We can see that the DecimalType is precision 38 and 18 scale and all values > are show correctly. 
> But if we use a first function, they are scaled incorrectly: > {code:java} > ds.groupBy(col("group")) > .agg( > first("var") > ) > .show(); > +-+-+ > |group|first(var, false)| > +-+-+ > | 3| 3.E-14| > | 0| 1.111E-15| > | 1| 1.E-14| > | 4| 4.E-14| > | 2| 2.E-14| > +-+-+ > {code} > This incorrect behavior cannot be reproduced if we use "numerical "functions > like sum or if the column is cast a new Decimal Type. > {code:java} > ds.groupBy(col("group")) > .agg( > sum("var") > ) > .show(); > +-++ > |group| sum(var)| > +-++ > | 3|3.00| > | 0|0.00| > | 1|1.00| > | 4|4.00| > | 2|2.00| > +-++ > ds.groupBy(col("group")) > .agg( > first(col("var").cast(new DecimalType(38, 8))) > ) > .show(); > +-++ > |group|first(CAST(var AS DECIMAL(38,8)), false)| > +-++ > | 3| 3.| > | 0| 0.| > | 1| 1.| > | 4| 4.| > | 2| 2.| > +-++ > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26258) Include JobGroupId in the callerContext
[ https://issues.apache.org/jira/browse/SPARK-26258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aihua Xu updated SPARK-26258: - Description: SPARK-15857 adds the support of callerContext for HDFS and Yarn. Currently, Spark callerContext prints Job Id, Stage Id, Task Id in the callerContext. It would be useful to include JobGroup Id as well since such JobGroup Id could be meaningful from the Spark Client point of view to group the jobs by JobGroup. callerContext=SPARK_CLIENT_application_1541467098185_0008 callerContext=SPARK_TASK_application_1541702365678_0005_1_JId_1_SId_3_0_TId_3_0 was: Currently, Spark callerContext prints Job Id, Stage Id, Task Id in the callerContext. It would be useful to include JobGroup Id as well since such JobGroup Id could be meaningful from the Spark Client point of view to group the jobs by JobGroup. callerContext=SPARK_CLIENT_application_1541467098185_0008 callerContext=SPARK_TASK_application_1541702365678_0005_1_JId_1_SId_3_0_TId_3_0 > Include JobGroupId in the callerContext > > > Key: SPARK-26258 > URL: https://issues.apache.org/jira/browse/SPARK-26258 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Aihua Xu >Priority: Major > > SPARK-15857 adds the support of callerContext for HDFS and Yarn. Currently, > Spark callerContext prints Job Id, Stage Id, Task Id in the callerContext. It > would be useful to include JobGroup Id as well since such JobGroup Id could > be meaningful from the Spark Client point of view to group the jobs by > JobGroup. > callerContext=SPARK_CLIENT_application_1541467098185_0008 > callerContext=SPARK_TASK_application_1541702365678_0005_1_JId_1_SId_3_0_TId_3_0 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
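A sketch of how an extended callerContext string carrying a JobGroup id might be composed, following the `SPARK_TASK_..._JId_..._SId_..._TId_...` layout of the examples above. The `JGId` segment and the function itself are hypothetical illustrations, not the actual Spark API:

```python
from typing import Optional

def build_caller_context(app_id: str, attempt: int, job_id: int,
                         job_group: Optional[str], stage_id: int,
                         stage_attempt: int, task_id: int,
                         task_attempt: int) -> str:
    # Mirrors the layout shown in the description; a hypothetical
    # _JGId_ segment carries the job group when one is set.
    parts = [f"SPARK_TASK_{app_id}_{attempt}"]
    if job_group is not None:
        parts.append(f"JGId_{job_group}")
    parts.append(f"JId_{job_id}")
    parts.append(f"SId_{stage_id}_{stage_attempt}")
    parts.append(f"TId_{task_id}_{task_attempt}")
    return "_".join(parts)
```

Without a job group this reproduces the existing format exactly; with one, consumers such as HDFS audit logs could group entries by the client-assigned JobGroup.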
[jira] [Updated] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26233: -- Priority: Blocker (was: Minor) > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Blocker > Labels: correctness > > Decimal values from Java beans are incorrectly scaled when used with > functions like first/last/max... > This problem came because Encoders.bean always set Decimal values as > _DecimalType(this.MAX_PRECISION(), 18)._ > Usually it's not a problem if you use numeric functions like *sum* but for > functions like *first*/*last*/*max*... it is a problem. > How to reproduce this error: > Using this class as an example: > {code:java} > public class Foo implements Serializable { > private String group; > private BigDecimal var; > public BigDecimal getVar() { > return var; > } > public void setVar(BigDecimal var) { > this.var = var; > } > public String getGroup() { > return group; > } > public void setGroup(String group) { > this.group = group; > } > } > {code} > > And a dummy code to create some objects: > {code:java} > Dataset ds = spark.range(5) > .map(l -> { > Foo foo = new Foo(); > foo.setGroup("" + l); > foo.setVar(BigDecimal.valueOf(l + 0.)); > return foo; > }, Encoders.bean(Foo.class)); > ds.printSchema(); > ds.show(); > +-+--+ > |group| var| > +-+--+ > | 0|0.| > | 1|1.| > | 2|2.| > | 3|3.| > | 4|4.| > +-+--+ > {code} > We can see that the DecimalType is precision 38 and 18 scale and all values > are show correctly. 
> But if we use a first function, they are scaled incorrectly: > {code:java} > ds.groupBy(col("group")) > .agg( > first("var") > ) > .show(); > +-+-+ > |group|first(var, false)| > +-+-+ > | 3| 3.E-14| > | 0| 1.111E-15| > | 1| 1.E-14| > | 4| 4.E-14| > | 2| 2.E-14| > +-+-+ > {code} > This incorrect behavior cannot be reproduced if we use "numerical "functions > like sum or if the column is cast a new Decimal Type. > {code:java} > ds.groupBy(col("group")) > .agg( > sum("var") > ) > .show(); > +-++ > |group| sum(var)| > +-++ > | 3|3.00| > | 0|0.00| > | 1|1.00| > | 4|4.00| > | 2|2.00| > +-++ > ds.groupBy(col("group")) > .agg( > first(col("var").cast(new DecimalType(38, 8))) > ) > .show(); > +-++ > |group|first(CAST(var AS DECIMAL(38,8)), false)| > +-++ > | 3| 3.| > | 0| 0.| > | 1| 1.| > | 4| 4.| > | 2| 2.| > +-++ > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26233: -- Labels: correctness (was: ) > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Minor > Labels: correctness > > Decimal values from Java beans are incorrectly scaled when used with > functions like first/last/max... > This problem came because Encoders.bean always set Decimal values as > _DecimalType(this.MAX_PRECISION(), 18)._ > Usually it's not a problem if you use numeric functions like *sum* but for > functions like *first*/*last*/*max*... it is a problem. > How to reproduce this error: > Using this class as an example: > {code:java} > public class Foo implements Serializable { > private String group; > private BigDecimal var; > public BigDecimal getVar() { > return var; > } > public void setVar(BigDecimal var) { > this.var = var; > } > public String getGroup() { > return group; > } > public void setGroup(String group) { > this.group = group; > } > } > {code} > > And a dummy code to create some objects: > {code:java} > Dataset ds = spark.range(5) > .map(l -> { > Foo foo = new Foo(); > foo.setGroup("" + l); > foo.setVar(BigDecimal.valueOf(l + 0.)); > return foo; > }, Encoders.bean(Foo.class)); > ds.printSchema(); > ds.show(); > +-+--+ > |group| var| > +-+--+ > | 0|0.| > | 1|1.| > | 2|2.| > | 3|3.| > | 4|4.| > +-+--+ > {code} > We can see that the DecimalType is precision 38 and 18 scale and all values > are show correctly. 
> But if we use a first function, they are scaled incorrectly: > {code:java} > ds.groupBy(col("group")) > .agg( > first("var") > ) > .show(); > +-+-+ > |group|first(var, false)| > +-+-+ > | 3| 3.E-14| > | 0| 1.111E-15| > | 1| 1.E-14| > | 4| 4.E-14| > | 2| 2.E-14| > +-+-+ > {code} > This incorrect behavior cannot be reproduced if we use "numerical "functions > like sum or if the column is cast a new Decimal Type. > {code:java} > ds.groupBy(col("group")) > .agg( > sum("var") > ) > .show(); > +-++ > |group| sum(var)| > +-++ > | 3|3.00| > | 0|0.00| > | 1|1.00| > | 4|4.00| > | 2|2.00| > +-++ > ds.groupBy(col("group")) > .agg( > first(col("var").cast(new DecimalType(38, 8))) > ) > .show(); > +-++ > |group|first(CAST(var AS DECIMAL(38,8)), false)| > +-++ > | 3| 3.| > | 0| 0.| > | 1| 1.| > | 4| 4.| > | 2| 2.| > +-++ > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707717#comment-16707717 ] Dongjoon Hyun commented on SPARK-26233: --- Hi, [~mcanes] and [~mgaido]. This seems to exist for a long time, right? > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Minor > > Decimal values from Java beans are incorrectly scaled when used with > functions like first/last/max... > This problem came because Encoders.bean always set Decimal values as > _DecimalType(this.MAX_PRECISION(), 18)._ > Usually it's not a problem if you use numeric functions like *sum* but for > functions like *first*/*last*/*max*... it is a problem. > How to reproduce this error: > Using this class as an example: > {code:java} > public class Foo implements Serializable { > private String group; > private BigDecimal var; > public BigDecimal getVar() { > return var; > } > public void setVar(BigDecimal var) { > this.var = var; > } > public String getGroup() { > return group; > } > public void setGroup(String group) { > this.group = group; > } > } > {code} > > And a dummy code to create some objects: > {code:java} > Dataset ds = spark.range(5) > .map(l -> { > Foo foo = new Foo(); > foo.setGroup("" + l); > foo.setVar(BigDecimal.valueOf(l + 0.)); > return foo; > }, Encoders.bean(Foo.class)); > ds.printSchema(); > ds.show(); > +-+--+ > |group| var| > +-+--+ > | 0|0.| > | 1|1.| > | 2|2.| > | 3|3.| > | 4|4.| > +-+--+ > {code} > We can see that the DecimalType is precision 38 and 18 scale and all values > are show correctly. 
> But if we use a first function, they are scaled incorrectly: > {code:java} > ds.groupBy(col("group")) > .agg( > first("var") > ) > .show(); > +-+-+ > |group|first(var, false)| > +-+-+ > | 3| 3.E-14| > | 0| 1.111E-15| > | 1| 1.E-14| > | 4| 4.E-14| > | 2| 2.E-14| > +-+-+ > {code} > This incorrect behavior cannot be reproduced if we use "numerical "functions > like sum or if the column is cast a new Decimal Type. > {code:java} > ds.groupBy(col("group")) > .agg( > sum("var") > ) > .show(); > +-++ > |group| sum(var)| > +-++ > | 3|3.00| > | 0|0.00| > | 1|1.00| > | 4|4.00| > | 2|2.00| > +-++ > ds.groupBy(col("group")) > .agg( > first(col("var").cast(new DecimalType(38, 8))) > ) > .show(); > +-++ > |group|first(CAST(var AS DECIMAL(38,8)), false)| > +-++ > | 3| 3.| > | 0| 0.| > | 1| 1.| > | 4| 4.| > | 2| 2.| > +-++ > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26233: -- Affects Version/s: 2.0.2 > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Minor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26233: -- Affects Version/s: 2.1.3 > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Minor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26233: -- Affects Version/s: 2.2.2 > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Minor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26258) Include JobGroupId in the callerContext
Aihua Xu created SPARK-26258: Summary: Include JobGroupId in the callerContext Key: SPARK-26258 URL: https://issues.apache.org/jira/browse/SPARK-26258 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.0 Reporter: Aihua Xu Currently, Spark prints the Job Id, Stage Id, and Task Id in the callerContext. It would be useful to include the JobGroup Id as well, since it is meaningful from the Spark client's point of view for grouping jobs by JobGroup. callerContext=SPARK_CLIENT_application_1541467098185_0008 callerContext=SPARK_TASK_application_1541702365678_0005_1_JId_1_SId_3_0_TId_3_0 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
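A sketch of what the extended task context string might look like. The job-group field name ("JGId") and its position are assumptions; only the existing layout is taken from the example strings above:

```java
public class CallerContextSketch {
    // Builds a task callerContext; passing a non-null jobGroup appends the
    // proposed (hypothetical) JGId field, otherwise the current format is kept.
    static String taskContext(String appId, int attempt, String jobGroup,
                              int jobId, int stageId, int stageAttempt,
                              int taskId, int taskAttempt) {
        StringBuilder sb = new StringBuilder("SPARK_TASK_")
            .append(appId).append('_').append(attempt);
        if (jobGroup != null) {
            sb.append("_JGId_").append(jobGroup);   // proposed new field (name assumed)
        }
        sb.append("_JId_").append(jobId)
          .append("_SId_").append(stageId).append('_').append(stageAttempt)
          .append("_TId_").append(taskId).append('_').append(taskAttempt);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Without a job group this reproduces the example string from the issue:
        System.out.println(taskContext("application_1541702365678_0005", 1, null, 1, 3, 0, 3, 0));
        // prints SPARK_TASK_application_1541702365678_0005_1_JId_1_SId_3_0_TId_3_0
    }
}
```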
[jira] [Created] (SPARK-26257) SPIP: Interop Support for Spark Language Extensions
Tyson Condie created SPARK-26257: Summary: SPIP: Interop Support for Spark Language Extensions Key: SPARK-26257 URL: https://issues.apache.org/jira/browse/SPARK-26257 Project: Spark Issue Type: Improvement Components: PySpark, R, Spark Core Affects Versions: 2.4.0 Reporter: Tyson Condie h2. ** Background and Motivation: There is a desire for third party language extensions for Apache Spark. Some notable examples include: * C#/F# from project Mobius [https://github.com/Microsoft/Mobius] * Haskell from project sparkle [https://github.com/tweag/sparkle] * Julia from project Spark.jl [https://github.com/dfdx/Spark.jl] Presently, Apache Spark supports Python and R via a tightly integrated interop layer. It would seem that much of that existing interop layer could be refactored into a clean surface for general (third party) language bindings, such as the above mentioned. More specifically, could we generalize the following modules: * Deploy runners (e.g., PythonRunner and RRunner) * DataFrame Executors * RDD operations? The last being questionable: integrating third party language extensions at the RDD level may be too heavy-weight and unnecessary given the preference towards the Dataframe abstraction. The main goals of this effort would be: * Provide a clean abstraction for third party language extensions making it easier to maintain (the language extension) with the evolution of Apache Spark * Provide guidance to third party language authors on how a language extension should be implemented * Provide general reusable libraries that are not specific to any language extension * Open the door to developers that prefer alternative languages * Identify and clean up common code shared between Python and R interops h2. Target Personas: Data Scientists, Data Engineers, Library Developers h2. Goals: Data scientists and engineers will have the opportunity to work with Spark in languages other than what’s natively supported. 
Library developers will be able to create language extensions for Spark in a clean way. The interop layer should also provide guidance for developing language extensions. h2. Non-Goals: The proposal does not aim to create an actual language extension. Rather, it aims to provide a stable interop layer for third party language extensions to dock. h2. Proposed API Changes: Much of the work will involve generalizing existing interop APIs for PySpark and R, specifically for the Dataframe API. For instance, it would be good to have a general deploy.Runner (similar to PythonRunner) for language extension efforts. In Spark SQL, it would be good to have a general InteropUDF and evaluator (similar to BatchEvalPythonExec). Low-level RDD operations should not be needed in this initial offering; depending on the success of the interop layer and with proper demand, RDD interop could be added later. However, one open question is supporting a subset of low-level functions that are core to ETL e.g., transform. h2. Optional Design Sketch: The work would be broken down into two top-level phases: Phase 1: Introduce general interop API for deploying a driver/application, running an interop UDF along with any other low-level transformations that aid with ETL. Phase 2: Port existing Python and R language extensions to the new interop layer. This port should be contained solely to the Spark core side, and all protocols specific to Python and R should not change e.g., Python should continue to use py4j as the protocol between the Python process and core Spark. The port itself should be contained to a handful of files e.g., some examples for Python: PythonRunner, BatchEvalPythonExec, +PythonUDFRunner+, PythonRDD (possibly), and will mostly involve refactoring common logic into abstract implementations and utilities. h2.
Optional Rejected Designs: The clear alternative is the status quo; developers who want to provide a third-party language extension to Spark do so directly, often by extending existing Python classes and overriding the portions that are relevant to the new extension. Not only is this not sound code (e.g., a JuliaRDD is not a PythonRDD, which contains a lot of reusable code), but it runs the great risk of future revisions making the subclass implementation obsolete. It would be hard to imagine that any third-party language extension would be successful if there was not something in place to guarantee its long-term maintainability. Another alternative is that third-party languages should only interact with Spark via pure SQL, possibly via REST. However, this does not enable UDFs written in the third-party language; a key desideratum in this effort, which most notably takes the form of legacy code/UDFs that would need to be ported to a supported language e.g., Scala. This exercise is extremely cumbersome and not
[jira] [Commented] (SPARK-26254) Move delegation token providers into a separate project
[ https://issues.apache.org/jira/browse/SPARK-26254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707679#comment-16707679 ] Steve Loughran commented on SPARK-26254: +HBase I don't have any opinions on the best place; people who know the Spark packaging are the ones there. And people deploying to other infras than YARN will have their opinions too. Token loading can be fairly brittle to classpath problems (HADOOP-15808); it's good not to trust everything to be well-configured. > Move delegation token providers into a separate project > --- > > Key: SPARK-26254 > URL: https://issues.apache.org/jira/browse/SPARK-26254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > There was a discussion in > [PR#22598|https://github.com/apache/spark/pull/22598] that there are several > provided dependencies inside core project which shouldn't be there (for ex. > hive and kafka). This jira is to solve this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25515) Add a config property for disabling auto deletion of PODS for debugging.
[ https://issues.apache.org/jira/browse/SPARK-25515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yinan Li resolved SPARK-25515. -- Resolution: Fixed Fix Version/s: 3.0.0 > Add a config property for disabling auto deletion of PODS for debugging. > > > Key: SPARK-25515 > URL: https://issues.apache.org/jira/browse/SPARK-25515 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Prashant Sharma >Priority: Major > Fix For: 3.0.0 > > > Currently, if a pod fails to start due to some failure, it gets removed and > a new one is attempted. This sequence of events goes on until the app is > killed. Given the speed of creation and deletion, it becomes difficult to > debug the reason for failure. > So adding a configuration parameter to disable auto-deletion of pods will be > helpful for debugging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
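For readers hitting this today, a sketch of how such a switch would be used at submit time. The property name below is taken from the Spark 3.0-era change for this issue and should be treated as an assumption on other builds:

```shell
# Sketch: keep terminated/failed executor pods around for post-mortem debugging.
# Property name assumed from the SPARK-25515 fix (Spark 3.0).
spark-submit \
  --master k8s://https://<api-server-host>:6443 \
  --conf spark.kubernetes.executor.deleteOnTermination=false \
  ...
```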
[jira] [Comment Edited] (SPARK-26213) Custom Receiver for Structured streaming
[ https://issues.apache.org/jira/browse/SPARK-26213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705886#comment-16705886 ] Gabor Somogyi edited comment on SPARK-26213 at 12/3/18 4:27 PM: On the other hand looks like this is not a feature suggestion but a question to the community. In such cases you can write a mail to [d...@spark.apache.org|http://apache-spark-developers-list.1001551.n3.nabble.com/]. [~aarthipa] If you agree please close the jira. was (Author: gsomogyi): On the other hand looks like this is not a feature suggestion but a question to the community. In such cases you can write a mail to [d...@spark.apache.org|http://apache-spark-developers-list.1001551.n3.nabble.com/]. If you agree please close the jira. > Custom Receiver for Structured streaming > > > Key: SPARK-26213 > URL: https://issues.apache.org/jira/browse/SPARK-26213 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Aarthi >Priority: Major > > Hi, > I have implemented a Custom Receiver for a https/json data source by > implementing the Receiver abstract class as provided in the documentation > here [https://spark.apache.org/docs/latest//streaming-custom-receivers.html] > This approach works on Spark streaming context, where the custom receiver > class is passed to receiverStream. However, I would like to implement the > same for Structured streaming, as each of the DStreams has a complex > structure and needs to be joined with the others based on complex rules. > ([https://stackoverflow.com/questions/53449599/join-two-spark-dstreams-with-complex-nested-structure]) > Structured streaming uses the Spark Session object that takes in a > DataStreamReader, which is a final class. Please advise on how to implement > the custom receiver for Structured Streaming. 
> Thanks, -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26233: Assignee: Apache Spark > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.1, 2.4.0 >Reporter: Miquel >Assignee: Apache Spark >Priority: Minor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707465#comment-16707465 ] Apache Spark commented on SPARK-26233: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/23210 > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Minor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26233: Assignee: (was: Apache Spark) > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.1, 2.4.0 >Reporter: Miquel >Priority: Minor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25498) Fix SQLQueryTestSuite failures when the interpreter mode enabled
[ https://issues.apache.org/jira/browse/SPARK-25498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25498. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22512 [https://github.com/apache/spark/pull/22512] > Fix SQLQueryTestSuite failures when the interpreter mode enabled > > > Key: SPARK-25498 > URL: https://issues.apache.org/jira/browse/SPARK-25498 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25498) Fix SQLQueryTestSuite failures when the interpreter mode enabled
[ https://issues.apache.org/jira/browse/SPARK-25498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25498: --- Assignee: Takeshi Yamamuro > Fix SQLQueryTestSuite failures when the interpreter mode enabled > > > Key: SPARK-25498 > URL: https://issues.apache.org/jira/browse/SPARK-25498 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26235) Change log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit to Error
[ https://issues.apache.org/jira/browse/SPARK-26235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26235. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23189 [https://github.com/apache/spark/pull/23189] > Change log level for ClassNotFoundException/NoClassDefFoundError in > SparkSubmit to Error > > > Key: SPARK-26235 > URL: https://issues.apache.org/jira/browse/SPARK-26235 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Trivial > Fix For: 3.0.0 > > > In my local setup, I set the log4j root category to ERROR > (https://stackoverflow.com/questions/27781187/how-to-stop-info-messages-displaying-on-spark-console > , the first item that shows up if we google "set spark log level".) > When I run a command such as > ``` > spark-submit --class foo bar.jar > ``` > nothing shows up, and the script exits. > After quick investigation, I think the log level for > ClassNotFoundException/NoClassDefFoundError in SparkSubmit should be ERROR > instead of WARN. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26235) Change log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit to Error
[ https://issues.apache.org/jira/browse/SPARK-26235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26235: - Assignee: Gengliang Wang > Change log level for ClassNotFoundException/NoClassDefFoundError in > SparkSubmit to Error > > > Key: SPARK-26235 > URL: https://issues.apache.org/jira/browse/SPARK-26235 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Trivial > Fix For: 3.0.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26181) the `hasMinMaxStats` method of `ColumnStatsMap` is not correct
[ https://issues.apache.org/jira/browse/SPARK-26181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26181: --- Assignee: Adrian Wang > the `hasMinMaxStats` method of `ColumnStatsMap` is not correct > -- > > Key: SPARK-26181 > URL: https://issues.apache.org/jira/browse/SPARK-26181 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Adrian Wang >Assignee: Adrian Wang >Priority: Major > Fix For: 2.4.1, 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26181) the `hasMinMaxStats` method of `ColumnStatsMap` is not correct
[ https://issues.apache.org/jira/browse/SPARK-26181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26181. - Resolution: Fixed Fix Version/s: 2.4.1 3.0.0 Issue resolved by pull request 23152 [https://github.com/apache/spark/pull/23152] > the `hasMinMaxStats` method of `ColumnStatsMap` is not correct > -- > > Key: SPARK-26181 > URL: https://issues.apache.org/jira/browse/SPARK-26181 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Adrian Wang >Assignee: Adrian Wang >Priority: Major > Fix For: 3.0.0, 2.4.1 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26173) Prior regularization for Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-26173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Facundo Bellosi updated SPARK-26173: Description: This feature enables Maximum A Posteriori (MAP) optimization for Logistic Regression based on a Gaussian prior. In practice, this is just implementing a more general form of L2 regularization parameterized by a (multivariate) mean and precisions (inverse of variance) vectors. Prior regularization is calculated through the following formula: !Prior regularization.png! where: * λ: regularization parameter ({{regParam}}) * K: number of coefficients (weights vector length) * w~i~ with prior Normal(μ~i~, β~i~^2^) _Reference: Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning (section 4.5). Berlin, Heidelberg: Springer-Verlag._ h3. Existing implementations * Python: [bayes_logistic|https://pypi.org/project/bayes_logistic/] h2. Implementation * 2 new parameters added to {{LogisticRegression}}: {{priorMean}} and {{priorPrecisions}}. * 1 new class ({{PriorRegularization}}) implements the calculations of the value and gradient of the prior regularization term. * Prior regularization is enabled when both vectors are provided and {{regParam}} > 0 and {{elasticNetParam}} < 1. h2. 
Tests * {{DifferentiableRegularizationSuite}} ** {{Prior regularization}} * {{LogisticRegressionSuite}} ** {{prior precisions should be required when prior mean is set}} ** {{prior mean should be required when prior precisions is set}} ** {{`regParam` should be positive when using prior regularization}} ** {{`elasticNetParam` should be less than 1.0 when using prior regularization}} ** {{prior mean and precisions should have equal length}} ** {{priors' length should match number of features}} ** {{binary logistic regression with prior regularization equivalent to L2}} ** {{binary logistic regression with prior regularization equivalent to L2 (bis)}} ** {{binary logistic regression with prior regularization}} was: This feature enables Maximum A Posteriori (MAP) optimization for Logistic Regression based on a Gaussian prior. In practice, this is just implementing a more general form of L2 regularization parameterized by a (multivariate) mean and precisions (inverse of variance) vectors. Prior regularization is calculated through the following formula: !Prior regularization.png! where: * λ: regularization parameter ({{regParam}}) * K: number of coefficients (weights vector length) * w~i~ with prior Normal(μ~i~, β~i~^2^) _Reference: Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning (section 4.5). Berlin, Heidelberg: Springer-Verlag._ h2. Implementation * 2 new parameters added to {{LogisticRegression}}: {{priorMean}} and {{priorPrecisions}}. * 1 new class ({{PriorRegularization}}) implements the calculations of the value and gradient of the prior regularization term. * Prior regularization is enabled when both vectors are provided and {{regParam}} > 0 and {{elasticNetParam}} < 1. h2. 
Tests * {{DifferentiableRegularizationSuite}} ** {{Prior regularization}} * {{LogisticRegressionSuite}} ** {{prior precisions should be required when prior mean is set}} ** {{prior mean should be required when prior precisions is set}} ** {{`regParam` should be positive when using prior regularization}} ** {{`elasticNetParam` should be less than 1.0 when using prior regularization}} ** {{prior mean and precisions should have equal length}} ** {{priors' length should match number of features}} ** {{binary logistic regression with prior regularization equivalent to L2}} ** {{binary logistic regression with prior regularization equivalent to L2 (bis)}} ** {{binary logistic regression with prior regularization}} > Prior regularization for Logistic Regression > > > Key: SPARK-26173 > URL: https://issues.apache.org/jira/browse/SPARK-26173 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.4.0 >Reporter: Facundo Bellosi >Priority: Minor > Attachments: Prior regularization.png > > > This feature enables Maximum A Posteriori (MAP) optimization for Logistic > Regression based on a Gaussian prior. In practice, this is just implementing > a more general form of L2 regularization parameterized by a (multivariate) > mean and precisions (inverse of variance) vectors. > Prior regularization is calculated through the following formula: > !Prior regularization.png! > where: > * λ: regularization parameter ({{regParam}}) > * K: number of coefficients (weights vector length) > * w~i~ with prior Normal(μ~i~, β~i~^2^) > _Reference: Bishop, Christopher M. (2006). Pattern Recognition and Machine > Learning (section 4.5). Berlin, Heidelberg: Springer-Verlag._ > h3. Existing
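The prior-regularization term described above (a generalized L2 penalty parameterized by a prior mean and precisions vector) can be sketched as follows. This is a minimal illustration, assuming the penalty takes the form (regParam / 2) · Σᵢ pᵢ (wᵢ − μᵢ)², with gradient regParam · pᵢ (wᵢ − μᵢ); the function name and exact scaling are assumptions, not the actual {{PriorRegularization}} implementation:

```python
import numpy as np

def prior_regularization(w, prior_mean, prior_precisions, reg_param):
    """Value and gradient of a Gaussian-prior (MAP) penalty.

    Assumed form: (reg_param / 2) * sum_i p_i * (w_i - mu_i)^2.
    With mu = 0 and unit precisions this reduces to plain L2
    regularization, (reg_param / 2) * ||w||^2, which is what the
    "equivalent to L2" tests above would exercise.
    """
    diff = w - prior_mean
    value = 0.5 * reg_param * np.sum(prior_precisions * diff ** 2)
    gradient = reg_param * prior_precisions * diff
    return value, gradient

# Zero mean and unit precisions: the penalty matches standard L2.
w = np.array([1.0, -2.0, 3.0])
value, grad = prior_regularization(w, np.zeros(3), np.ones(3), 0.1)
```

A non-zero prior mean pulls the coefficients toward μ instead of toward zero, and per-coefficient precisions weight how strongly each coefficient is pulled.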
[jira] [Assigned] (SPARK-26256) Add proper labels when deleting pods
[ https://issues.apache.org/jira/browse/SPARK-26256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26256: Assignee: Apache Spark > Add proper labels when deleting pods > > > Key: SPARK-26256 > URL: https://issues.apache.org/jira/browse/SPARK-26256 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Assignee: Apache Spark >Priority: Major > > As discussed here: > [https://github.com/apache/spark/pull/23136#discussion_r236463330] > we need to add proper labels to avoid killing executors belonging to other > jobs.
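The fix described above amounts to scoping pod deletion by labels so that only this application's executors are removed. A minimal sketch of that selection logic, assuming pods carry the usual Spark-on-Kubernetes labels (`spark-app-selector`, `spark-role`) — the function and its shape are illustrative, not the actual patch:

```python
def pods_to_delete(pods, app_id):
    """Select only executor pods belonging to the given application.

    `pods` is a list of (name, labels) pairs, where labels is a dict.
    Filtering on both the app id and the executor role prevents
    deleting executors (or drivers) that belong to other jobs.
    """
    return [
        name
        for name, labels in pods
        if labels.get("spark-app-selector") == app_id
        and labels.get("spark-role") == "executor"
    ]

# Pods from two different applications plus a driver pod:
pods = [
    ("exec-1", {"spark-app-selector": "app-a", "spark-role": "executor"}),
    ("exec-2", {"spark-app-selector": "app-b", "spark-role": "executor"}),
    ("driver-a", {"spark-app-selector": "app-a", "spark-role": "driver"}),
]
selected = pods_to_delete(pods, "app-a")
```

In a real cluster the same effect is achieved with a label selector on the list/delete call itself rather than client-side filtering.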
[jira] [Commented] (SPARK-25530) data source v2 API refactor (batch write)
[ https://issues.apache.org/jira/browse/SPARK-25530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707365#comment-16707365 ] Apache Spark commented on SPARK-25530: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/23208 > data source v2 API refactor (batch write) > - > > Key: SPARK-25530 > URL: https://issues.apache.org/jira/browse/SPARK-25530 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Adjust the batch write API to match the read API after refactor
[jira] [Assigned] (SPARK-26256) Add proper labels when deleting pods
[ https://issues.apache.org/jira/browse/SPARK-26256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26256: Assignee: (was: Apache Spark) > Add proper labels when deleting pods > > > Key: SPARK-26256 > URL: https://issues.apache.org/jira/browse/SPARK-26256 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Priority: Major > > As discussed here: > [https://github.com/apache/spark/pull/23136#discussion_r236463330] > we need to add proper labels to avoid killing executors belonging to other > jobs.
[jira] [Commented] (SPARK-26256) Add proper labels when deleting pods
[ https://issues.apache.org/jira/browse/SPARK-26256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707370#comment-16707370 ] Apache Spark commented on SPARK-26256: -- User 'skonto' has created a pull request for this issue: https://github.com/apache/spark/pull/23209 > Add proper labels when deleting pods > > > Key: SPARK-26256 > URL: https://issues.apache.org/jira/browse/SPARK-26256 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Priority: Major > > As discussed here: > [https://github.com/apache/spark/pull/23136#discussion_r236463330] > we need to add proper labels to avoid killing executors belonging to other > jobs.
[jira] [Commented] (SPARK-25530) data source v2 API refactor (batch write)
[ https://issues.apache.org/jira/browse/SPARK-25530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707361#comment-16707361 ] Apache Spark commented on SPARK-25530: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/23208 > data source v2 API refactor (batch write) > - > > Key: SPARK-25530 > URL: https://issues.apache.org/jira/browse/SPARK-25530 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Adjust the batch write API to match the read API after refactor
[jira] [Assigned] (SPARK-25530) data source v2 API refactor (batch write)
[ https://issues.apache.org/jira/browse/SPARK-25530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25530: Assignee: (was: Apache Spark) > data source v2 API refactor (batch write) > - > > Key: SPARK-25530 > URL: https://issues.apache.org/jira/browse/SPARK-25530 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Adjust the batch write API to match the read API after refactor
[jira] [Assigned] (SPARK-25530) data source v2 API refactor (batch write)
[ https://issues.apache.org/jira/browse/SPARK-25530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25530: Assignee: Apache Spark > data source v2 API refactor (batch write) > - > > Key: SPARK-25530 > URL: https://issues.apache.org/jira/browse/SPARK-25530 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > > Adjust the batch write API to match the read API after refactor
[jira] [Updated] (SPARK-25530) data source v2 API refactor (batch write)
[ https://issues.apache.org/jira/browse/SPARK-25530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-25530: Summary: data source v2 API refactor (batch write) (was: data source v2 write side API refactoring) > data source v2 API refactor (batch write) > - > > Key: SPARK-25530 > URL: https://issues.apache.org/jira/browse/SPARK-25530 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > refactor the write side API according to this abstraction > {code} > batch: catalog -> table -> write > streaming: catalog -> table -> stream -> write > {code}
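The `catalog -> table -> write` layering quoted above can be sketched as a chain of builders, where streaming inserts one extra hop between the table and the write. This is a hypothetical illustration of the abstraction only; the class and method names below are made up and do not correspond to Spark's actual DataSource V2 interfaces:

```python
class Catalog:
    """Resolves table names to Table objects (illustrative only)."""
    def __init__(self, tables):
        self._tables = tables

    def load_table(self, name):
        return self._tables[name]


class Table:
    def __init__(self, name):
        self.name = name

    def new_write_builder(self):
        # batch path: catalog -> table -> write
        return WriteBuilder(self)

    def new_stream(self):
        # streaming path adds one hop: catalog -> table -> stream -> write
        return Stream(self)


class Stream:
    def __init__(self, table):
        self.table = table

    def new_write_builder(self):
        return WriteBuilder(self.table)


class WriteBuilder:
    def __init__(self, table):
        self.table = table

    def build(self):
        return f"write({self.table.name})"


catalog = Catalog({"t": Table("t")})
batch_write = catalog.load_table("t").new_write_builder().build()
stream_write = catalog.load_table("t").new_stream().new_write_builder().build()
```

The point of the refactor is that both paths end at the same write abstraction, reached through the same table object.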
[jira] [Updated] (SPARK-25530) data source v2 API refactor (batch write)
[ https://issues.apache.org/jira/browse/SPARK-25530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-25530: Description: Adjust the batch write API to match the read API after refactor > data source v2 API refactor (batch write) > - > > Key: SPARK-25530 > URL: https://issues.apache.org/jira/browse/SPARK-25530 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > Adjust the batch write API to match the read API after refactor
[jira] [Updated] (SPARK-25530) data source v2 API refactor (batch write)
[ https://issues.apache.org/jira/browse/SPARK-25530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-25530: Description: (was: refactor the write side API according to this abstraction {code} batch: catalog -> table -> write streaming: catalog -> table -> stream -> write {code}) > data source v2 API refactor (batch write) > - > > Key: SPARK-25530 > URL: https://issues.apache.org/jira/browse/SPARK-25530 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major >
[jira] [Resolved] (SPARK-26253) Task Summary Metrics Table on Stage Page shows empty table when no data is present
[ https://issues.apache.org/jira/browse/SPARK-26253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26253. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23205 [https://github.com/apache/spark/pull/23205] > Task Summary Metrics Table on Stage Page shows empty table when no data is > present > -- > > Key: SPARK-26253 > URL: https://issues.apache.org/jira/browse/SPARK-26253 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Parth Gandhi >Assignee: Parth Gandhi >Priority: Minor > Fix For: 3.0.0 > > > Task Summary Metrics Table on Stage Page shows empty table when no data is > present instead of showing a message.
[jira] [Assigned] (SPARK-26253) Task Summary Metrics Table on Stage Page shows empty table when no data is present
[ https://issues.apache.org/jira/browse/SPARK-26253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26253: - Assignee: Parth Gandhi This should just be a follow-up to SPARK-21089, not a new issue. > Task Summary Metrics Table on Stage Page shows empty table when no data is > present > -- > > Key: SPARK-26253 > URL: https://issues.apache.org/jira/browse/SPARK-26253 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Parth Gandhi >Assignee: Parth Gandhi >Priority: Minor > > Task Summary Metrics Table on Stage Page shows empty table when no data is > present instead of showing a message.