[jira] [Commented] (SPARK-26709) OptimizeMetadataOnlyQuery does not correctly handle the files with zero record

2019-01-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751975#comment-16751975
 ] 

Apache Spark commented on SPARK-26709:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/23648

> OptimizeMetadataOnlyQuery does not correctly handle the files with zero record
> --
>
> Key: SPARK-26709
> URL: https://issues.apache.org/jira/browse/SPARK-26709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.2, 2.4.0
>Reporter: Xiao Li
>Assignee: Gengliang Wang
>Priority: Blocker
>  Labels: correctness
>
> {code:java}
> import org.apache.spark.sql.functions.lit
> withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
>   withTempPath { path =>
>     val tabLocation = path.getAbsolutePath
>     val partLocation = new Path(path.getAbsolutePath, "partCol1=3")
>     val df = spark.emptyDataFrame.select(lit(1).as("col1"))
>     df.write.parquet(partLocation.toString)
>     val readDF = spark.read.parquet(tabLocation)
>     checkAnswer(readDF.selectExpr("max(partCol1)"), Row(null))
>     checkAnswer(readDF.selectExpr("max(col1)"), Row(null))
>   }
> }
> {code}
> OptimizeMetadataOnlyQuery has a correctness bug when handling files with empty
> records for partitioned tables. The above test fails in 2.4, which can
> generate an empty file, but the underlying issue in the read path also
> exists in 2.3, 2.2, and 2.1.
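A possible mitigation while the fix is pending, assuming only that the config key is the same SQLConf.OPTIMIZER_METADATA_ONLY entry the repro above toggles (a sketch, not part of the linked PR):

{code:java}
import org.apache.spark.sql.internal.SQLConf

// Turn the metadata-only optimization off so aggregates over partition columns
// fall back to scanning the data files instead of answering from catalog metadata.
spark.conf.set(SQLConf.OPTIMIZER_METADATA_ONLY.key, "false")
{code}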



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26709) OptimizeMetadataOnlyQuery does not correctly handle the files with zero record

2019-01-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751974#comment-16751974
 ] 

Apache Spark commented on SPARK-26709:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/23648

> OptimizeMetadataOnlyQuery does not correctly handle the files with zero record
> --
>
> Key: SPARK-26709
> URL: https://issues.apache.org/jira/browse/SPARK-26709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.2, 2.4.0
>Reporter: Xiao Li
>Assignee: Gengliang Wang
>Priority: Blocker
>  Labels: correctness
>
> {code:java}
> import org.apache.spark.sql.functions.lit
> withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
>   withTempPath { path =>
>     val tabLocation = path.getAbsolutePath
>     val partLocation = new Path(path.getAbsolutePath, "partCol1=3")
>     val df = spark.emptyDataFrame.select(lit(1).as("col1"))
>     df.write.parquet(partLocation.toString)
>     val readDF = spark.read.parquet(tabLocation)
>     checkAnswer(readDF.selectExpr("max(partCol1)"), Row(null))
>     checkAnswer(readDF.selectExpr("max(col1)"), Row(null))
>   }
> }
> {code}
> OptimizeMetadataOnlyQuery has a correctness bug when handling files with empty
> records for partitioned tables. The above test fails in 2.4, which can
> generate an empty file, but the underlying issue in the read path also
> exists in 2.3, 2.2, and 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26712) Disk broken causing YarnShuffleService not available

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26712:


Assignee: (was: Apache Spark)

> Disk broken causing YarnShuffleService not available
> 
>
> Key: SPARK-26712
> URL: https://issues.apache.org/jira/browse/SPARK-26712
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.0, 2.4.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, `ExecutorShuffleInfo` can be recovered from a file if NM recovery is
> enabled. However, the recovery file lives under a single directory, which may
> become unavailable if that disk is broken. So if the NM restarts (for example,
> because it was killed), the shuffle service cannot start even if there are
> executors on the node.
> This may ultimately cause job failures (if the node, or the executors on it,
> are not blacklisted), or at least waste resources, since shuffles from this
> node always fail.
> For long-running Spark applications, this problem can be even more serious.
> So I think we should support multiple directories (multiple disks) for this
> recovery, and switch to a healthy directory when the disk holding the current
> directory is broken.
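A rough sketch of the proposed direction, assuming the recovery-file location can be generalized to a list of candidate directories; the helper name and paths below are illustrative, not the actual YarnShuffleService API:

{code:java}
import java.io.File

// Hypothetical helper: pick the first candidate recovery directory whose disk is
// usable, instead of depending on a single hard-coded location.
def chooseRecoveryFile(candidateDirs: Seq[String], fileName: String): Option[File] =
  candidateDirs.iterator
    .map(dir => new File(dir, fileName))
    .find { f =>
      val parent = f.getParentFile
      // A directory on a broken disk typically fails these cheap checks.
      parent != null && (parent.isDirectory || parent.mkdirs()) && parent.canWrite
    }

// Illustrative usage: try each NM local dir in turn for the executor registration DB.
val recoveryFile = chooseRecoveryFile(
  Seq("/data1/yarn/nm-recovery", "/data2/yarn/nm-recovery"),
  "registeredExecutors.ldb")
{code}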



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26712) Disk broken causing YarnShuffleService not available

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26712:


Assignee: Apache Spark

> Disk broken causing YarnShuffleService not available
> 
>
> Key: SPARK-26712
> URL: https://issues.apache.org/jira/browse/SPARK-26712
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.0, 2.4.0
>Reporter: liupengcheng
>Assignee: Apache Spark
>Priority: Major
>
> Currently, `ExecutorShuffleInfo` can be recovered from a file if NM recovery is
> enabled. However, the recovery file lives under a single directory, which may
> become unavailable if that disk is broken. So if the NM restarts (for example,
> because it was killed), the shuffle service cannot start even if there are
> executors on the node.
> This may ultimately cause job failures (if the node, or the executors on it,
> are not blacklisted), or at least waste resources, since shuffles from this
> node always fail.
> For long-running Spark applications, this problem can be even more serious.
> So I think we should support multiple directories (multiple disks) for this
> recovery, and switch to a healthy directory when the disk holding the current
> directory is broken.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26708) Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan

2019-01-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751927#comment-16751927
 ] 

Dongjoon Hyun commented on SPARK-26708:
---

Hi, [~smilegator]. Is this only related to Spark 2.4.0?

> Incorrect result caused by inconsistency between a SQL cache's cached RDD and 
> its physical plan
> ---
>
> Key: SPARK-26708
> URL: https://issues.apache.org/jira/browse/SPARK-26708
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Maryann Xue
>Priority: Blocker
>  Labels: correctness
>
> When performing non-cascading cache invalidation, {{recache}} is called on
> the other cache entries that depend on the cache being invalidated.
> This leads to the physical plans of those cache entries being re-compiled.
> For those cache entries, if the cached RDD has already been persisted, chances
> are there will be an inconsistency between the data and the new plan. It can
> cause a correctness issue if the new plan's {{outputPartitioning}} or
> {{outputOrdering}} is different from that of the actual data, and
> meanwhile the cache is used by another query that asks for a specific
> {{outputPartitioning}} or {{outputOrdering}} which happens to match the new
> plan but not the actual data.
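To make the failure mode concrete, here is a hedged sketch of the kind of sequence the description implies (illustrative only, under the assumption that unpersisting the base cache triggers the non-cascading recache described above; it is not claimed to be a minimal repro):

{code:java}
import spark.implicits._

val base = spark.range(0, 100).repartition($"id").cache()
val derived = base.filter($"id" % 2 === 0).cache()
derived.count()      // materializes derived's cached RDD with its current layout

// Non-cascading invalidation: derived is kept, but recache re-compiles its physical plan.
base.unpersist()

// If the re-compiled plan reports an outputPartitioning/outputOrdering that the
// already-persisted RDD does not satisfy, a later query relying on that property
// (for example, to skip an exchange in a join) can read rows from the wrong
// partitions and return an incorrect result.
derived.join(spark.range(0, 100).toDF("id"), "id").count()
{code}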



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26725) Fix the input values of UnifiedMemoryManager constructor in test suites

2019-01-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26725:
-

Assignee: Sean Owen

> Fix the input values of UnifiedMemoryManager constructor in test suites
> ---
>
> Key: SPARK-26725
> URL: https://issues.apache.org/jira/browse/SPARK-26725
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Sean Owen
>Priority: Minor
>  Labels: starter
>
> Addressed the comments in 
> https://github.com/apache/spark/pull/23457#issuecomment-457409976



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26725) Fix the input values of UnifiedMemoryManager constructor in test suites

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26725:


Assignee: Sean Owen  (was: Apache Spark)

> Fix the input values of UnifiedMemoryManager constructor in test suites
> ---
>
> Key: SPARK-26725
> URL: https://issues.apache.org/jira/browse/SPARK-26725
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Sean Owen
>Priority: Minor
>  Labels: starter
>
> Addressed the comments in 
> https://github.com/apache/spark/pull/23457#issuecomment-457409976



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26725) Fix the input values of UnifiedMemoryManager constructor in test suites

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26725:


Assignee: Apache Spark  (was: Sean Owen)

> Fix the input values of UnifiedMemoryManager constructor in test suites
> ---
>
> Key: SPARK-26725
> URL: https://issues.apache.org/jira/browse/SPARK-26725
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> Addressed the comments in 
> https://github.com/apache/spark/pull/23457#issuecomment-457409976



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26725) Fix the input values of UnifiedMemoryManager constructor in test suites

2019-01-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26725:
--
Priority: Minor  (was: Major)

> Fix the input values of UnifiedMemoryManager constructor in test suites
> ---
>
> Key: SPARK-26725
> URL: https://issues.apache.org/jira/browse/SPARK-26725
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Minor
>  Labels: starter
>
> Addressed the comments in 
> https://github.com/apache/spark/pull/23457#issuecomment-457409976



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751890#comment-16751890
 ] 

Dongjoon Hyun commented on SPARK-26677:
---

Thank you, [~anandchinn] and [~hyukjin.kwon].
So, according to the PR and PARQUET-1309, only the Parquet 1.10.0 version (used in
Spark 2.4.0) has this issue.

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Blocker
>  Labels: correctness
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26708) Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26708:


Assignee: Apache Spark  (was: Maryann Xue)

> Incorrect result caused by inconsistency between a SQL cache's cached RDD and 
> its physical plan
> ---
>
> Key: SPARK-26708
> URL: https://issues.apache.org/jira/browse/SPARK-26708
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: correctness
>
> When performing non-cascading cache invalidation, {{recache}} is called on
> the other cache entries that depend on the cache being invalidated.
> This leads to the physical plans of those cache entries being re-compiled.
> For those cache entries, if the cached RDD has already been persisted, chances
> are there will be an inconsistency between the data and the new plan. It can
> cause a correctness issue if the new plan's {{outputPartitioning}} or
> {{outputOrdering}} is different from that of the actual data, and
> meanwhile the cache is used by another query that asks for a specific
> {{outputPartitioning}} or {{outputOrdering}} which happens to match the new
> plan but not the actual data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26708) Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26708:


Assignee: Maryann Xue  (was: Apache Spark)

> Incorrect result caused by inconsistency between a SQL cache's cached RDD and 
> its physical plan
> ---
>
> Key: SPARK-26708
> URL: https://issues.apache.org/jira/browse/SPARK-26708
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Maryann Xue
>Priority: Blocker
>  Labels: correctness
>
> When performing non-cascading cache invalidation, {{recache}} is called on
> the other cache entries that depend on the cache being invalidated.
> This leads to the physical plans of those cache entries being re-compiled.
> For those cache entries, if the cached RDD has already been persisted, chances
> are there will be an inconsistency between the data and the new plan. It can
> cause a correctness issue if the new plan's {{outputPartitioning}} or
> {{outputOrdering}} is different from that of the actual data, and
> meanwhile the cache is used by another query that asks for a specific
> {{outputPartitioning}} or {{outputOrdering}} which happens to match the new
> plan but not the actual data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25767) Error reported in Spark logs when using the org.apache.spark:spark-sql_2.11:2.3.2 Java library

2019-01-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25767:
--
Fix Version/s: 2.3.3

> Error reported in Spark logs when using the 
> org.apache.spark:spark-sql_2.11:2.3.2 Java library
> --
>
> Key: SPARK-25767
> URL: https://issues.apache.org/jira/browse/SPARK-25767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.2
>Reporter: Thomas Brugiere
>Assignee: Peter Toth
>Priority: Major
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> Hi,
> Here is a bug I found using the latest version of spark-sql_2.11:2.2.0. Note 
> that this case was also tested with spark-sql_2.11:2.3.2 and the bug is also 
> present.
> This issue is a duplicate of SPARK-25582, which I had to close after an
> accidental manipulation by another developer (it was linked to the wrong PR).
> You will find attached three small sample CSV files with the minimal content
> needed to raise the bug.
> Find below the reproducer code:
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import scala.collection.JavaConverters;
> import scala.collection.Seq;
> import java.util.Arrays;
> public class SparkBug {
>     private static <T> Seq<T> arrayToSeq(T[] input) {
>         return JavaConverters.asScalaIteratorConverter(Arrays.asList(input).iterator()).asScala().toSeq();
>     }
>
>     public static void main(String[] args) throws Exception {
>         SparkConf conf = new SparkConf().setAppName("SparkBug").setMaster("local");
>         SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
>         Dataset<Row> df_a = sparkSession.read().option("header", true).csv("local/fileA.csv").dropDuplicates();
>         Dataset<Row> df_b = sparkSession.read().option("header", true).csv("local/fileB.csv").dropDuplicates();
>         Dataset<Row> df_c = sparkSession.read().option("header", true).csv("local/fileC.csv").dropDuplicates();
>         String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", "colE", "colF"};
>         String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", "colE"};
>         Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), "left");
>         Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, arrayToSeq(key_join_2), "left");
>         df_inventory_2.show();
>     }
> }
> {code}
> When running this code, I can see the exception below:
> {code:java}
> 18/10/18 09:25:49 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958)
>     at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:212)
>     at 
> 

[jira] [Resolved] (SPARK-26649) Noop Streaming Sink using DSV2

2019-01-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26649.
---
   Resolution: Fixed
 Assignee: Gabor Somogyi
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23631

> Noop Streaming Sink using DSV2
> --
>
> Key: SPARK-26649
> URL: https://issues.apache.org/jira/browse/SPARK-26649
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Extend this noop data source to support a streaming sink that ignores all the 
> elements.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26680) StackOverflowError if Stream passed to groupBy

2019-01-24 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26680:
--
Fix Version/s: 2.3.3

> StackOverflowError if Stream passed to groupBy
> --
>
> Key: SPARK-26680
> URL: https://issues.apache.org/jira/browse/SPARK-26680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0, 3.0.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> This Java code results in a StackOverflowError:
> {code:java}
> List<Column> groupByCols = new ArrayList<>();
> groupByCols.add(new Column("id1"));
> scala.collection.Seq<Column> groupByColsSeq =
>     JavaConverters.asScalaIteratorConverter(groupByCols.iterator())
>         .asScala().toSeq();
> df.groupBy(groupByColsSeq).max("id2").toDF("id1", "id2").show();
> {code}
> The {{toSeq}} method above produces a Stream. Passing a Stream to groupBy 
> results in the StackOverflowError. In fact, the error can be produced more 
> easily in spark-shell:
> {noformat}
> scala> val df = spark.read.schema("id1 int, id2 int").csv("testinput.csv")
> df: org.apache.spark.sql.DataFrame = [id1: int, id2: int]
> scala> val groupBySeq = Stream(col("id1"))
> groupBySeq: scala.collection.immutable.Stream[org.apache.spark.sql.Column] = 
> Stream(id1, ?)
> scala> df.groupBy(groupBySeq: _*).max("id2").toDF("id1", "id2").collect
> java.lang.StackOverflowError
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1161)
>   at scala.collection.immutable.Stream.drop(Stream.scala:797)
>   at scala.collection.immutable.Stream.drop(Stream.scala:204)
>   at scala.collection.LinearSeqOptimized.apply(LinearSeqOptimized.scala:66)
>   at scala.collection.LinearSeqOptimized.apply$(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.Stream.apply(Stream.scala:204)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$consume$3(WholeStageCodegenExec.scala:159)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1171)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1161)
>   at scala.collection.immutable.Stream.drop(Stream.scala:797)
>   at scala.collection.immutable.Stream.drop(Stream.scala:204)
>   at scala.collection.LinearSeqOptimized.apply(LinearSeqOptimized.scala:66)
>   at scala.collection.LinearSeqOptimized.apply$(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.Stream.apply(Stream.scala:204)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$consume$3(WholeStageCodegenExec.scala:159)
> ...etc...
> {noformat}
> This is due to the lazy nature of Streams. The method {{consume}} in 
> {{CodegenSupport}} assumes that a map function will be eagerly evaluated:
> {code:java}
> val inputVars =
> ctx.currentVars = null   // <== the closure cares about this
> ctx.INPUT_ROW = row
> output.zipWithIndex.map { case (attr, i) =>
>   BoundReference(i, attr.dataType, attr.nullable).genCode(ctx)
> }
> // ... (intervening lines elided) ...
> ctx.currentVars = inputVars
> ctx.INPUT_ROW = null
> ctx.freshNamePrefix = parent.variablePrefix
> val evaluated = evaluateRequiredVariables(output, inputVars, 
> parent.usedInputs)
> {code}
> The closure passed to the map function assumes {{ctx.currentVars}} will be 
> set to null. But due to lazy evaluation, {{ctx.currentVars}} is set to 
> something else by the time the closure is actually called. Worse yet, 
> {{ctx.currentVars}} is set to the yet-to-be evaluated inputVars stream. The 
> closure uses {{ctx.currentVars}} (via the call {{genCode(ctx)}}), therefore 
> it ends up using the data structure it is attempting to create.
> You can recreate the problem in a vanilla Scala shell:
> {code:java}
> scala> var p1: Seq[Any] = null
> p1: Seq[Any] = null
> scala> val s = Stream(1, 2).zipWithIndex.map { case (x, i) => if (p1 != null) 
> p1(i) else x }
> s: scala.collection.immutable.Stream[Any] = Stream(1, ?)
> scala> p1 = s
> p1: Seq[Any] = Stream(1, ?)
> scala> s.foreach(println)
> 1

[jira] [Assigned] (SPARK-25713) Implement copy() for ColumnarArray

2019-01-24 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25713:
---

Assignee: Artsiom Yudovin

> Implement copy() for ColumnarArray
> --
>
> Key: SPARK-25713
> URL: https://issues.apache.org/jira/browse/SPARK-25713
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Liwen Sun
>Assignee: Artsiom Yudovin
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26654) Use Timestamp/DateFormatter in CatalogColumnStat

2019-01-24 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751855#comment-16751855
 ] 

Wenchen Fan commented on SPARK-26654:
-

+1. I think we store the string format instead of the actual long value so that
the stats can be human-readable. For correctness, the string format must be able
to convert back to the actual long value without ambiguity.
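A minimal, self-contained sketch of that round-trip requirement using plain java.time (not the Spark formatters this ticket is about), assuming microsecond precision and a pinned UTC zone:

{code:java}
import java.time.{Instant, LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")
val micros = 1548349123456789L  // hypothetical column-stat value, in microseconds since the epoch

// Serialize to a human-readable string...
val instant = Instant.ofEpochSecond(micros / 1000000L, (micros % 1000000L) * 1000L)
val asString = LocalDateTime.ofInstant(instant, ZoneOffset.UTC).format(fmt)

// ...and parse it back. The round trip is lossless only because the pattern keeps full
// microsecond precision and the zone is fixed; that is the "no ambiguity" requirement.
val parsed = LocalDateTime.parse(asString, fmt).toInstant(ZoneOffset.UTC)
val roundTripped = parsed.getEpochSecond * 1000000L + parsed.getNano / 1000L
assert(roundTripped == micros)
{code}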

> Use Timestamp/DateFormatter in CatalogColumnStat
> 
>
> Key: SPARK-26654
> URL: https://issues.apache.org/jira/browse/SPARK-26654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Need to switch fromExternalString on Timestamp/DateFormatters, in particular:
> https://github.com/apache/spark/blob/3b7395fe025a4c9a591835e53ac6ca05be6868f1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L481-L482



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26569) Fixed point for batch Operator Optimizations never reached when optimize logicalPlan

2019-01-24 Thread Chen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751801#comment-16751801
 ] 

Chen Fan commented on SPARK-26569:
--

I believe the PR under https://issues.apache.org/jira/browse/SPARK-21652 already
fixes it in Spark 2.3.

> Fixed point for batch Operator Optimizations never reached when optimize 
> logicalPlan
> 
>
> Key: SPARK-26569
> URL: https://issues.apache.org/jira/browse/SPARK-26569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment:  
>  
>Reporter: Chen Fan
>Priority: Major
>
> There is a somewhat complicated Spark app using the Dataset API that runs once
> a day, and I noticed that the app hangs once in a while.
> I added some logging and compared two driver logs, one from a successful run
> and one from a failed run. Here are some results of the investigation:
> 1. Usually the app runs correctly, but sometimes it hangs after finishing job 1.
> !image-2019-01-08-19-53-20-509.png!
> 2. According to the logs I added, the successful run always reaches the fixed
> point at iteration 7 of the Operator Optimizations batch, but the failed run
> never reaches this fixed point.
> {code:java}
> 2019-01-04,11:35:34,199 DEBUG org.apache.spark.sql.execution.SparkOptimizer: 
> === Result of Batch Operator Optimizations ===
> 2019-01-04,14:00:42,847 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> iteration is 6/100, for batch Operator Optimizations
> 2019-01-04,14:00:42,851 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
>  
> 2019-01-04,14:00:42,852 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicate 
> ===
>  
> 2019-01-04,14:00:42,903 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
>  
> 2019-01-04,14:00:42,939 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
>  
> 2019-01-04,14:00:42,951 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PruneFilters ===
>  
> 2019-01-04,14:00:42,970 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> iteration is 7/100, for batch Operator Optimizations
> 2019-01-04,14:00:42,971 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> Fixed point reached for batch Operator Optimizations after 7 iterations.
> 2019-01-04,14:13:15,616 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> iteration is 45/100, for batch Operator Optimizations
> 2019-01-04,14:13:15,619 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
>  
> 2019-01-04,14:13:15,620 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicate 
> ===
>  
> 2019-01-04,14:13:59,529 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
>  
> 2019-01-04,14:13:59,806 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
>  
> 2019-01-04,14:13:59,845 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> iteration is 46/100, for batch Operator Optimizations
> 2019-01-04,14:13:59,849 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
>  
> 2019-01-04,14:13:59,849 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicate 
> ===
>  
> 2019-01-04,14:14:45,340 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
>  
> 2019-01-04,14:14:45,631 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
>  
> 2019-01-04,14:14:45,678 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> iteration is 47/100, for batch Operator Optimizations
> {code}
> 3. The difference between the two logical plans appears in BooleanSimplification
> during an iteration; before this rule, the two logical plans are the same:
> {code:java}
> // just a head part of plan
> Project [model#2486, version#12, device#11, date#30, imei#13, pType#14, 
> releaseType#15, regionType#16, expType#17, tag#18, imeiCount#1586, 
> startdate#2360, enddate#2361, 

[jira] [Created] (SPARK-26725) Fix the input values of UnifiedMemoryManager constructor in test suites

2019-01-24 Thread Xiao Li (JIRA)
Xiao Li created SPARK-26725:
---

 Summary: Fix the input values of UnifiedMemoryManager constructor 
in test suites
 Key: SPARK-26725
 URL: https://issues.apache.org/jira/browse/SPARK-26725
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Xiao Li


Addressed the comments in 
https://github.com/apache/spark/pull/23457#issuecomment-457409976



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26725) Fix the input values of UnifiedMemoryManager constructor in test suites

2019-01-24 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26725:

Labels: starter  (was: )

> Fix the input values of UnifiedMemoryManager constructor in test suites
> ---
>
> Key: SPARK-26725
> URL: https://issues.apache.org/jira/browse/SPARK-26725
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> Addressed the comments in 
> https://github.com/apache/spark/pull/23457#issuecomment-457409976



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes

2019-01-24 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751830#comment-16751830
 ] 

Imran Rashid commented on SPARK-26688:
--

OK that is a reasonable request ... but to play devil's advocate, there is a 
downside to having devs try to use this workaround instead of going through 
Ops.  They never end up sending the feedback to Ops, Ops doesn't ever get 
metrics on the bad nodes, because nothing even tries to run there.  Meanwhile 
devs start to apply this willy-nilly, as these configs tend to just keep 
getting built up over time (I've seen cases where configs included 50 GB 
containers, just because once, 3 years ago, somebody's app needed that much 
memory, and that config just keeps getting copied forward to everyone).  
Eventually you've got a hodge-podge of blacklisting in place, and nobody really 
knows which nodes are actually good or bad.

Ideally, blacklisting and speculation should be able to prevent that problem 
from being so noticeable in the first place, but I can see how reality might be 
far from that ideal.

Nonetheless, on the whole I think I lean slightly in favor of adding this, but 
would like to hear more opinions.

> Provide configuration of initially blacklisted YARN nodes
> -
>
> Key: SPARK-26688
> URL: https://issues.apache.org/jira/browse/SPARK-26688
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Introducing new config for initially blacklisted YARN nodes.
> This came up in the apache spark user mailing list: 
> [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26569) Fixed point for batch Operator Optimizations never reached when optimize logicalPlan

2019-01-24 Thread Chen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Fan resolved SPARK-26569.
--
Resolution: Duplicate

> Fixed point for batch Operator Optimizations never reached when optimize 
> logicalPlan
> 
>
> Key: SPARK-26569
> URL: https://issues.apache.org/jira/browse/SPARK-26569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment:  
>  
>Reporter: Chen Fan
>Priority: Major
>
> There is a somewhat complicated Spark app using the Dataset API that runs once
> a day, and I noticed that the app hangs once in a while.
> I added some logging and compared two driver logs, one from a successful run
> and one from a failed run. Here are some results of the investigation:
> 1. Usually the app runs correctly, but sometimes it hangs after finishing job 1.
> !image-2019-01-08-19-53-20-509.png!
> 2. According to the logs I added, the successful run always reaches the fixed
> point at iteration 7 of the Operator Optimizations batch, but the failed run
> never reaches this fixed point.
> {code:java}
> 2019-01-04,11:35:34,199 DEBUG org.apache.spark.sql.execution.SparkOptimizer: 
> === Result of Batch Operator Optimizations ===
> 2019-01-04,14:00:42,847 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> iteration is 6/100, for batch Operator Optimizations
> 2019-01-04,14:00:42,851 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
>  
> 2019-01-04,14:00:42,852 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicate 
> ===
>  
> 2019-01-04,14:00:42,903 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
>  
> 2019-01-04,14:00:42,939 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
>  
> 2019-01-04,14:00:42,951 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PruneFilters ===
>  
> 2019-01-04,14:00:42,970 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> iteration is 7/100, for batch Operator Optimizations
> 2019-01-04,14:00:42,971 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> Fixed point reached for batch Operator Optimizations after 7 iterations.
> 2019-01-04,14:13:15,616 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> iteration is 45/100, for batch Operator Optimizations
> 2019-01-04,14:13:15,619 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
>  
> 2019-01-04,14:13:15,620 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicate 
> ===
>  
> 2019-01-04,14:13:59,529 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
>  
> 2019-01-04,14:13:59,806 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
>  
> 2019-01-04,14:13:59,845 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> iteration is 46/100, for batch Operator Optimizations
> 2019-01-04,14:13:59,849 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
>  
> 2019-01-04,14:13:59,849 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicate 
> ===
>  
> 2019-01-04,14:14:45,340 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
>  
> 2019-01-04,14:14:45,631 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
>  
> 2019-01-04,14:14:45,678 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> iteration is 47/100, for batch Operator Optimizations
> {code}
> 3. The difference between the two logical plans appears in BooleanSimplification
> during an iteration; before this rule, the two logical plans are the same:
> {code:java}
> // just a head part of plan
> Project [model#2486, version#12, device#11, date#30, imei#13, pType#14, 
> releaseType#15, regionType#16, expType#17, tag#18, imeiCount#1586, 
> startdate#2360, enddate#2361, status#2362]
> +- Join Inner, (((cast(cast(imeiCount#1586 as decimal(20,0)) as int)  
> 1) || (model#2486 

[jira] [Assigned] (SPARK-23674) Add Spark ML Listener for Tracking ML Pipeline Status

2019-01-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23674:


Assignee: Hyukjin Kwon

> Add Spark ML Listener for Tracking ML Pipeline Status
> -
>
> Key: SPARK-23674
> URL: https://issues.apache.org/jira/browse/SPARK-23674
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Mingjie Tang
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, Spark provides status monitoring for different components of
> Spark, such as the Spark history server, the streaming listener, and the SQL listener.
> The use cases would be (1) a front-end UI to track the training convergence
> rate across iterations, so data scientists can understand how a job converges
> while training models such as K-means, logistic regression, and other linear
> models, and (2) tracking the data lineage for the input and output of training data.
> In this proposal, we hope to provide a Spark ML pipeline listener that tracks
> the status of a Spark ML pipeline, including:
>  # ML pipeline created and saved
>  # ML pipeline model created, saved, and loaded
>  # ML model training status monitoring



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23674) Add Spark ML Listener for Tracking ML Pipeline Status

2019-01-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23674.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23263
[https://github.com/apache/spark/pull/23263]

> Add Spark ML Listener for Tracking ML Pipeline Status
> -
>
> Key: SPARK-23674
> URL: https://issues.apache.org/jira/browse/SPARK-23674
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Mingjie Tang
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, Spark provides status monitoring for different components of
> Spark, such as the Spark history server, the streaming listener, and the SQL listener.
> The use cases would be (1) a front-end UI to track the training convergence
> rate across iterations, so data scientists can understand how a job converges
> while training models such as K-means, logistic regression, and other linear
> models, and (2) tracking the data lineage for the input and output of training data.
> In this proposal, we hope to provide a Spark ML pipeline listener that tracks
> the status of a Spark ML pipeline, including:
>  # ML pipeline created and saved
>  # ML pipeline model created, saved, and loaded
>  # ML model training status monitoring



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2019-01-24 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751802#comment-16751802
 ] 

Bryan Cutler commented on SPARK-24579:
--

It would be great to start up this discussion again; I saw that the related JIRAs
SPARK-24579 and SPARK-26413 have recently been created. I think the ideal way
for Spark to exchange data is an iterator of Arrow record batches, which could 
be done with an arrow_udf (mentioned in the design here) or through a 
{{mapPartitions}} function. Maybe expanding on the examples touched on in the 
design doc would help with discussion?

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> 
>
> Key: SPARK-24579
> URL: https://issues.apache.org/jira/browse/SPARK-24579
> Project: Spark
>  Issue Type: Epic
>  Components: ML, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen
> Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange 
> between Apache Spark and DL%2FAI Frameworks .pdf
>
>
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark 
> as a unified
> analytics engine and the rise of AI frameworks like TensorFlow and Apache 
> MXNet (incubating).
> Both big data and AI are indispensable components to drive business 
> innovation and there have
> been multiple attempts from both communities to bring them together.
> We saw efforts from AI community to implement data solutions for AI 
> frameworks like tf.data and tf.Transform. However, with 50+ data sources and 
> built-in SQL, DataFrames, and Streaming features, Spark remains the community 
> choice for big data. This is why we saw many efforts to integrate DL/AI 
> frameworks with Spark to leverage its power, for example, TFRecords data 
> source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project 
> Hydrogen, this SPIP takes a different angle at Spark + AI unification.
> None of the integrations are possible without exchanging data between Spark 
> and external DL/AI frameworks. And the performance matters. However, there 
> doesn’t exist a standard way to exchange data and hence implementation and 
> performance optimization fall into pieces. For example, TensorFlowOnSpark 
> uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and 
> save data and pass the RDD records to TensorFlow in Python. And TensorFrames 
> converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s 
> Java API. How can we reduce the complexity?
> The proposal here is to standardize the data exchange interface (or format) 
> between Spark and DL/AI frameworks and optimize data conversion from/to this 
> interface.  So DL/AI frameworks can leverage Spark to load data virtually 
> from anywhere without spending extra effort building complex data solutions, 
> like reading features from a production data warehouse or streaming model 
> inference. Spark users can use DL/AI frameworks without learning specific 
> data APIs implemented there. And developers from both sides can work on 
> performance optimizations independently given the interface itself doesn’t 
> introduce big overhead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames for the entire partition

2019-01-24 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751785#comment-16751785
 ] 

Bryan Cutler commented on SPARK-26412:
--

[~mengxr] I think Arrow record batches would be a much more ideal way to 
connect with other frameworks. Making the conversion to Pandas carries some 
overhead and the Arrow format/types are more solidly defined. It is also better 
suited to be used with an iterator - most of the Arrow IPC mechanisms operate 
on streams of record batches. Is this proposal instead of SPARK-24579 SPIP: 
Standardize Optimized Data Exchange between Spark and DL/AI frameworks?

> Allow Pandas UDF to take an iterator of pd.DataFrames for the entire partition
> --
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Pandas UDFs are the ideal connection between PySpark and DL model inference
> workloads. However, the user needs to load the model file first to make
> predictions. It is common to see models of size ~100MB or bigger. If the
> Pandas UDF execution is limited to batch scope, the user needs to repeatedly load
> the same model for every batch in the same Python worker process, which is
> inefficient. I created this JIRA to discuss possible solutions.
> Essentially we need to support "start()" and "finish()" besides "apply". We
> can either provide those interfaces or simply provide users an iterator of
> batches as pd.DataFrames and let user code handle it.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration

2019-01-24 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751771#comment-16751771
 ] 

Bryan Cutler commented on SPARK-26410:
--

This could be useful to have, but it does seem a little strange to bind batch 
size to a udf. To me, batch size seems more related to the data being used, and 
merging different batch sizes could complicate the behavior. Still, I can see 
how someone might want to change batch size at different points in a session.
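For reference, the session-wide knob being discussed is, as far as I recall, the Arrow batch-size conf, which currently applies to every Pandas UDF in the session rather than per UDF:

{code:java}
// One batch size for all Pandas UDFs in the session today (key quoted from memory):
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 120)
{code}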

> Support per Pandas UDF configuration
> 
>
> Key: SPARK-26410
> URL: https://issues.apache.org/jira/browse/SPARK-26410
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> We use a "maxRecordsPerBatch" conf to control the batch sizes. However, the 
> "right" batch size usually depends on the task itself. It would be nice if 
> users could configure the batch size when they declare the Pandas UDF.
> This is orthogonal to SPARK-23258 (using max buffer size instead of row 
> count).
> Besides API, we should also discuss how to merge Pandas UDFs of different 
> configurations. For example,
> {code}
> df.select(predict1(col("features")), predict2(col("features")))
> {code}
> when predict1 requests 100 rows per batch, while predict2 requests 120 rows 
> per batch.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19591) Add sample weights to decision trees

2019-01-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19591.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21632
[https://github.com/apache/spark/pull/21632]

> Add sample weights to decision trees
> 
>
> Key: SPARK-19591
> URL: https://issues.apache.org/jira/browse/SPARK-19591
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Major
> Fix For: 3.0.0
>
>
> Add sample weights to decision trees.  See [SPARK-9478] for details on the 
> design.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751767#comment-16751767
 ] 

Hyukjin Kwon commented on SPARK-26677:
--

Yes, please read the linked PR above.

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Blocker
>  Labels: correctness
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26187) Stream-stream left outer join returns outer nulls for already matched rows

2019-01-24 Thread Jungtaek Lim (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-26187.
--
Resolution: Duplicate

> Stream-stream left outer join returns outer nulls for already matched rows
> --
>
> Key: SPARK-26187
> URL: https://issues.apache.org/jira/browse/SPARK-26187
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Pavel Chernikov
>Priority: Major
>
> This is basically the same issue as SPARK-26154, but with a slightly simpler 
> and more concrete reproducible example:
> {code:java}
> val rateStream = session.readStream
>  .format("rate")
>  .option("rowsPerSecond", 1)
>  .option("numPartitions", 1)
>  .load()
> import org.apache.spark.sql.functions._
> val fooStream = rateStream
>  .select(col("value").as("fooId"), col("timestamp").as("fooTime"))
> val barStream = rateStream
>  // Introduce misses for ease of debugging
>  .where(col("value") % 2 === 0)
>  .select(col("value").as("barId"), col("timestamp").as("barTime")){code}
> If barStream is configured to happen earlier than fooStream based on the time 
> range condition, then everything is all right: no previously matched records 
> are flushed with outer NULLs:
> {code:java}
> val query = fooStream
>  .withWatermark("fooTime", "5 seconds")
>  .join(
>barStream.withWatermark("barTime", "5 seconds"),
>expr("""
>  barId = fooId AND
>  fooTime >= barTime AND
>  fooTime <= barTime + interval 5 seconds
> """),
>joinType = "leftOuter"
>  )
>  .writeStream
>  .format("console")
>  .option("truncate", false)
>  .start(){code}
> It's easy to observe that only odd rows are flushed with NULLs on the right:
> {code:java}
> [info] Batch: 1 
> [info] +-+---+-+---+ 
> [info] |fooId|fooTime                |barId|barTime                | 
> [info] +-+---+-+---+ 
> [info] |0    |2018-11-27 13:12:34.976|0    |2018-11-27 13:12:34.976| 
> [info] |6    |2018-11-27 13:12:40.976|6    |2018-11-27 13:12:40.976| 
> [info] |10   |2018-11-27 13:12:44.976|10   |2018-11-27 13:12:44.976| 
> [info] |8    |2018-11-27 13:12:42.976|8    |2018-11-27 13:12:42.976| 
> [info] |2    |2018-11-27 13:12:36.976|2    |2018-11-27 13:12:36.976| 
> [info] |4    |2018-11-27 13:12:38.976|4    |2018-11-27 13:12:38.976| 
> [info] +-+---+-+---+ 
> [info] Batch: 2 
> [info] +-+---+-+---+ 
> [info] |fooId|fooTime                |barId|barTime                | 
> [info] +-+---+-+---+ 
> [info] |1    |2018-11-27 13:12:35.976|null |null                   | 
> [info] |3    |2018-11-27 13:12:37.976|null |null                   | 
> [info] |12   |2018-11-27 13:12:46.976|12   |2018-11-27 13:12:46.976| 
> [info] |18   |2018-11-27 13:12:52.976|18   |2018-11-27 13:12:52.976| 
> [info] |14   |2018-11-27 13:12:48.976|14   |2018-11-27 13:12:48.976| 
> [info] |20   |2018-11-27 13:12:54.976|20   |2018-11-27 13:12:54.976| 
> [info] |16   |2018-11-27 13:12:50.976|16   |2018-11-27 13:12:50.976| 
> [info] +-+---+-+---+ 
> [info] Batch: 3 
> [info] +-+---+-+---+ 
> [info] |fooId|fooTime                |barId|barTime                | 
> [info] +-+---+-+---+ 
> [info] |26   |2018-11-27 13:13:00.976|26   |2018-11-27 13:13:00.976| 
> [info] |22   |2018-11-27 13:12:56.976|22   |2018-11-27 13:12:56.976| 
> [info] |7    |2018-11-27 13:12:41.976|null |null                   | 
> [info] |9    |2018-11-27 13:12:43.976|null |null                   | 
> [info] |28   |2018-11-27 13:13:02.976|28   |2018-11-27 13:13:02.976| 
> [info] |5    |2018-11-27 13:12:39.976|null |null                   | 
> [info] |11   |2018-11-27 13:12:45.976|null |null                   | 
> [info] |13   |2018-11-27 13:12:47.976|null |null                   | 
> [info] |24   |2018-11-27 13:12:58.976|24   |2018-11-27 13:12:58.976| 
> [info] +-+---+-+---+
> {code}
> On the other hand, if we switch the ordering and now fooStream is happening 
> earlier based on time range condition:
> {code:java}
> val query = fooStream
>  .withWatermark("fooTime", "5 seconds")
>  .join(
>barStream.withWatermark("barTime", "5 seconds"),
>expr("""
>  barId = fooId AND
>  barTime >= fooTime AND
>  barTime <= fooTime + interval 5 seconds
> """),
>joinType = "leftOuter"
>  )
>  .writeStream
>  .format("console")
>  .option("truncate", false)
>  .start(){code}
> Some, not all, 

[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-24 Thread ANAND CHINNAKANNAN (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751757#comment-16751757
 ] 

ANAND CHINNAKANNAN commented on SPARK-26677:


[~hyukjin.kwon] - Do you know whether the issue is in ParquetFileReader? It looks 
like the file reader has an issue with overriding duplicate row keys.

Let me know your thoughts. 

 

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Blocker
>  Labels: correctness
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26724) Non negative coefficients for LinearRegression

2019-01-24 Thread Alex Chang (JIRA)
Alex Chang created SPARK-26724:
--

 Summary: Non negative coefficients for LinearRegression
 Key: SPARK-26724
 URL: https://issues.apache.org/jira/browse/SPARK-26724
 Project: Spark
  Issue Type: Question
  Components: MLlib
Affects Versions: 2.4.0
Reporter: Alex Chang


Hi, 

 

For 
[pyspark.ml.regression.LinearRegression|http://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=linearregression#pyspark.ml.regression.LinearRegression],
 is there any approach or API to force coefficients to be non negative?

 

Thanks

Alex



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13587) Support virtualenv in PySpark

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13587:


Assignee: Apache Spark

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not for 
> complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, and 
> not easy to switch between different environments)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to a distributed environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13587) Support virtualenv in PySpark

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13587:


Assignee: (was: Apache Spark)

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Jeff Zhang
>Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not for 
> complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, and 
> not easy to switch between different environments)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to a distributed environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13587) Support virtualenv in PySpark

2019-01-24 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-13587:
--

Assignee: (was: Marcelo Vanzin)

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Jeff Zhang
>Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not for 
> complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, and 
> not easy to switch between different environments)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to a distributed environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13587) Support virtualenv in PySpark

2019-01-24 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-13587:
--

Assignee: Marcelo Vanzin  (was: Jeff Zhang)

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Jeff Zhang
>Assignee: Marcelo Vanzin
>Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not for 
> complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, and 
> not easy to switch between different environments)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to a distributed environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26697) ShuffleBlockFetcherIterator can log block sizes in addition to num blocks

2019-01-24 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26697:
--

Assignee: Imran Rashid

> ShuffleBlockFetcherIterator can log block sizes in addition to num blocks
> -
>
> Key: SPARK-26697
> URL: https://issues.apache.org/jira/browse/SPARK-26697
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
>
> Every so often I find myself looking at executor logs, wondering why 
> something is going wrong (failed exec, or seems to be stuck etc) and I wish I 
> had a bit more info about shuffle sizes.  {{ShuffleBlockFetcherIterator}} 
> logs the number of local & remote blocks, but not their sizes.  It would be 
> really easy to add in size info too.
> eg. instead of 
> {noformat}
> 19/01/22 23:10:49.978 Executor task launch worker for task 8 INFO 
> ShuffleBlockFetcherIterator: Getting 2 non-empty blocks including 1 local 
> blocks and 1 remote blocks
> {noformat}
> it should be 
> {noformat}
> 19/01/22 23:10:49.978 Executor task launch worker for task 8 INFO 
> ShuffleBlockFetcherIterator: Getting 2 (194.0 B) non-empty blocks including 1 
> (97.0 B) local blocks and 1 (97.0 B) remote blocks
> {noformat}
> I know this is a really minor change, but I've wanted it multiple times, 
> seems worth it.
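
A rough sketch of the kind of message construction being proposed (simplified; this is not the actual ShuffleBlockFetcherIterator code, and bytesToString below is a local stand-in for Spark's own byte formatter):
{code}
// Toy byte formatter standing in for Spark's human-readable size formatting.
def bytesToString(size: Long): String = f"${size.toDouble}%.1f B"

// Build the proposed log line from per-block sizes.
def fetchLogLine(localBlockSizes: Seq[Long], remoteBlockSizes: Seq[Long]): String = {
  val total = localBlockSizes.size + remoteBlockSizes.size
  val totalBytes = localBlockSizes.sum + remoteBlockSizes.sum
  s"Getting $total (${bytesToString(totalBytes)}) non-empty blocks including " +
    s"${localBlockSizes.size} (${bytesToString(localBlockSizes.sum)}) local blocks and " +
    s"${remoteBlockSizes.size} (${bytesToString(remoteBlockSizes.sum)}) remote blocks"
}

// fetchLogLine(Seq(97L), Seq(97L)) produces the "after" example shown above.
{code}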



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26723) Spark web UI only shows parts of SQL query graphs for queries with persist operations

2019-01-24 Thread Vladimir Matveev (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Matveev updated SPARK-26723:
-
Attachment: Screen Shot 2019-01-24 at 4.13.14 PM.png
Screen Shot 2019-01-24 at 4.13.02 PM.png

> Spark web UI only shows parts of SQL query graphs for queries with persist 
> operations
> -
>
> Key: SPARK-26723
> URL: https://issues.apache.org/jira/browse/SPARK-26723
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: Vladimir Matveev
>Priority: Major
> Attachments: Screen Shot 2019-01-24 at 4.13.02 PM.png, Screen Shot 
> 2019-01-24 at 4.13.14 PM.png
>
>
> Currently it looks like the SQL view in Spark UI will truncate the graph on 
> the nodes corresponding to persist operations on the dataframe, only showing 
> everything after "LocalTableScan". This is *very* inconvenient, because in a 
> common case when you have a heavy computation and want to persist it before 
> writing to multiple outputs with some minor preprocessing, you lose almost 
> the entire graph with potentially very useful information in it.
> The query plans below the graph, however, show the full query, including all 
> computations before persists. Unfortunately, for complex queries looking into 
> the plan is unfeasible, and graph visualization becomes a very helpful tool; 
> with persist, it is apparently broken.
> You can verify it in Spark Shell with a very simple example:
> {code}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> val query = Vector(1, 2, 3).toDF()
>   .select(($"value".cast("long") * f.rand).as("value"))
>   .withColumn("valueAvg", f.avg($"value") over Window.orderBy("value"))
> query.show()
> query.persist().show()
> {code}
> Here the same query is executed first without persist, and then with it. If 
> you now navigate to the Spark web UI SQL page, you'll see two queries, but 
> their graphs will be radically different: the one without persist will 
> contain the whole transformation with exchange, sort and window steps, while 
> the one with persist will contain only a LocalTableScan step with some 
> intermediate transformations needed for `show`.
> After some looking into Spark code, I think that the reason for this is that 
> the `org.apache.spark.sql.execution.SparkPlanInfo#fromSparkPlan` method 
> (which is used to serialize a plan before emitting the 
> SparkListenerSQLExecutionStart event) constructs the `SparkPlanInfo` object 
> from a `SparkPlan` object incorrectly, because if you invoke the `toString` 
> method on `SparkPlan` you'll see the entire plan, but the `SparkPlanInfo` 
> object will only contain nodes corresponding to actions after `persist`. 
> However, my knowledge of Spark internals is not deep enough to understand how 
> to fix this, and how SparkPlanInfo.fromSparkPlan is different from what 
> SparkPlan.toString does.
> This can be observed on Spark 2.3.2, but given that 2.4.0 code of 
> SparkPlanInfo does not seem to change much since 2.3.2, I'd expect that it 
> could be reproduced there too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26697) ShuffleBlockFetcherIterator can log block sizes in addition to num blocks

2019-01-24 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26697.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23621
[https://github.com/apache/spark/pull/23621]

> ShuffleBlockFetcherIterator can log block sizes in addition to num blocks
> -
>
> Key: SPARK-26697
> URL: https://issues.apache.org/jira/browse/SPARK-26697
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 3.0.0
>
>
> Every so often I find myself looking at executor logs, wondering why 
> something is going wrong (failed exec, or seems to be stuck etc) and I wish I 
> had a bit more info about shuffle sizes.  {{ShuffleBlockFetcherIterator}} 
> logs the number of local & remote blocks, but not their sizes.  It would be 
> really easy to add in size info too.
> eg. instead of 
> {noformat}
> 19/01/22 23:10:49.978 Executor task launch worker for task 8 INFO 
> ShuffleBlockFetcherIterator: Getting 2 non-empty blocks including 1 local 
> blocks and 1 remote blocks
> {noformat}
> it should be 
> {noformat}
> 19/01/22 23:10:49.978 Executor task launch worker for task 8 INFO 
> ShuffleBlockFetcherIterator: Getting 2 (194.0 B) non-empty blocks including 1 
> (97.0 B) local blocks and 1 (97.0 B) remote blocks
> {noformat}
> I know this is a really minor change, but I've wanted it multiple times, 
> seems worth it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26723) Spark web UI only shows parts of SQL query graphs for queries with persist operations

2019-01-24 Thread Vladimir Matveev (JIRA)
Vladimir Matveev created SPARK-26723:


 Summary: Spark web UI only shows parts of SQL query graphs for 
queries with persist operations
 Key: SPARK-26723
 URL: https://issues.apache.org/jira/browse/SPARK-26723
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.2
Reporter: Vladimir Matveev


Currently it looks like the SQL view in Spark UI will truncate the graph on the 
nodes corresponding to persist operations on the dataframe, only showing 
everything after "LocalTableScan". This is *very* inconvenient, because in a 
common case when you have a heavy computation and want to persist it before 
writing to multiple outputs with some minor preprocessing, you lose almost the 
entire graph with potentially very useful information in it.

The query plans below the graph, however, show the full query, including all 
computations before persists. Unfortunately, for complex queries looking into 
the plan is unfeasible, and graph visualization becomes a very helpful tool; 
with persist, it is apparently broken.

You can verify it in Spark Shell with a very simple example:
{code}
import org.apache.spark.sql.{functions => f}
import org.apache.spark.sql.expressions.Window

val query = Vector(1, 2, 3).toDF()
  .select(($"value".cast("long") * f.rand).as("value"))
  .withColumn("valueAvg", f.avg($"value") over Window.orderBy("value"))
query.show()
query.persist().show()
{code}
Here the same query is executed first without persist, and then with it. If you 
now navigate to the Spark web UI SQL page, you'll see two queries, but their 
graphs will be radically different: the one without persist will contain the 
whole transformation with exchange, sort and window steps, while the one with 
persist will contain only a LocalTableScan step with some intermediate 
transformations needed for `show`.

After some looking into Spark code, I think that the reason for this is that 
the `org.apache.spark.sql.execution.SparkPlanInfo#fromSparkPlan` method (which 
is used to serialize a plan before emitting the SparkListenerSQLExecutionStart 
event) constructs the `SparkPlanInfo` object from a `SparkPlan` object 
incorrectly, because if you invoke the `toString` method on `SparkPlan` you'll 
see the entire plan, but the `SparkPlanInfo` object will only contain nodes 
corresponding to actions after `persist`. However, my knowledge of Spark 
internals is not deep enough to understand how to fix this, and how 
SparkPlanInfo.fromSparkPlan is different from what SparkPlan.toString does.

This can be observed on Spark 2.3.2, but given that 2.4.0 code of SparkPlanInfo 
does not seem to change much since 2.3.2, I'd expect that it could be 
reproduced there too.
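
As a diagnostic aid (a sketch only, not a fix), the full physical plan string can be printed for both runs and compared against what the SQL tab renders; this relies only on the public queryExecution handle:
{code}
// Diagnostic sketch: dump the executed plan string for the two runs above and
// compare with the (truncated) graph shown in the web UI SQL tab.
def dumpPlan(label: String, df: org.apache.spark.sql.DataFrame): Unit = {
  println(s"=== $label ===")
  println(df.queryExecution.executedPlan.toString)
}

dumpPlan("without persist", query)
dumpPlan("with persist", query.persist())
// Per this report, the plan string still shows the full lineage in both cases,
// while the UI graph for the persisted run starts at LocalTableScan.
{code}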



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26187) Stream-stream left outer join returns outer nulls for already matched rows

2019-01-24 Thread Pavel Chernikov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751708#comment-16751708
 ] 

Pavel Chernikov commented on SPARK-26187:
-

[~kabhwan], I'm definitely okay with that. Feel free to close this one. I'm 
glad that this example helped. Thanks for creating a PR!

> Stream-stream left outer join returns outer nulls for already matched rows
> --
>
> Key: SPARK-26187
> URL: https://issues.apache.org/jira/browse/SPARK-26187
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Pavel Chernikov
>Priority: Major
>
> This is basically the same issue as SPARK-26154, but with a slightly simpler 
> and more concrete reproducible example:
> {code:java}
> val rateStream = session.readStream
>  .format("rate")
>  .option("rowsPerSecond", 1)
>  .option("numPartitions", 1)
>  .load()
> import org.apache.spark.sql.functions._
> val fooStream = rateStream
>  .select(col("value").as("fooId"), col("timestamp").as("fooTime"))
> val barStream = rateStream
>  // Introduce misses for ease of debugging
>  .where(col("value") % 2 === 0)
>  .select(col("value").as("barId"), col("timestamp").as("barTime")){code}
> If barStream is configured to happen earlier than fooStream based on the time 
> range condition, then everything is all right: no previously matched records 
> are flushed with outer NULLs:
> {code:java}
> val query = fooStream
>  .withWatermark("fooTime", "5 seconds")
>  .join(
>barStream.withWatermark("barTime", "5 seconds"),
>expr("""
>  barId = fooId AND
>  fooTime >= barTime AND
>  fooTime <= barTime + interval 5 seconds
> """),
>joinType = "leftOuter"
>  )
>  .writeStream
>  .format("console")
>  .option("truncate", false)
>  .start(){code}
> It's easy to observe that only odd rows are flushed with NULLs on the right:
> {code:java}
> [info] Batch: 1 
> [info] +-+---+-+---+ 
> [info] |fooId|fooTime                |barId|barTime                | 
> [info] +-+---+-+---+ 
> [info] |0    |2018-11-27 13:12:34.976|0    |2018-11-27 13:12:34.976| 
> [info] |6    |2018-11-27 13:12:40.976|6    |2018-11-27 13:12:40.976| 
> [info] |10   |2018-11-27 13:12:44.976|10   |2018-11-27 13:12:44.976| 
> [info] |8    |2018-11-27 13:12:42.976|8    |2018-11-27 13:12:42.976| 
> [info] |2    |2018-11-27 13:12:36.976|2    |2018-11-27 13:12:36.976| 
> [info] |4    |2018-11-27 13:12:38.976|4    |2018-11-27 13:12:38.976| 
> [info] +-+---+-+---+ 
> [info] Batch: 2 
> [info] +-+---+-+---+ 
> [info] |fooId|fooTime                |barId|barTime                | 
> [info] +-+---+-+---+ 
> [info] |1    |2018-11-27 13:12:35.976|null |null                   | 
> [info] |3    |2018-11-27 13:12:37.976|null |null                   | 
> [info] |12   |2018-11-27 13:12:46.976|12   |2018-11-27 13:12:46.976| 
> [info] |18   |2018-11-27 13:12:52.976|18   |2018-11-27 13:12:52.976| 
> [info] |14   |2018-11-27 13:12:48.976|14   |2018-11-27 13:12:48.976| 
> [info] |20   |2018-11-27 13:12:54.976|20   |2018-11-27 13:12:54.976| 
> [info] |16   |2018-11-27 13:12:50.976|16   |2018-11-27 13:12:50.976| 
> [info] +-+---+-+---+ 
> [info] Batch: 3 
> [info] +-+---+-+---+ 
> [info] |fooId|fooTime                |barId|barTime                | 
> [info] +-+---+-+---+ 
> [info] |26   |2018-11-27 13:13:00.976|26   |2018-11-27 13:13:00.976| 
> [info] |22   |2018-11-27 13:12:56.976|22   |2018-11-27 13:12:56.976| 
> [info] |7    |2018-11-27 13:12:41.976|null |null                   | 
> [info] |9    |2018-11-27 13:12:43.976|null |null                   | 
> [info] |28   |2018-11-27 13:13:02.976|28   |2018-11-27 13:13:02.976| 
> [info] |5    |2018-11-27 13:12:39.976|null |null                   | 
> [info] |11   |2018-11-27 13:12:45.976|null |null                   | 
> [info] |13   |2018-11-27 13:12:47.976|null |null                   | 
> [info] |24   |2018-11-27 13:12:58.976|24   |2018-11-27 13:12:58.976| 
> [info] +-+---+-+---+
> {code}
> On the other hand, if we switch the ordering and now fooStream is happening 
> earlier based on time range condition:
> {code:java}
> val query = fooStream
>  .withWatermark("fooTime", "5 seconds")
>  .join(
>barStream.withWatermark("barTime", "5 seconds"),
>expr("""
>  barId = fooId AND
>  barTime >= fooTime AND
>  barTime <= fooTime + interval 5 seconds
>   

[jira] [Commented] (SPARK-26718) structured streaming fetched wrong current offset from kafka

2019-01-24 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751706#comment-16751706
 ] 

Gabor Somogyi commented on SPARK-26718:
---

+1 on [~kabhwan]'s suggestion

> structured streaming fetched wrong current offset from kafka
> 
>
> Key: SPARK-26718
> URL: https://issues.apache.org/jira/browse/SPARK-26718
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Ryne Yang
>Priority: Major
>
> when running spark structured streaming using lib: `"org.apache.spark" %% 
> "spark-sql-kafka-0-10" % "2.4.0"`, we keep getting error regarding current 
> offset fetching:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 0.0 (TID 3, qa2-hdp-4.acuityads.org, executor 2): 
> java.lang.AssertionError: assertion failed: latest offset -9223372036854775808 does not equal -1
> at scala.Predef$.assert(Predef.scala:170)
> at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.resolveRange(KafkaMicroBatchReader.scala:371)
> at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.<init>(KafkaMicroBatchReader.scala:329)
> at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartition.createPartitionReader(KafkaMicroBatchReader.scala:314)
> at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:42)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> For some reason, it looks like fetchLatestOffset returned Long.MIN_VALUE for 
> one of the partitions. I checked the structured streaming checkpoint and it 
> was correct; it's the currentAvailableOffset that was set to Long.MIN_VALUE.
> Kafka broker version: 1.1.0.
> Lib we used:
> {{libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % 
> "2.4.0"}}
> How to reproduce:
> basically we started a structured streaming query and subscribed to a topic 
> with 4 partitions, then produced some messages into the topic; the job crashed 
> and logged the stack trace above.
> also the committed offsets seem fine as we see in the logs: 
> {code:java}
> === Streaming Query ===
> Identifier: [id = c46c67ee-3514-4788-8370-a696837b21b1, runId = 
> 31878627-d473-4ee8-955d-d4d3f3f45eb9]
> Current Committed Offsets: {KafkaV2[Subscribe[REVENUEEVENT]]: 
> {"REVENUEEVENT":{"0":1}}}
> Current Available Offsets: {KafkaV2[Subscribe[REVENUEEVENT]]: 
> {"REVENUEEVENT":{"0":-9223372036854775808}}}
> {code}
> so spark streaming recorded the correct value for partition: 0, but the 
> current available offsets returned from kafka is showing Long.MIN_VALUE. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26530) Validate heartheat arguments in HeartbeatReceiver

2019-01-24 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26530:
--

Assignee: liupengcheng

> Validate heartheat arguments in HeartbeatReceiver
> -
>
> Key: SPARK-26530
> URL: https://issues.apache.org/jira/browse/SPARK-26530
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: liupengcheng
>Assignee: liupengcheng
>Priority: Major
>
> Currently, heartbeat-related arguments are not validated in Spark, so if these 
> args are improperly specified, the application may run for a while and not 
> fail until the max executor failures are reached (especially with 
> spark.dynamicAllocation.enabled=true), which may waste resources.
> We should do validation before submitting to the cluster.
>  
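
A minimal sketch of the sort of check intended (an illustration only, not the actual patch; the specific relation enforced between the two settings is an assumption):
{code}
import org.apache.spark.SparkConf

// Illustration only: fail fast at submit time if the heartbeat interval cannot
// fit inside the network timeout, instead of failing slowly at runtime.
def validateHeartbeatArgs(conf: SparkConf): Unit = {
  val heartbeatIntervalMs = conf.getTimeAsMs("spark.executor.heartbeatInterval", "10s")
  val networkTimeoutMs = conf.getTimeAsMs("spark.network.timeout", "120s")
  require(heartbeatIntervalMs < networkTimeoutMs,
    s"spark.executor.heartbeatInterval ($heartbeatIntervalMs ms) should be less than " +
      s"spark.network.timeout ($networkTimeoutMs ms)")
}
{code}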



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26721) Bug in feature importance calculation in GBM (and possibly other decision tree classifiers)

2019-01-24 Thread Daniel Jumper (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Jumper updated SPARK-26721:
--
Priority: Blocker  (was: Critical)

> Bug in feature importance calculation in GBM (and possibly other decision 
> tree classifiers)
> ---
>
> Key: SPARK-26721
> URL: https://issues.apache.org/jira/browse/SPARK-26721
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Daniel Jumper
>Priority: Blocker
>
> The feature importance calculation in 
> org.apache.spark.ml.classification.GBTClassificationModel.featureImportances 
> follows a flawed implementation from scikit-learn. An error was recently 
> discovered and updated in scikit-learn version 0.20.0. This error is 
> inherited in the spark implementation and needs to be fixed here as well.
> As described in the scikit-learn release notes 
> ([https://scikit-learn.org/stable/whats_new.html#version-0-20-0]):
> {quote}Fix Fixed a bug in ensemble.GradientBoostingRegressor and 
> ensemble.GradientBoostingClassifier to have feature importances summed and 
> then normalized, rather than normalizing on a per-tree basis. The previous 
> behavior over-weighted the Gini importance of features that appear in later 
> stages. This issue only affected feature importances. #11176 by Gil Forsyth.
> {quote}
> Full discussion of this error and debate ultimately validating the 
> correctness of the change can be found in the comment thread of the 
> scikit-learn pull request: 
> [https://github.com/scikit-learn/scikit-learn/pull/11176] 
>  
> I believe the main change required would be to the featureImportances 
> function in mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala , 
> however, I do not have the experience to make this change myself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26721) Bug in feature importance calculation in GBM (and possibly other decision tree classifiers)

2019-01-24 Thread Daniel Jumper (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Jumper updated SPARK-26721:
--
Description: 
The feature importance calculation in 
org.apache.spark.ml.classification.GBTClassificationModel.featureImportances 
follows a flawed implementation from scikit-learn. An error was recently 
discovered and updated in scikit-learn version 0.20.0. This error is inherited 
in the spark implementation and needs to be fixed here as well.

As described in the scikit-learn release notes 
([https://scikit-learn.org/stable/whats_new.html#version-0-20-0]):
{quote}Fix Fixed a bug in ensemble.GradientBoostingRegressor and 
ensemble.GradientBoostingClassifier to have feature importances summed and then 
normalized, rather than normalizing on a per-tree basis. The previous behavior 
over-weighted the Gini importance of features that appear in later stages. This 
issue only affected feature importances. #11176 by Gil Forsyth.
{quote}
Full discussion of this error and debate ultimately validating the correctness 
of the change can be found in the comment thread of the scikit-learn pull 
request: [https://github.com/scikit-learn/scikit-learn/pull/11176] 

 

I believe the main change required would be to the featureImportances function 
in mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala , however, I 
do not have the experience to make this change myself.

  was:
The feature importance calculation in 
org.apache.spark.ml.classification.GBTClassificationModel.featureImportances 
follows a flawed implementation from scikit-learn. An error was recently 
discovered and updated in scikit-learn version 0.20.0. This error is inherited 
in the spark implementation and needs to be fixed here as well.

As described in the [scikit-learn release 
notes|[https://scikit-learn.org/stable/whats_new.html#version-0-20-0]|https://scikit-learn.org/stable/whats_new.html#version-0-20-0]:]
  :
{quote}
Fix Fixed a bug in ensemble.GradientBoostingRegressor and 
ensemble.GradientBoostingClassifier to have feature importances summed and then 
normalized, rather than normalizing on a per-tree basis. The previous behavior 
over-weighted the Gini importance of features that appear in later stages. This 
issue only affected feature importances. #11176 by Gil Forsyth.
{quote}
Full discussion of this error and debate ultimately validating the correctness 
of the change can be found in the comment thread of the scikit-learn pull 
request: [https://github.com/scikit-learn/scikit-learn/pull/11176] 

 

I believe the main change required would be to the featureImportances function 
in mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala , however, I 
do not have the experience to make this change myself.


> Bug in feature importance calculation in GBM (and possibly other decision 
> tree classifiers)
> ---
>
> Key: SPARK-26721
> URL: https://issues.apache.org/jira/browse/SPARK-26721
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Daniel Jumper
>Priority: Critical
>
> The feature importance calculation in 
> org.apache.spark.ml.classification.GBTClassificationModel.featureImportances 
> follows a flawed implementation from scikit-learn. An error was recently 
> discovered and updated in scikit-learn version 0.20.0. This error is 
> inherited in the spark implementation and needs to be fixed here as well.
> As described in the scikit-learn release notes 
> ([https://scikit-learn.org/stable/whats_new.html#version-0-20-0]):
> {quote}Fix Fixed a bug in ensemble.GradientBoostingRegressor and 
> ensemble.GradientBoostingClassifier to have feature importances summed and 
> then normalized, rather than normalizing on a per-tree basis. The previous 
> behavior over-weighted the Gini importance of features that appear in later 
> stages. This issue only affected feature importances. #11176 by Gil Forsyth.
> {quote}
> Full discussion of this error and debate ultimately validating the 
> correctness of the change can be found in the comment thread of the 
> scikit-learn pull request: 
> [https://github.com/scikit-learn/scikit-learn/pull/11176] 
>  
> I believe the main change required would be to the featureImportances 
> function in mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala , 
> however, I do not have the experience to make this change myself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26718) structured streaming fetched wrong current offset from kafka

2019-01-24 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751694#comment-16751694
 ] 

Jungtaek Lim commented on SPARK-26718:
--

[~linehrr]
Thanks for the analysis. I think allowing Long.MaxValue for end users sounds 
convenient, and end users already rely on it, so ensuring 'off' does not 
overflow would be ideal. This can be ensured by modifying the code as:

{code}
val prorate = limit * (size / total)
val prorateLong = (if (prorate < 1) Math.ceil(prorate) else Math.floor(prorate)).toLong
// Cap at Long.MaxValue to avoid overflowing when begin is already close to it.
val off = if (prorateLong > Long.MaxValue - begin) Long.MaxValue else begin + prorateLong
Math.min(end, off)
{code}

Please submit a patch if you would like to. If you don't plan to submit a 
patch, please let me know so that I can take this up. Thanks!
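
For context, a tiny standalone check (plain Scala, no Spark needed) of why the guard is required: adding a positive prorated amount to a begin offset near Long.MaxValue silently wraps to a negative value without it:
{code}
// Plain Scala illustration of the overflow the guard above protects against.
val begin = Long.MaxValue - 10L
val prorateLong = 100L

val unguarded = begin + prorateLong  // wraps around to a large negative value
val guarded =
  if (prorateLong > Long.MaxValue - begin) Long.MaxValue else begin + prorateLong

println(unguarded)  // -9223372036854775719
println(guarded)    // 9223372036854775807 (Long.MaxValue)
{code}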

> structured streaming fetched wrong current offset from kafka
> 
>
> Key: SPARK-26718
> URL: https://issues.apache.org/jira/browse/SPARK-26718
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Ryne Yang
>Priority: Major
>
> when running spark structured streaming using lib: `"org.apache.spark" %% 
> "spark-sql-kafka-0-10" % "2.4.0"`, we keep getting error regarding current 
> offset fetching:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 0.0 (TID 3, qa2-hdp-4.acuityads.org, executor 2): 
> java.lang.AssertionError: assertion failed: latest offset -9223372036854775808 does not equal -1
> at scala.Predef$.assert(Predef.scala:170)
> at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.resolveRange(KafkaMicroBatchReader.scala:371)
> at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.<init>(KafkaMicroBatchReader.scala:329)
> at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartition.createPartitionReader(KafkaMicroBatchReader.scala:314)
> at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:42)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> For some reason, it looks like fetchLatestOffset returned Long.MIN_VALUE for 
> one of the partitions. I checked the structured streaming checkpoint and it 
> was correct; it's the currentAvailableOffset that was set to Long.MIN_VALUE.
> Kafka broker version: 1.1.0.
> Lib we used:
> {{libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % 
> "2.4.0"}}
> How to reproduce:
> basically we started a structured streaming query and subscribed to a topic 
> with 4 partitions, then produced some messages into the topic; the job crashed 
> and logged the stack trace above.
> also the committed offsets seem fine as we see in the logs: 
> {code:java}
> === Streaming Query ===
> Identifier: [id = c46c67ee-3514-4788-8370-a696837b21b1, runId = 
> 31878627-d473-4ee8-955d-d4d3f3f45eb9]
> Current Committed Offsets: {KafkaV2[Subscribe[REVENUEEVENT]]: 
> {"REVENUEEVENT":{"0":1}}}
> Current Available Offsets: {KafkaV2[Subscribe[REVENUEEVENT]]: 
> {"REVENUEEVENT":{"0":-9223372036854775808}}}
> {code}
> so spark streaming recorded the correct value for partition: 0, but the 
> current available offsets returned from kafka is showing Long.MIN_VALUE. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SPARK-26721) Bug in feature importance calculation in GBM (and possibly other decision tree classifiers)

2019-01-24 Thread Daniel Jumper (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Jumper updated SPARK-26721:
--
Description: 
The feature importance calculation in 
org.apache.spark.ml.classification.GBTClassificationModel.featureImportances 
follows a flawed implementation from scikit-learn resulting in incorrect 
importance values. This error was recently discovered and updated in 
scikit-learn version 0.20.0. This error is inherited in the spark 
implementation and needs to be fixed here as well.

As described in the scikit-learn release notes 
([https://scikit-learn.org/stable/whats_new.html#version-0-20-0]):
{quote}Fix Fixed a bug in ensemble.GradientBoostingRegressor and 
ensemble.GradientBoostingClassifier to have feature importances summed and then 
normalized, rather than normalizing on a per-tree basis. The previous behavior 
over-weighted the Gini importance of features that appear in later stages. This 
issue only affected feature importances. #11176 by Gil Forsyth.
{quote}
Full discussion of this error and debate ultimately validating the correctness 
of the change can be found in the comment thread of the scikit-learn pull 
request: [https://github.com/scikit-learn/scikit-learn/pull/11176] 

 

I believe the main change required would be to the featureImportances function 
in mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala , however, I 
do not have the experience to make this change myself.

  was:
The feature importance calculation in 
org.apache.spark.ml.classification.GBTClassificationModel.featureImportances 
follows a flawed implementation from scikit-learn. An error was recently 
discovered and updated in scikit-learn version 0.20.0. This error is inherited 
in the spark implementation and needs to be fixed here as well.

As described in the scikit-learn release notes 
([https://scikit-learn.org/stable/whats_new.html#version-0-20-0]):
{quote}Fix Fixed a bug in ensemble.GradientBoostingRegressor and 
ensemble.GradientBoostingClassifier to have feature importances summed and then 
normalized, rather than normalizing on a per-tree basis. The previous behavior 
over-weighted the Gini importance of features that appear in later stages. This 
issue only affected feature importances. #11176 by Gil Forsyth.
{quote}
Full discussion of this error and debate ultimately validating the correctness 
of the change can be found in the comment thread of the scikit-learn pull 
request: [https://github.com/scikit-learn/scikit-learn/pull/11176] 

 

I believe the main change required would be to the featureImportances function 
in mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala , however, I 
do not have the experience to make this change myself.


> Bug in feature importance calculation in GBM (and possibly other decision 
> tree classifiers)
> ---
>
> Key: SPARK-26721
> URL: https://issues.apache.org/jira/browse/SPARK-26721
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Daniel Jumper
>Priority: Blocker
>
> The feature importance calculation in 
> org.apache.spark.ml.classification.GBTClassificationModel.featureImportances 
> follows a flawed implementation from scikit-learn resulting in incorrect 
> importance values. This error was recently discovered and updated in 
> scikit-learn version 0.20.0. This error is inherited in the spark 
> implementation and needs to be fixed here as well.
> As described in the scikit-learn release notes 
> ([https://scikit-learn.org/stable/whats_new.html#version-0-20-0]):
> {quote}Fix Fixed a bug in ensemble.GradientBoostingRegressor and 
> ensemble.GradientBoostingClassifier to have feature importances summed and 
> then normalized, rather than normalizing on a per-tree basis. The previous 
> behavior over-weighted the Gini importance of features that appear in later 
> stages. This issue only affected feature importances. #11176 by Gil Forsyth.
> {quote}
> Full discussion of this error and debate ultimately validating the 
> correctness of the change can be found in the comment thread of the 
> scikit-learn pull request: 
> [https://github.com/scikit-learn/scikit-learn/pull/11176] 
>  
> I believe the main change required would be to the featureImportances 
> function in mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala , 
> however, I do not have the experience to make this change myself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26722) add SPARK_TEST_KEY=1 to pull request builder and spark-master-test-sbt-hadoop-2.7

2019-01-24 Thread shane knapp (JIRA)
shane knapp created SPARK-26722:
---

 Summary: add SPARK_TEST_KEY=1 to pull request builder and 
spark-master-test-sbt-hadoop-2.7
 Key: SPARK-26722
 URL: https://issues.apache.org/jira/browse/SPARK-26722
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: shane knapp
Assignee: shane knapp


from https://github.com/apache/spark/pull/23117:

we need to add the {{SPARK_TEST_KEY=1}} env var to both the GHPRB and 
{{spark-master-test-sbt-hadoop-2.7}} builds.

This is done for the PRB, and was manually added to the 
{{spark-master-test-sbt-hadoop-2.7}} build.

I will leave this open until I finish porting the JJB configs into the main 
spark repo (for the {{spark-master-test-sbt-hadoop-2.7}} build).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26721) Bug in feature importance calculation in GBM (and possibly other decision tree classifiers)

2019-01-24 Thread Daniel Jumper (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751691#comment-16751691
 ] 

Daniel Jumper commented on SPARK-26721:
---

updated the priority to Blocker as this is a correctness issue.

> Bug in feature importance calculation in GBM (and possibly other decision 
> tree classifiers)
> ---
>
> Key: SPARK-26721
> URL: https://issues.apache.org/jira/browse/SPARK-26721
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Daniel Jumper
>Priority: Blocker
>
> The feature importance calculation in 
> org.apache.spark.ml.classification.GBTClassificationModel.featureImportances 
> follows a flawed implementation from scikit-learn. An error was recently 
> discovered and updated in scikit-learn version 0.20.0. This error is 
> inherited in the spark implementation and needs to be fixed here as well.
> As described in the scikit-learn release notes 
> ([https://scikit-learn.org/stable/whats_new.html#version-0-20-0]):
> {quote}Fix Fixed a bug in ensemble.GradientBoostingRegressor and 
> ensemble.GradientBoostingClassifier to have feature importances summed and 
> then normalized, rather than normalizing on a per-tree basis. The previous 
> behavior over-weighted the Gini importance of features that appear in later 
> stages. This issue only affected feature importances. #11176 by Gil Forsyth.
> {quote}
> Full discussion of this error and debate ultimately validating the 
> correctness of the change can be found in the comment thread of the 
> scikit-learn pull request: 
> [https://github.com/scikit-learn/scikit-learn/pull/11176] 
>  
> I believe the main change required would be to the featureImportances 
> function in mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala , 
> however, I do not have the experience to make this change myself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26530) Validate heartheat arguments in HeartbeatReceiver

2019-01-24 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26530.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23445
[https://github.com/apache/spark/pull/23445]

> Validate heartheat arguments in HeartbeatReceiver
> -
>
> Key: SPARK-26530
> URL: https://issues.apache.org/jira/browse/SPARK-26530
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: liupengcheng
>Assignee: liupengcheng
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, heartbeat-related arguments are not validated in Spark, so if these 
> args are improperly specified, the application may run for a while and not 
> fail until the max executor failures are reached (especially with 
> spark.dynamicAllocation.enabled=true), which may waste resources.
> We should do validation before submitting to the cluster.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26721) Bug in feature importance calculation in GBM (and possibly other decision tree classifiers)

2019-01-24 Thread Daniel Jumper (JIRA)
Daniel Jumper created SPARK-26721:
-

 Summary: Bug in feature importance calculation in GBM (and 
possibly other decision tree classifiers)
 Key: SPARK-26721
 URL: https://issues.apache.org/jira/browse/SPARK-26721
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.4.0
Reporter: Daniel Jumper


The feature importance calculation in 
org.apache.spark.ml.classification.GBTClassificationModel.featureImportances 
follows a flawed implementation from scikit-learn. An error was recently 
discovered and updated in scikit-learn version 0.20.0. This error is inherited 
in the spark implementation and needs to be fixed here as well.

As described in the [scikit-learn release 
notes|https://scikit-learn.org/stable/whats_new.html#version-0-20-0]:
{quote}
Fix Fixed a bug in ensemble.GradientBoostingRegressor and 
ensemble.GradientBoostingClassifier to have feature importances summed and then 
normalized, rather than normalizing on a per-tree basis. The previous behavior 
over-weighted the Gini importance of features that appear in later stages. This 
issue only affected feature importances. #11176 by Gil Forsyth.
{quote}
Full discussion of this error and debate ultimately validating the correctness 
of the change can be found in the comment thread of the scikit-learn pull 
request: [https://github.com/scikit-learn/scikit-learn/pull/11176] 

 

I believe the main change required would be to the featureImportances function 
in mllib/src/main/scala/org/apache/spark/ml/tree/treeModels.scala , however, I 
do not have the experience to make this change myself.
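
For intuition, a small self-contained numeric sketch (not Spark's actual code path, which lives in treeModels.scala) contrasting the two normalization orders with two hypothetical trees and two features:
{code}
// Toy illustration only; the numbers are made up to show the effect.
def normalize(v: Array[Double]): Array[Double] = {
  val s = v.sum
  if (s == 0.0) v else v.map(_ / s)
}

// Hypothetical raw importances of two features in two boosting stages.
val tree1 = Array(3.0, 1.0)
val tree2 = Array(0.0, 2.0)

// Previous order: normalize each tree, then average across trees.
val perTreeThenAvg =
  normalize(tree1).zip(normalize(tree2)).map { case (a, b) => (a + b) / 2 }

// Corrected order: sum raw importances across trees, then normalize once.
val sumThenNormalize = normalize(tree1.zip(tree2).map { case (a, b) => a + b })

println(perTreeThenAvg.mkString(", "))   // 0.375, 0.625  (the later, smaller tree dominates)
println(sumThenNormalize.mkString(", ")) // 0.5, 0.5
{code}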



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26720) Remove unused methods from DateTimeUtils

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26720:


Assignee: (was: Apache Spark)

> Remove unused methods from DateTimeUtils
> 
>
> Key: SPARK-26720
> URL: https://issues.apache.org/jira/browse/SPARK-26720
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Trivial
>
> A few methods of the DateTimeUtils object are not used any more in Spark's 
> code base. The ticket aims to remove such methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26720) Remove unused methods from DateTimeUtils

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26720:


Assignee: Apache Spark

> Remove unused methods from DateTimeUtils
> 
>
> Key: SPARK-26720
> URL: https://issues.apache.org/jira/browse/SPARK-26720
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Trivial
>
> A few methods of the DateTimeUtils object are not used any more in Spark's 
> code base. The ticket aims to remove such methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26187) Stream-stream left outer join returns outer nulls for already matched rows

2019-01-24 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751672#comment-16751672
 ] 

Jungtaek Lim commented on SPARK-26187:
--

Thanks [~ChernikovP], your example helped a lot in tracking down the issue. Would 
you mind if we close this one as a duplicate, since SPARK-26154 came earlier and 
its reporter has concerns about it?

> Stream-stream left outer join returns outer nulls for already matched rows
> --
>
> Key: SPARK-26187
> URL: https://issues.apache.org/jira/browse/SPARK-26187
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Pavel Chernikov
>Priority: Major
>
> This is basically the same issue as SPARK-26154, but with a slightly easier to 
> reproduce and more concrete example:
> {code:java}
> val rateStream = session.readStream
>  .format("rate")
>  .option("rowsPerSecond", 1)
>  .option("numPartitions", 1)
>  .load()
> import org.apache.spark.sql.functions._
> val fooStream = rateStream
>  .select(col("value").as("fooId"), col("timestamp").as("fooTime"))
> val barStream = rateStream
>  // Introduce misses for ease of debugging
>  .where(col("value") % 2 === 0)
>  .select(col("value").as("barId"), col("timestamp").as("barTime")){code}
> If barStream is configured to happen earlier than fooStream, based on the time 
> range condition, then everything is all right: no previously matched records 
> are flushed with outer NULLs:
> {code:java}
> val query = fooStream
>  .withWatermark("fooTime", "5 seconds")
>  .join(
>barStream.withWatermark("barTime", "5 seconds"),
>expr("""
>  barId = fooId AND
>  fooTime >= barTime AND
>  fooTime <= barTime + interval 5 seconds
> """),
>joinType = "leftOuter"
>  )
>  .writeStream
>  .format("console")
>  .option("truncate", false)
>  .start(){code}
> It's easy to observe that only odd rows are flushed with NULLs on the right:
> {code:java}
> [info] Batch: 1 
> [info] +-----+-----------------------+-----+-----------------------+ 
> [info] |fooId|fooTime                |barId|barTime                | 
> [info] +-----+-----------------------+-----+-----------------------+ 
> [info] |0    |2018-11-27 13:12:34.976|0    |2018-11-27 13:12:34.976| 
> [info] |6    |2018-11-27 13:12:40.976|6    |2018-11-27 13:12:40.976| 
> [info] |10   |2018-11-27 13:12:44.976|10   |2018-11-27 13:12:44.976| 
> [info] |8    |2018-11-27 13:12:42.976|8    |2018-11-27 13:12:42.976| 
> [info] |2    |2018-11-27 13:12:36.976|2    |2018-11-27 13:12:36.976| 
> [info] |4    |2018-11-27 13:12:38.976|4    |2018-11-27 13:12:38.976| 
> [info] +-----+-----------------------+-----+-----------------------+ 
> [info] Batch: 2 
> [info] +-----+-----------------------+-----+-----------------------+ 
> [info] |fooId|fooTime                |barId|barTime                | 
> [info] +-----+-----------------------+-----+-----------------------+ 
> [info] |1    |2018-11-27 13:12:35.976|null |null                   | 
> [info] |3    |2018-11-27 13:12:37.976|null |null                   | 
> [info] |12   |2018-11-27 13:12:46.976|12   |2018-11-27 13:12:46.976| 
> [info] |18   |2018-11-27 13:12:52.976|18   |2018-11-27 13:12:52.976| 
> [info] |14   |2018-11-27 13:12:48.976|14   |2018-11-27 13:12:48.976| 
> [info] |20   |2018-11-27 13:12:54.976|20   |2018-11-27 13:12:54.976| 
> [info] |16   |2018-11-27 13:12:50.976|16   |2018-11-27 13:12:50.976| 
> [info] +-----+-----------------------+-----+-----------------------+ 
> [info] Batch: 3 
> [info] +-----+-----------------------+-----+-----------------------+ 
> [info] |fooId|fooTime                |barId|barTime                | 
> [info] +-----+-----------------------+-----+-----------------------+ 
> [info] |26   |2018-11-27 13:13:00.976|26   |2018-11-27 13:13:00.976| 
> [info] |22   |2018-11-27 13:12:56.976|22   |2018-11-27 13:12:56.976| 
> [info] |7    |2018-11-27 13:12:41.976|null |null                   | 
> [info] |9    |2018-11-27 13:12:43.976|null |null                   | 
> [info] |28   |2018-11-27 13:13:02.976|28   |2018-11-27 13:13:02.976| 
> [info] |5    |2018-11-27 13:12:39.976|null |null                   | 
> [info] |11   |2018-11-27 13:12:45.976|null |null                   | 
> [info] |13   |2018-11-27 13:12:47.976|null |null                   | 
> [info] |24   |2018-11-27 13:12:58.976|24   |2018-11-27 13:12:58.976| 
> [info] +-----+-----------------------+-----+-----------------------+
> {code}
> On the other hand, if we switch the ordering so that fooStream happens earlier 
> based on the time range condition:
> {code:java}
> val query = fooStream
>  .withWatermark("fooTime", "5 seconds")
>  .join(
>barStream.withWatermark("barTime", "5 seconds"),
>expr("""
>  barId = fooId AND
>  barTime >= fooTime AND
>  

[jira] [Commented] (SPARK-26154) Stream-stream joins - left outer join gives inconsistent output

2019-01-24 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751670#comment-16751670
 ] 

Jungtaek Lim commented on SPARK-26154:
--

Since the issue reporter has concerns about how the duplicated issue is handled, 
I have changed my patch to point to this issue and will mark SPARK-26187 as a 
duplicate.

> Stream-stream joins - left outer join gives inconsistent output
> ---
>
> Key: SPARK-26154
> URL: https://issues.apache.org/jira/browse/SPARK-26154
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.2
> Environment: Spark version - Spark 2.3.2
> OS- Suse 11
>Reporter: Haripriya
>Priority: Major
>
> Stream-stream joins using left outer join give inconsistent output.
> Data that has already been processed once is processed again and yields a null 
> value. In Batch 2, the input data "3" is processed, but in Batch 6 a null value 
> is produced again for the same data.
> Steps
> In spark-shell
> {code:java}
> scala> import org.apache.spark.sql.functions.{col, expr}
> import org.apache.spark.sql.functions.{col, expr}
> scala> import org.apache.spark.sql.streaming.Trigger
> import org.apache.spark.sql.streaming.Trigger
> scala> val lines_stream1 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic1").
>  |   option("includeTimestamp", true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data"),col("timestamp") 
> as("recordTime")).
>  |   select("data","recordTime").
>  |   withWatermark("recordTime", "5 seconds ")
> lines_stream1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data: string, recordTime: timestamp]
> scala> val lines_stream2 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic2").
>  |   option("includeTimestamp", value = true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data1"),col("timestamp") 
> as("recordTime1")).
>  |   select("data1","recordTime1").
>  |   withWatermark("recordTime1", "10 seconds ")
> lines_stream2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data1: string, recordTime1: timestamp]
> scala> val query = lines_stream1.join(lines_stream2, expr (
>  |   """
>  | | data == data1 and
>  | | recordTime1 >= recordTime and
>  | | recordTime1 <= recordTime + interval 5 seconds
>  |   """.stripMargin),"left").
>  |   writeStream.
>  |   option("truncate","false").
>  |   outputMode("append").
>  |   format("console").option("checkpointLocation", 
> "/tmp/leftouter/").
>  |   trigger(Trigger.ProcessingTime ("5 seconds")).
>  |   start()
> query: org.apache.spark.sql.streaming.StreamingQuery = 
> org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1a48f55b
> {code}
> Step2 : Start producing data
> kafka-console-producer.sh --broker-list ip:9092 --topic topic1
>  >1
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >bb
>  >cc
> kafka-console-producer.sh --broker-list ip:9092 --topic topic2
>  >2
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >cc
>  >ee
>  >ee
>  
> Output obtained:
> {code:java}
> Batch: 0
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 1
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 2
> ---
> ++---+-+---+
> |data|recordTime |data1|recordTime1|
> ++---+-+---+
> |3   |2018-11-22 20:09:35.053|3|2018-11-22 20:09:36.506|
> |2   |2018-11-22 20:09:31.613|2|2018-11-22 20:09:33.116|
> ++---+-+---+
> ---
> Batch: 3
> ---
> ++---+-+---+
> |data|recordTime |data1|recordTime1|
> 

[jira] [Created] (SPARK-26720) Remove unused methods from DateTimeUtils

2019-01-24 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26720:
--

 Summary: Remove unused methods from DateTimeUtils
 Key: SPARK-26720
 URL: https://issues.apache.org/jira/browse/SPARK-26720
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


A few methods of the DateTimeUtils object are not used any more in Spark's code 
base. The ticket aims to remove such methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26154) Stream-stream joins - left outer join gives inconsistent output

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26154:


Assignee: Apache Spark

> Stream-stream joins - left outer join gives inconsistent output
> ---
>
> Key: SPARK-26154
> URL: https://issues.apache.org/jira/browse/SPARK-26154
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.2
> Environment: Spark version - Spark 2.3.2
> OS- Suse 11
>Reporter: Haripriya
>Assignee: Apache Spark
>Priority: Major
>
> Stream-stream joins using left outer join give inconsistent output.
> Data that has already been processed once is processed again and yields a null 
> value. In Batch 2, the input data "3" is processed, but in Batch 6 a null value 
> is produced again for the same data.
> Steps
> In spark-shell
> {code:java}
> scala> import org.apache.spark.sql.functions.{col, expr}
> import org.apache.spark.sql.functions.{col, expr}
> scala> import org.apache.spark.sql.streaming.Trigger
> import org.apache.spark.sql.streaming.Trigger
> scala> val lines_stream1 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic1").
>  |   option("includeTimestamp", true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data"),col("timestamp") 
> as("recordTime")).
>  |   select("data","recordTime").
>  |   withWatermark("recordTime", "5 seconds ")
> lines_stream1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data: string, recordTime: timestamp]
> scala> val lines_stream2 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic2").
>  |   option("includeTimestamp", value = true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data1"),col("timestamp") 
> as("recordTime1")).
>  |   select("data1","recordTime1").
>  |   withWatermark("recordTime1", "10 seconds ")
> lines_stream2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data1: string, recordTime1: timestamp]
> scala> val query = lines_stream1.join(lines_stream2, expr (
>  |   """
>  | | data == data1 and
>  | | recordTime1 >= recordTime and
>  | | recordTime1 <= recordTime + interval 5 seconds
>  |   """.stripMargin),"left").
>  |   writeStream.
>  |   option("truncate","false").
>  |   outputMode("append").
>  |   format("console").option("checkpointLocation", 
> "/tmp/leftouter/").
>  |   trigger(Trigger.ProcessingTime ("5 seconds")).
>  |   start()
> query: org.apache.spark.sql.streaming.StreamingQuery = 
> org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1a48f55b
> {code}
> Step2 : Start producing data
> kafka-console-producer.sh --broker-list ip:9092 --topic topic1
>  >1
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >bb
>  >cc
> kafka-console-producer.sh --broker-list ip:9092 --topic topic2
>  >2
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >cc
>  >ee
>  >ee
>  
> Output obtained:
> {code:java}
> Batch: 0
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 1
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 2
> ---
> ++---+-+---+
> |data|recordTime |data1|recordTime1|
> ++---+-+---+
> |3   |2018-11-22 20:09:35.053|3|2018-11-22 20:09:36.506|
> |2   |2018-11-22 20:09:31.613|2|2018-11-22 20:09:33.116|
> ++---+-+---+
> ---
> Batch: 3
> ---
> ++---+-+---+
> |data|recordTime |data1|recordTime1|
> ++---+-+---+
> |4   |2018-11-22 20:09:38.654|4|2018-11-22 20:09:39.818|
> 

[jira] [Commented] (SPARK-26154) Stream-stream joins - left outer join gives inconsistent output

2019-01-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751665#comment-16751665
 ] 

Apache Spark commented on SPARK-26154:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/23634

> Stream-stream joins - left outer join gives inconsistent output
> ---
>
> Key: SPARK-26154
> URL: https://issues.apache.org/jira/browse/SPARK-26154
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.2
> Environment: Spark version - Spark 2.3.2
> OS- Suse 11
>Reporter: Haripriya
>Priority: Major
>
> Stream-stream joins using left outer join give inconsistent output.
> Data that has already been processed once is processed again and yields a null 
> value. In Batch 2, the input data "3" is processed, but in Batch 6 a null value 
> is produced again for the same data.
> Steps
> In spark-shell
> {code:java}
> scala> import org.apache.spark.sql.functions.{col, expr}
> import org.apache.spark.sql.functions.{col, expr}
> scala> import org.apache.spark.sql.streaming.Trigger
> import org.apache.spark.sql.streaming.Trigger
> scala> val lines_stream1 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic1").
>  |   option("includeTimestamp", true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data"),col("timestamp") 
> as("recordTime")).
>  |   select("data","recordTime").
>  |   withWatermark("recordTime", "5 seconds ")
> lines_stream1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data: string, recordTime: timestamp]
> scala> val lines_stream2 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic2").
>  |   option("includeTimestamp", value = true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data1"),col("timestamp") 
> as("recordTime1")).
>  |   select("data1","recordTime1").
>  |   withWatermark("recordTime1", "10 seconds ")
> lines_stream2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data1: string, recordTime1: timestamp]
> scala> val query = lines_stream1.join(lines_stream2, expr (
>  |   """
>  | | data == data1 and
>  | | recordTime1 >= recordTime and
>  | | recordTime1 <= recordTime + interval 5 seconds
>  |   """.stripMargin),"left").
>  |   writeStream.
>  |   option("truncate","false").
>  |   outputMode("append").
>  |   format("console").option("checkpointLocation", 
> "/tmp/leftouter/").
>  |   trigger(Trigger.ProcessingTime ("5 seconds")).
>  |   start()
> query: org.apache.spark.sql.streaming.StreamingQuery = 
> org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1a48f55b
> {code}
> Step2 : Start producing data
> kafka-console-producer.sh --broker-list ip:9092 --topic topic1
>  >1
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >bb
>  >cc
> kafka-console-producer.sh --broker-list ip:9092 --topic topic2
>  >2
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >cc
>  >ee
>  >ee
>  
> Output obtained:
> {code:java}
> Batch: 0
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 1
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 2
> ---
> ++---+-+---+
> |data|recordTime |data1|recordTime1|
> ++---+-+---+
> |3   |2018-11-22 20:09:35.053|3|2018-11-22 20:09:36.506|
> |2   |2018-11-22 20:09:31.613|2|2018-11-22 20:09:33.116|
> ++---+-+---+
> ---
> Batch: 3
> ---
> ++---+-+---+
> |data|recordTime |data1|recordTime1|
> ++---+-+---+
> |4   |2018-11-22 

[jira] [Assigned] (SPARK-26154) Stream-stream joins - left outer join gives inconsistent output

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26154:


Assignee: (was: Apache Spark)

> Stream-stream joins - left outer join gives inconsistent output
> ---
>
> Key: SPARK-26154
> URL: https://issues.apache.org/jira/browse/SPARK-26154
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.2
> Environment: Spark version - Spark 2.3.2
> OS- Suse 11
>Reporter: Haripriya
>Priority: Major
>
> Stream-stream joins using left outer join give inconsistent output.
> Data that has already been processed once is processed again and yields a null 
> value. In Batch 2, the input data "3" is processed, but in Batch 6 a null value 
> is produced again for the same data.
> Steps
> In spark-shell
> {code:java}
> scala> import org.apache.spark.sql.functions.{col, expr}
> import org.apache.spark.sql.functions.{col, expr}
> scala> import org.apache.spark.sql.streaming.Trigger
> import org.apache.spark.sql.streaming.Trigger
> scala> val lines_stream1 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic1").
>  |   option("includeTimestamp", true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data"),col("timestamp") 
> as("recordTime")).
>  |   select("data","recordTime").
>  |   withWatermark("recordTime", "5 seconds ")
> lines_stream1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data: string, recordTime: timestamp]
> scala> val lines_stream2 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic2").
>  |   option("includeTimestamp", value = true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data1"),col("timestamp") 
> as("recordTime1")).
>  |   select("data1","recordTime1").
>  |   withWatermark("recordTime1", "10 seconds ")
> lines_stream2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data1: string, recordTime1: timestamp]
> scala> val query = lines_stream1.join(lines_stream2, expr (
>  |   """
>  | | data == data1 and
>  | | recordTime1 >= recordTime and
>  | | recordTime1 <= recordTime + interval 5 seconds
>  |   """.stripMargin),"left").
>  |   writeStream.
>  |   option("truncate","false").
>  |   outputMode("append").
>  |   format("console").option("checkpointLocation", 
> "/tmp/leftouter/").
>  |   trigger(Trigger.ProcessingTime ("5 seconds")).
>  |   start()
> query: org.apache.spark.sql.streaming.StreamingQuery = 
> org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1a48f55b
> {code}
> Step2 : Start producing data
> kafka-console-producer.sh --broker-list ip:9092 --topic topic1
>  >1
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >bb
>  >cc
> kafka-console-producer.sh --broker-list ip:9092 --topic topic2
>  >2
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >cc
>  >ee
>  >ee
>  
> Output obtained:
> {code:java}
> Batch: 0
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 1
> ---
> ++--+-+---+
> |data|recordTime|data1|recordTime1|
> ++--+-+---+
> ++--+-+---+
> ---
> Batch: 2
> ---
> ++---+-+---+
> |data|recordTime |data1|recordTime1|
> ++---+-+---+
> |3   |2018-11-22 20:09:35.053|3|2018-11-22 20:09:36.506|
> |2   |2018-11-22 20:09:31.613|2|2018-11-22 20:09:33.116|
> ++---+-+---+
> ---
> Batch: 3
> ---
> ++---+-+---+
> |data|recordTime |data1|recordTime1|
> ++---+-+---+
> |4   |2018-11-22 20:09:38.654|4|2018-11-22 20:09:39.818|
> ++---+-+---+
> 

[jira] [Commented] (SPARK-26711) JSON Schema inference takes 15 times longer

2019-01-24 Thread Bruce Robbins (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751655#comment-16751655
 ] 

Bruce Robbins commented on SPARK-26711:
---

Re: 7 minutes vs. 50 seconds:

Looking at the code, it appears the difference is this:

Before the timestamp inference change, options.prefersDecimal was checked 
before attempting to convert the String to a BigDecimal. If 
options.prefersDecimal was disabled, we did not bother with the conversion.

After the timestamp inference change, we always attempt to convert the String 
to a BigDecimal regardless of the setting of options.prefersDecimal (we still 
use options.prefersDecimal to determine what type to return).

My guess is that attempting to convert every string to a BigDecimal is very 
expensive.
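
A rough sketch of the kind of guard described above (hypothetical helper and 
parameter names, not the actual JsonInferSchema code):
{code:java}
import java.math.BigDecimal

import org.apache.spark.sql.types._

// Hypothetical helper: only pay for the BigDecimal parse when prefersDecimal is on.
def inferDecimalIfPreferred(value: String, prefersDecimal: Boolean): Option[DataType] = {
  if (!prefersDecimal) {
    None // old behavior: skip the expensive conversion entirely
  } else {
    try {
      val d = new BigDecimal(value)
      Some(DecimalType(Math.max(d.precision, d.scale), d.scale))
    } catch {
      case _: NumberFormatException => None
    }
  }
}
{code}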

> JSON Schema inference takes 15 times longer
> ---
>
> Key: SPARK-26711
> URL: https://issues.apache.org/jira/browse/SPARK-26711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Priority: Major
>
> I noticed that the first benchmark/case of JSONBenchmark ("JSON schema 
> inferring", "No encoding") was taking an hour to run, when it used to run in 
> 4-5 minutes.
> The culprit seems to be this commit: 
> [https://github.com/apache/spark/commit/d72571e51d]
> A quick look using a profiler, and it seems to be spending 99% of its time 
> doing some kind of exception handling in JsonInferSchema.scala.
> You can reproduce in the spark-shell by recreating the data used by the 
> benchmark
> {noformat}
> scala> :paste
> val rowsNum = 100 * 1000 * 1000
> spark.sparkContext.range(0, rowsNum, 1)
> .map(_ => "a")
> .toDF("fieldA")
> .write
> .option("encoding", "UTF-8")
> .json("utf8.json")
> // Entering paste mode (ctrl-D to finish)
> // Exiting paste mode, now interpreting.
> rowsNum: Int = 100000000
> scala> 
> {noformat}
> Then you can run the test by hand starting spark-shell as so (emulating 
> SqlBasedBenchmark):
> {noformat}
>  bin/spark-shell --driver-memory 8g \
>   --conf "spark.sql.autoBroadcastJoinThreshold=1" \
>   --conf "spark.sql.shuffle.partitions=1" --master "local[1]"
> {noformat}
> On commit d72571e51d:
> {noformat}
> scala> val start = System.currentTimeMillis; spark.read.json("utf8.json");  
> System.currentTimeMillis-start
> start: Long = 1548297682225
> res0: Long = 815978 <== 13.6 minutes
> scala>
> {noformat}
> On the previous commit (86100df54b):
> {noformat}
> scala> val start = System.currentTimeMillis; spark.read.json("utf8.json");  
> System.currentTimeMillis-start
> start: Long = 1548298927151
> res0: Long = 50087 <= 50 seconds
> scala> 
> {noformat}
> I also tried {{spark.read.option("inferTimestamp", 
> false).json("utf8.json")}}, but that option didn't seem to make a difference 
> in run time. Edit: {{inferTimestamp}} does, in fact, have an impact: It 
> halves the run time. However, that means even with {{inferTimestamp}}, the 
> run time is still 7 times slower than before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26682) Task attempt ID collision causes lost data

2019-01-24 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751651#comment-16751651
 ] 

Shixiong Zhu commented on SPARK-26682:
--

For future reference: data loss could happen when one task modifies another 
task's temp output file.
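
A tiny, purely illustrative sketch (hypothetical path layout, not the actual 
committer code) of why reusing the attempt number across stage attempts is 
dangerous:
{code:java}
// If the temp location depends only on (partition, attemptNumber), two attempts of
// the same partition from different stage attempts collide on the same directory.
def tempDir(output: String, partition: Int, attemptNumber: Int): String =
  s"$output/_temporary/attempt_${partition}_$attemptNumber"

val first  = tempDir("/data/out", partition = 3, attemptNumber = 0) // original stage attempt
val second = tempDir("/data/out", partition = 3, attemptNumber = 0) // resubmitted stage attempt
assert(first == second) // same path: one attempt can delete or overwrite the other's files
{code}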

> Task attempt ID collision causes lost data
> --
>
> Key: SPARK-26682
> URL: https://issues.apache.org/jira/browse/SPARK-26682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.3.2, 2.4.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Blocker
>  Labels: data-loss
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> We recently tracked missing data to a collision in the fake Hadoop task 
> attempt ID created when using Hadoop OutputCommitters. This is similar to 
> SPARK-24589.
> A stage had one task fail to get one shard from a shuffle, causing a 
> FetchFailedException and Spark resubmitted the stage. Because only one task 
> was affected, the original stage attempt continued running tasks that had 
> been resubmitted. Another task ran two attempts concurrently on the same 
> executor, but had the same attempt number because they were from different 
> stage attempts. Because the attempt number was the same, the task used the 
> same temp locations. That caused one attempt to fail because a file path 
> already existed, and that attempt then removed the shared temp location and 
> deleted the other task's data. When the second attempt succeeded, it 
> committed partial data.
> The problem was that both attempts had the same partition and attempt 
> numbers, despite being run in different stages, and that was used to create a 
> Hadoop task attempt ID on which the temp location was based. The fix is to 
> use Spark's global task attempt ID, which is a counter, instead of attempt 
> number because attempt number is reused in stage attempts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26682) Task attempt ID collision causes lost data

2019-01-24 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-26682:
---
Fix Version/s: 2.3.3

> Task attempt ID collision causes lost data
> --
>
> Key: SPARK-26682
> URL: https://issues.apache.org/jira/browse/SPARK-26682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.3.2, 2.4.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Blocker
>  Labels: data-loss
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> We recently tracked missing data to a collision in the fake Hadoop task 
> attempt ID created when using Hadoop OutputCommitters. This is similar to 
> SPARK-24589.
> A stage had one task fail to get one shard from a shuffle, causing a 
> FetchFailedException and Spark resubmitted the stage. Because only one task 
> was affected, the original stage attempt continued running tasks that had 
> been resubmitted. Another task ran two attempts concurrently on the same 
> executor, but had the same attempt number because they were from different 
> stage attempts. Because the attempt number was the same, the task used the 
> same temp locations. That caused one attempt to fail because a file path 
> already existed, and that attempt then removed the shared temp location and 
> deleted the other task's data. When the second attempt succeeded, it 
> committed partial data.
> The problem was that both attempts had the same partition and attempt 
> numbers, despite being run in different stages, and that was used to create a 
> Hadoop task attempt ID on which the temp location was based. The fix is to 
> use Spark's global task attempt ID, which is a counter, instead of attempt 
> number because attempt number is reused in stage attempts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26654) Use Timestamp/DateFormatter in CatalogColumnStat

2019-01-24 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751634#comment-16751634
 ] 

Maxim Gekk commented on SPARK-26654:


[~cloud_fan][~hvanhovell][~srowen] I do believe that saving statistics for 
TimestampType columns without a time zone can produce inaccurate results if the 
statistics are read back in a Spark session with a different time zone, so it can 
badly impact planning. This could be fixed by adding the time zone during 
serialization of TimestampType columns, but that would change the timestamp 
format (and old versions of Spark could not read the statistics back unless their 
version is changed), or by storing the original time zone separately together 
with the statistics.
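
A small illustration (plain java.time, not Spark code) of the ambiguity: the same 
zone-less timestamp string maps to different instants depending on the session 
time zone:
{code:java}
import java.time.{LocalDateTime, ZoneId}
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val raw = "2019-01-24 00:00:00" // a column stat serialized without a time zone

val inUtc = LocalDateTime.parse(raw, fmt).atZone(ZoneId.of("UTC")).toInstant
val inPst = LocalDateTime.parse(raw, fmt).atZone(ZoneId.of("America/Los_Angeles")).toInstant

// The two sessions reconstruct different instants, so min/max stats shift by the offset.
println(inUtc == inPst) // false
{code}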

> Use Timestamp/DateFormatter in CatalogColumnStat
> 
>
> Key: SPARK-26654
> URL: https://issues.apache.org/jira/browse/SPARK-26654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Need to switch fromExternalString on Timestamp/DateFormatters, in particular:
> https://github.com/apache/spark/blob/3b7395fe025a4c9a591835e53ac6ca05be6868f1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L481-L482



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26718) structured streaming fetched wrong current offset from kafka

2019-01-24 Thread Ryne Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751629#comment-16751629
 ] 

Ryne Yang commented on SPARK-26718:
---

A simple fix would be an if statement that checks whether the integer addition 
overflowed, and then throws an exception to notify the user.
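
For example, a possible guard could look like the sketch below (assuming the 
overflow happens in the begin + prorate addition; this is not the actual patch):
{code:java}
// Fail fast instead of silently wrapping around when begin + step exceeds Long.MaxValue.
def safeAdvance(begin: Long, prorate: Double): Long = {
  val step = (if (prorate < 1) Math.ceil(prorate) else Math.floor(prorate)).toLong
  try {
    Math.addExact(begin, step)
  } catch {
    case _: ArithmeticException =>
      throw new IllegalStateException(
        s"Kafka offset computation overflowed: begin=$begin, prorate=$prorate")
  }
}
{code}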

> structured streaming fetched wrong current offset from kafka
> 
>
> Key: SPARK-26718
> URL: https://issues.apache.org/jira/browse/SPARK-26718
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Ryne Yang
>Priority: Major
>
> when running spark structured streaming using lib: `"org.apache.spark" %% 
> "spark-sql-kafka-0-10" % "2.4.0"`, we keep getting error regarding current 
> offset fetching:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 0.0 (TID 3, qa2-hdp-4.acuityads.org, executor 2): 
> java.lang.AssertionError: assertion failed: latest offs
> et -9223372036854775808 does not equal -1
> at scala.Predef$.assert(Predef.scala:170)
> at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.resolveRange(KafkaMicroBatchReader.scala:371)
> at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.<init>(KafkaMicroBatchReader.scala:329)
> at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartition.createPartitionReader(KafkaMicroBatchReader.scala:314)
> at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:42)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> for some reason, looks like fetchLatestOffset returned a Long.MIN_VALUE for 
> one of the partitions. I checked the structured streaming checkpoint, that 
> was correct, it's the currentAvailableOffset was set to Long.MIN_VALUE.
> kafka broker version: 1.1.0.
> lib we used:
> {{libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.0"}}
> how to reproduce:
> basically we started a structured streamer and subscribed a topic of 4 
> partitions. then produced some messages into topic, job crashed and logged 
> the stacktrace like above.
> also the committed offsets seem fine as we see in the logs: 
> {code:java}
> === Streaming Query ===
> Identifier: [id = c46c67ee-3514-4788-8370-a696837b21b1, runId = 
> 31878627-d473-4ee8-955d-d4d3f3f45eb9]
> Current Committed Offsets: {KafkaV2[Subscribe[REVENUEEVENT]]: 
> {"REVENUEEVENT":{"0":1}}}
> Current Available Offsets: {KafkaV2[Subscribe[REVENUEEVENT]]: 
> {"REVENUEEVENT":{"0":-9223372036854775808}}}
> {code}
> so spark streaming recorded the correct value for partition: 0, but the 
> current available offsets returned from kafka is showing Long.MIN_VALUE. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26718) structured streaming fetched wrong current offset from kafka

2019-01-24 Thread Ryne Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751625#comment-16751625
 ] 

Ryne Yang commented on SPARK-26718:
---

Found the issue: it's the rateLimit calculation. If anyone sets 
`maxOffsetsPerTrigger` to Long.MaxValue, this can happen.

In KafkaMicroBatchReader.scala:
{code:java}
private def rateLimit(
    limit: Long,
    from: PartitionOffsetMap,
    until: PartitionOffsetMap): PartitionOffsetMap = {
  val fromNew = kafkaOffsetReader.fetchEarliestOffsets(until.keySet.diff(from.keySet).toSeq)
  val sizes = until.flatMap {
    case (tp, end) =>
      // If begin isn't defined, something's wrong, but let alert logic in getBatch handle it
      from.get(tp).orElse(fromNew.get(tp)).flatMap { begin =>
        val size = end - begin
        logDebug(s"rateLimit $tp size is $size")
        if (size > 0) Some(tp -> size) else None
      }
  }
  val total = sizes.values.sum.toDouble
  if (total < 1) {
    until
  } else {
    until.map {
      case (tp, end) =>
        tp -> sizes.get(tp).map { size =>
          val begin = from.get(tp).getOrElse(fromNew(tp))
          val prorate = limit * (size / total)
          // Don't completely starve small topicpartitions
          val off = begin + (if (prorate < 1) Math.ceil(prorate) else Math.floor(prorate)).toLong
          // Paranoia, make sure not to return an offset that's past end
          Math.min(end, off)
        }.getOrElse(end)
    }
  }
}
{code}
This rateLimit function is where the trouble is: if limit is set to 
Long.MaxValue, the offset computation can overflow, as shown in the example below:
{code:java}
val begin = 100
val limit = Long.MaxValue
val size = 5933L
val total = 5933L.toDouble

val prorate = limit * (size / total)
// Don't completely starve small topicpartitions
val off = begin + (if (prorate < 1) Math.ceil(prorate) else Math.floor(prorate)).toLong

println(off)

// prints -9223372036854775709
{code}
The root cause is `limit * (size / total)`: the multiplication loses precision in 
Double (and converting the result back to Long saturates at Long.MaxValue), and 
then `begin + Math.floor(prorate).toLong` overflows.
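
One possible way to avoid the wrap-around (a sketch, not the actual fix) is to 
clamp the prorated advance against the partition's end offset before doing the 
addition:
{code:java}
// begin/end are the partition's current and latest offsets; prorate is the Double advance.
def boundedOffset(begin: Long, end: Long, prorate: Double): Long = {
  val step = if (prorate < 1) Math.ceil(prorate) else Math.floor(prorate)
  // A too-large Double saturates at Long.MaxValue on conversion, so compare in Double first.
  if (step >= (end - begin).toDouble) end
  else Math.min(end, begin + step.toLong)
}
{code}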

 

> structured streaming fetched wrong current offset from kafka
> 
>
> Key: SPARK-26718
> URL: https://issues.apache.org/jira/browse/SPARK-26718
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Ryne Yang
>Priority: Major
>
> when running spark structured streaming using lib: `"org.apache.spark" %% 
> "spark-sql-kafka-0-10" % "2.4.0"`, we keep getting error regarding current 
> offset fetching:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 0.0 (TID 3, qa2-hdp-4.acuityads.org, executor 2): 
> java.lang.AssertionError: assertion failed: latest offs
> et -9223372036854775808 does not equal -1
> at scala.Predef$.assert(Predef.scala:170)
> at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.resolveRange(KafkaMicroBatchReader.scala:371)
> at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.<init>(KafkaMicroBatchReader.scala:329)
> at 
> org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartition.createPartitionReader(KafkaMicroBatchReader.scala:314)
> at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:42)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> for some reason, looks like fetchLatestOffset returned a Long.MIN_VALUE for 
> one of the partitions. I checked the structured streaming checkpoint, that 
> 

[jira] [Commented] (SPARK-25767) Error reported in Spark logs when using the org.apache.spark:spark-sql_2.11:2.3.2 Java library

2019-01-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751605#comment-16751605
 ] 

Apache Spark commented on SPARK-25767:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/23642

> Error reported in Spark logs when using the 
> org.apache.spark:spark-sql_2.11:2.3.2 Java library
> --
>
> Key: SPARK-25767
> URL: https://issues.apache.org/jira/browse/SPARK-25767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.2
>Reporter: Thomas Brugiere
>Assignee: Peter Toth
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> Hi,
> Here is a bug I found using the latest version of spark-sql_2.11:2.2.0. Note 
> that this case was also tested with spark-sql_2.11:2.3.2 and the bug is also 
> present.
> This issue is a duplicate of the SPARK-25582 issue that I had to close after 
> an accidental manipulation from another developer (was linked to a wrong PR)
> You will find attached three small sample CSV files with the minimal content 
> to raise the bug.
> Find below a reproducer code:
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import scala.collection.JavaConverters;
> import scala.collection.Seq;
> import java.util.Arrays;
> public class SparkBug {
> private static <T> Seq<T> arrayToSeq(T[] input) {
> return 
> JavaConverters.asScalaIteratorConverter(Arrays.asList(input).iterator()).asScala().toSeq();
> }
> public static void main(String[] args) throws Exception {
> SparkConf conf = new 
> SparkConf().setAppName("SparkBug").setMaster("local");
> SparkSession sparkSession = 
> SparkSession.builder().config(conf).getOrCreate();
> Dataset<Row> df_a = sparkSession.read().option("header", 
> true).csv("local/fileA.csv").dropDuplicates();
> Dataset<Row> df_b = sparkSession.read().option("header", 
> true).csv("local/fileB.csv").dropDuplicates();
> Dataset<Row> df_c = sparkSession.read().option("header", 
> true).csv("local/fileC.csv").dropDuplicates();
> String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", 
> "colE", "colF"};
> String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", 
> "colE"};
> Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), 
> "left");
> Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, 
> arrayToSeq(key_join_2), "left");
> df_inventory_2.show();
> }
> }
> {code}
> When running this code, I can see the exception below:
> {code:java}
> 18/10/18 09:25:49 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>     at 

[jira] [Commented] (SPARK-26608) Remove Jenkins jobs for `branch-2.2`

2019-01-24 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751597#comment-16751597
 ] 

Dongjoon Hyun commented on SPARK-26608:
---

Thank you, [~shaneknapp]! :D

> Remove Jenkins jobs for `branch-2.2`
> 
>
> Key: SPARK-26608
> URL: https://issues.apache.org/jira/browse/SPARK-26608
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.2.3
>Reporter: Dongjoon Hyun
>Assignee: shane knapp
>Priority: Major
> Attachments: Screen Shot 2019-01-11 at 8.47.27 PM.png, Screen Shot 
> 2019-01-14 at 11.39.05 AM.png, Screen Shot 2019-01-14 at 11.42.46 AM.png, 
> Screen Shot 2019-01-22 at 12.47.43 PM.png
>
>
> This issue aims to remove the following Jenkins jobs for `branch-2.2` because 
> of EOL.
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.7/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.7/]
> As of today, the branch is healthy.
> !Screen Shot 2019-01-11 at 8.47.27 PM.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26608) Remove Jenkins jobs for `branch-2.2`

2019-01-24 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp resolved SPARK-26608.
-
Resolution: Fixed

> Remove Jenkins jobs for `branch-2.2`
> 
>
> Key: SPARK-26608
> URL: https://issues.apache.org/jira/browse/SPARK-26608
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.2.3
>Reporter: Dongjoon Hyun
>Assignee: shane knapp
>Priority: Major
> Attachments: Screen Shot 2019-01-11 at 8.47.27 PM.png, Screen Shot 
> 2019-01-14 at 11.39.05 AM.png, Screen Shot 2019-01-14 at 11.42.46 AM.png, 
> Screen Shot 2019-01-22 at 12.47.43 PM.png
>
>
> This issue aims to remove the following Jenkins jobs for `branch-2.2` because 
> of EOL.
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.7/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.7/]
> As of today, the branch is healthy.
> !Screen Shot 2019-01-11 at 8.47.27 PM.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26608) Remove Jenkins jobs for `branch-2.2`

2019-01-24 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751588#comment-16751588
 ] 

shane knapp commented on SPARK-26608:
-

alright, config changes are merged...  i'm going to delete these jobs on 
jenkins and resolve this issue.

> Remove Jenkins jobs for `branch-2.2`
> 
>
> Key: SPARK-26608
> URL: https://issues.apache.org/jira/browse/SPARK-26608
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.2.3
>Reporter: Dongjoon Hyun
>Assignee: shane knapp
>Priority: Major
> Attachments: Screen Shot 2019-01-11 at 8.47.27 PM.png, Screen Shot 
> 2019-01-14 at 11.39.05 AM.png, Screen Shot 2019-01-14 at 11.42.46 AM.png, 
> Screen Shot 2019-01-22 at 12.47.43 PM.png
>
>
> This issue aims to remove the following Jenkins jobs for `branch-2.2` because 
> of EOL.
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.7/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.7/]
> As of today, the branch is healthy.
> !Screen Shot 2019-01-11 at 8.47.27 PM.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26719) Get rid of java.util.Calendar in DateTimeUtils

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26719:


Assignee: Apache Spark

> Get rid of java.util.Calendar in DateTimeUtils
> --
>
> Key: SPARK-26719
> URL: https://issues.apache.org/jira/browse/SPARK-26719
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> java.util.Calendar uses the hybrid calendar (Julian+Gregorian). The ticket 
> aims to replace it by classes from the java.time namespace that are based on 
> Proleptic Gregorian calendar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26719) Get rid of java.util.Calendar in DateTimeUtils

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26719:


Assignee: (was: Apache Spark)

> Get rid of java.util.Calendar in DateTimeUtils
> --
>
> Key: SPARK-26719
> URL: https://issues.apache.org/jira/browse/SPARK-26719
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> java.util.Calendar uses the hybrid calendar (Julian+Gregorian). The ticket 
> aims to replace it by classes from the java.time namespace that are based on 
> Proleptic Gregorian calendar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26719) Get rid of java.util.Calendar in DateTimeUtils

2019-01-24 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26719:
--

 Summary: Get rid of java.util.Calendar in DateTimeUtils
 Key: SPARK-26719
 URL: https://issues.apache.org/jira/browse/SPARK-26719
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


java.util.Calendar uses the hybrid calendar (Julian+Gregorian). The ticket aims 
to replace it by classes from the java.time namespace that are based on 
Proleptic Gregorian calendar.
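
A small illustration (not Spark code) of how the two calendars disagree for dates 
before the Gregorian cutover:
{code:java}
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}

// java.util.GregorianCalendar applies Julian rules before 1582-10-15 (hybrid calendar),
// while java.time.LocalDate always uses the proleptic Gregorian calendar.
val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
cal.clear()
cal.set(1000, 0, 1) // 1000-01-01 under the hybrid calendar
val hybridMillis = cal.getTimeInMillis

val prolepticMillis = LocalDate.of(1000, 1, 1).toEpochDay * 86400000L

// The same "date" maps to different epoch millis, which is the source of subtle shifts.
println(hybridMillis == prolepticMillis) // false
{code}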



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26718) structured streaming fetched wrong current offset from kafka

2019-01-24 Thread Ryne Yang (JIRA)
Ryne Yang created SPARK-26718:
-

 Summary: structured streaming fetched wrong current offset from 
kafka
 Key: SPARK-26718
 URL: https://issues.apache.org/jira/browse/SPARK-26718
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Ryne Yang


when running spark structured streaming using lib: `"org.apache.spark" %% 
"spark-sql-kafka-0-10" % "2.4.0"`, we keep getting error regarding current 
offset fetching:
{code:java}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
0.0 (TID 3, qa2-hdp-4.acuityads.org, executor 2): java.lang.AssertionError: 
assertion failed: latest offs
et -9223372036854775808 does not equal -1
at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.resolveRange(KafkaMicroBatchReader.scala:371)
at 
org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.<init>(KafkaMicroBatchReader.scala:329)
at 
org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartition.createPartitionReader(KafkaMicroBatchReader.scala:314)
at 
org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:42)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
for some reason, looks like fetchLatestOffset returned a Long.MIN_VALUE for one 
of the partitions. I checked the structured streaming checkpoint, that was 
correct, it's the currentAvailableOffset was set to Long.MIN_VALUE.

kafka broker version: 1.1.0.
lib we used:

{{libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.0"}}

how to reproduce:
basically we started a structured streamer and subscribed a topic of 4 
partitions. then produced some messages into topic, job crashed and logged the 
stacktrace like above.

also the committed offsets seem fine as we see in the logs: 
{code:java}
=== Streaming Query ===
Identifier: [id = c46c67ee-3514-4788-8370-a696837b21b1, runId = 
31878627-d473-4ee8-955d-d4d3f3f45eb9]
Current Committed Offsets: {KafkaV2[Subscribe[REVENUEEVENT]]: 
{"REVENUEEVENT":{"0":1}}}
Current Available Offsets: {KafkaV2[Subscribe[REVENUEEVENT]]: 
{"REVENUEEVENT":{"0":-9223372036854775808}}}
{code}
so spark streaming recorded the correct value for partition: 0, but the current 
available offsets returned from kafka is showing Long.MIN_VALUE. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25590) kubernetes-model-2.0.0.jar masks default Spark logging config

2019-01-24 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751425#comment-16751425
 ] 

Marcelo Vanzin commented on SPARK-25590:


I filed https://github.com/fabric8io/kubernetes-client/issues/1327 for the fix 
on the fabric8 libs.

> kubernetes-model-2.0.0.jar masks default Spark logging config
> -
>
> Key: SPARK-25590
> URL: https://issues.apache.org/jira/browse/SPARK-25590
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> That jar file, which is packaged when the k8s profile is enabled, has a log4j 
> configuration embedded in it:
> {noformat}
> $ jar tf /path/to/kubernetes-model-2.0.0.jar | grep log4j
> log4j.properties
> {noformat}
> What this causes is that Spark will always use that log4j configuration 
> instead of its own default (log4j-defaults.properties), unless the user 
> overrides it by somehow adding their own in the classpath before the 
> kubernetes one.
> You can see that by running spark-shell. With the k8s jar in:
> {noformat}
> $ ./bin/spark-shell 
> ...
> Setting default log level to "WARN"
> {noformat}
> Removing the k8s jar:
> {noformat}
> $ ./bin/spark-shell 
> ...
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> {noformat}
> The proper fix would be for the k8s jar to not ship that file, and then just 
> upgrade the dependency in Spark, but if there's something easy we can do in 
> the meantime...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18484) case class datasets - ability to specify decimal precision and scale

2019-01-24 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751422#comment-16751422
 ] 

Marco Gaido commented on SPARK-18484:
-

[~bonazzaf] please do not delete comments, as they may be useful for other 
people with the same doubts.

That's right: there is currently no way to specify the precision and scale of a 
{{BigDecimal}} field, so the default (38, 18) is used. It is not really an issue, 
because you can do everything using DataFrames; you just need to specify the 
schema, as in the sketch below.
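
For illustration, a minimal sketch of what specifying the schema can look like at 
the DataFrame level; the column names and the (10, 2) precision are assumptions, 
not something prescribed by this ticket:

{code}
// Hedged sketch: explicit schema with DecimalType(10, 2) instead of the default (38, 18).
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DecimalType, StringType, StructField, StructType}

object DecimalSchemaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("decimal-schema").master("local[*]").getOrCreate()

    val schema = StructType(Seq(
      StructField("id", StringType, nullable = true),
      StructField("money", DecimalType(10, 2), nullable = true)
    ))

    val rows = spark.sparkContext.parallelize(Seq(
      Row("1", BigDecimal("22.50")),
      Row("2", BigDecimal("500.66"))
    ))

    // createDataFrame with an explicit schema keeps decimal(10,2) end to end.
    val df = spark.createDataFrame(rows, schema)
    df.printSchema()

    spark.stop()
  }
}
{code}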

> case class datasets - ability to specify decimal precision and scale
> 
>
> Key: SPARK-18484
> URL: https://issues.apache.org/jira/browse/SPARK-18484
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Damian Momot
>Priority: Major
>
> Currently when using decimal type (BigDecimal in scala case class) there's no 
> way to enforce precision and scale. This is quite critical when saving data - 
> regarding space usage and compatibility with external systems (for example 
> Hive table) because spark saves data as Decimal(38,18)
> {code}
> case class TestClass(id: String, money: BigDecimal)
> val testDs = spark.createDataset(Seq(
>   TestClass("1", BigDecimal("22.50")),
>   TestClass("2", BigDecimal("500.66"))
> ))
> testDs.printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(38,18) (nullable = true)
> {code}
> Workaround is to convert dataset to dataframe before saving and manually cast 
> to specific decimal scale/precision:
> {code}
> import org.apache.spark.sql.types.DecimalType
> val testDf = testDs.toDF()
> testDf
>   .withColumn("money", testDf("money").cast(DecimalType(10,2)))
>   .printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(10,2) (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26687) Building Spark Images has non-intuitive behaviour with paths to custom Dockerfiles

2019-01-24 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26687.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23613
[https://github.com/apache/spark/pull/23613]

> Building Spark Images has non-intuitive behaviour with paths to custom 
> Dockerfiles
> --
>
> Key: SPARK-26687
> URL: https://issues.apache.org/jira/browse/SPARK-26687
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Rob Vesse
>Assignee: Rob Vesse
>Priority: Major
> Fix For: 3.0.0
>
>
> With the changes from SPARK-26025 
> (https://github.com/apache/spark/pull/23019) we use a pared down Docker build 
> context which significantly improves build times.  However the way this is 
> implemented leads to non-intuitive behaviour when supplying custom Docker 
> file paths.  This is because of the following code snippets:
> {code}
> (cd $(img_ctx_dir base) && docker build $NOCACHEARG "${BUILD_ARGS[@]}" \
> -t $(image_ref spark) \
> -f "$BASEDOCKERFILE" .)
> {code}
> Since the script changes to the temporary build context directory and then 
> runs {{docker build}} there any path given for the Docker file is taken as 
> relative to the temporary build context directory rather than to the 
> directory where the user invoked the script.  This produces somewhat 
> unhelpful errors e.g.
> {noformat}
> > ./bin/docker-image-tool.sh -r rvesse -t badpath -p 
> > resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile
> >  build
> Sending build context to Docker daemon  218.4MB
> Step 1/15 : FROM openjdk:8-alpine
>  ---> 5801f7d008e5
> Step 2/15 : ARG spark_uid=185
>  ---> Using cache
>  ---> 5fd63df1ca39
> ...
> Successfully tagged rvesse/spark:badpath
> unable to prepare context: unable to evaluate symlinks in Dockerfile path: 
> lstat 
> /Users/rvesse/Documents/Work/Code/spark/target/tmp/docker/pyspark/resource-managers:
>  no such file or directory
> Failed to build PySpark Docker image, please refer to Docker build output for 
> details.
> {noformat}
> Here we can see that the relative path that was valid where the user typed 
> the command was not valid inside the build context directory.
> To resolve this we need to ensure that we are resolving relative paths to 
> Docker files appropriately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26687) Building Spark Images has non-intuitive behaviour with paths to custom Dockerfiles

2019-01-24 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26687:
--

Assignee: Rob Vesse

> Building Spark Images has non-intuitive behaviour with paths to custom 
> Dockerfiles
> --
>
> Key: SPARK-26687
> URL: https://issues.apache.org/jira/browse/SPARK-26687
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Rob Vesse
>Assignee: Rob Vesse
>Priority: Major
>
> With the changes from SPARK-26025 
> (https://github.com/apache/spark/pull/23019) we use a pared down Docker build 
> context which significantly improves build times.  However the way this is 
> implemented leads to non-intuitive behaviour when supplying custom Docker 
> file paths.  This is because of the following code snippets:
> {code}
> (cd $(img_ctx_dir base) && docker build $NOCACHEARG "${BUILD_ARGS[@]}" \
> -t $(image_ref spark) \
> -f "$BASEDOCKERFILE" .)
> {code}
> Since the script changes to the temporary build context directory and then 
> runs {{docker build}} there any path given for the Docker file is taken as 
> relative to the temporary build context directory rather than to the 
> directory where the user invoked the script.  This produces somewhat 
> unhelpful errors e.g.
> {noformat}
> > ./bin/docker-image-tool.sh -r rvesse -t badpath -p 
> > resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile
> >  build
> Sending build context to Docker daemon  218.4MB
> Step 1/15 : FROM openjdk:8-alpine
>  ---> 5801f7d008e5
> Step 2/15 : ARG spark_uid=185
>  ---> Using cache
>  ---> 5fd63df1ca39
> ...
> Successfully tagged rvesse/spark:badpath
> unable to prepare context: unable to evaluate symlinks in Dockerfile path: 
> lstat 
> /Users/rvesse/Documents/Work/Code/spark/target/tmp/docker/pyspark/resource-managers:
>  no such file or directory
> Failed to build PySpark Docker image, please refer to Docker build output for 
> details.
> {noformat}
> Here we can see that the relative path that was valid where the user typed 
> the command was not valid inside the build context directory.
> To resolve this we need to ensure that we are resolving relative paths to 
> Docker files appropriately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26717) Support PodPriority for spark driver and executor on kubernetes

2019-01-24 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26717.

Resolution: Not A Problem

Pretty sure this is covered by pod templates (SPARK-24434).

> Support PodPriority for spark driver and executor on kubernetes
> ---
>
> Key: SPARK-26717
> URL: https://issues.apache.org/jira/browse/SPARK-26717
> Project: Spark
>  Issue Type: Wish
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Li Gao
>Priority: Major
>
> Hi,
> I'd like to see whether we could enhance the current Spark 2.4 on k8s support 
> to bring in the `PodPriority` feature that's available since k8s v1.11+
> [https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/]
> This is helpful in a shared cluster environment to ensure SLA for jobs with 
> different priorities.
> Thanks!
> Li



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26717) Support PodPriority for spark driver and executor on kubernetes

2019-01-24 Thread Li Gao (JIRA)
Li Gao created SPARK-26717:
--

 Summary: Support PodPriority for spark driver and executor on 
kubernetes
 Key: SPARK-26717
 URL: https://issues.apache.org/jira/browse/SPARK-26717
 Project: Spark
  Issue Type: Wish
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Li Gao


Hi,

I'd like to see whether we could enhance the current Spark 2.4 on k8s support 
to bring in the `PodPriority` feature that's available since k8s v1.11+

[https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/]

This is helpful in a shared cluster environment to ensure SLA for jobs with 
different priorities.

Thanks!

Li



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26690) Checkpoints of Dataframes are not visible in the SQL UI

2019-01-24 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-26690:
-

Assignee: Tom van Bussel

> Checkpoints of Dataframes are not visible in the SQL UI
> ---
>
> Key: SPARK-26690
> URL: https://issues.apache.org/jira/browse/SPARK-26690
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Tom van Bussel
>Assignee: Tom van Bussel
>Priority: Major
>
> Checkpoints and local checkpoints of dataframes do not show up in the SQL UI.
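
For context, a minimal sketch (the path and data are assumptions) of the two 
operations this report refers to; in the affected version neither produces a node 
in the SQL UI:

{code}
// Hedged sketch of Dataset.checkpoint() and Dataset.localCheckpoint() usage.
import org.apache.spark.sql.SparkSession

object CheckpointUiExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("checkpoint-ui").master("local[*]").getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/checkpoint-ui-demo") // assumed scratch dir

    val df = spark.range(0, 1000).toDF("id")

    val checkpointed = df.checkpoint()             // reliable checkpoint, truncates the lineage
    val locallyCheckpointed = df.localCheckpoint() // executor-local checkpoint

    checkpointed.count()
    locallyCheckpointed.count()

    spark.stop()
  }
}
{code}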



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26690) Checkpoints of Dataframes are not visible in the SQL UI

2019-01-24 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-26690.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

> Checkpoints of Dataframes are not visible in the SQL UI
> ---
>
> Key: SPARK-26690
> URL: https://issues.apache.org/jira/browse/SPARK-26690
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Tom van Bussel
>Assignee: Tom van Bussel
>Priority: Major
> Fix For: 3.0.0
>
>
> Checkpoints and local checkpoints of dataframes do not show up in the SQL UI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18484) case class datasets - ability to specify decimal precision and scale

2019-01-24 Thread Franco Bonazza (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751174#comment-16751174
 ] 

Franco Bonazza commented on SPARK-18484:


What if you have a DataFrame with a higher precision, e.g. (38, 17)? This 
effectively breaks df.as[TestClass]: it fails on truncation because I can't 
specify the schema of the resulting Dataset. Am I missing something? It doesn't 
seem like a non-issue to me. The only workaround I see is not using Datasets.
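
As an aside, here is a hedged sketch of one possible workaround (not taken from 
this thread; the column names and the (38, 17) source precision are illustrative 
assumptions): explicitly cast the column to the encoder's default decimal(38, 18) 
before calling .as[TestClass], accepting the potential precision loss:

{code}
// Hedged workaround sketch: align the DataFrame column with the encoder's decimal(38,18)
// before converting to a Dataset, instead of abandoning Datasets entirely.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DecimalType

object CastBeforeAsExample {
  case class TestClass(id: String, money: BigDecimal)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cast-before-as").master("local[*]").getOrCreate()
    import spark.implicits._

    // Pretend this came from a source whose schema is decimal(38,17).
    val df = Seq(("1", BigDecimal("22.50")), ("2", BigDecimal("500.66")))
      .toDF("id", "money")
      .withColumn("money", $"money".cast(DecimalType(38, 17)))

    // Explicitly casting to (38,18) lets the encoder accept the column.
    val ds = df.withColumn("money", $"money".cast(DecimalType(38, 18))).as[TestClass]
    ds.show()

    spark.stop()
  }
}
{code}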

> case class datasets - ability to specify decimal precision and scale
> 
>
> Key: SPARK-18484
> URL: https://issues.apache.org/jira/browse/SPARK-18484
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Damian Momot
>Priority: Major
>
> Currently when using decimal type (BigDecimal in scala case class) there's no 
> way to enforce precision and scale. This is quite critical when saving data - 
> regarding space usage and compatibility with external systems (for example 
> Hive table) because spark saves data as Decimal(38,18)
> {code}
> case class TestClass(id: String, money: BigDecimal)
> val testDs = spark.createDataset(Seq(
>   TestClass("1", BigDecimal("22.50")),
>   TestClass("2", BigDecimal("500.66"))
> ))
> testDs.printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(38,18) (nullable = true)
> {code}
> Workaround is to convert dataset to dataframe before saving and manually cast 
> to specific decimal scale/precision:
> {code}
> import org.apache.spark.sql.types.DecimalType
> val testDf = testDs.toDF()
> testDf
>   .withColumn("money", testDf("money").cast(DecimalType(10,2)))
>   .printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(10,2) (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26716) Refactor supportDataType API: the supported types of read/write should be consistent

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26716:


Assignee: (was: Apache Spark)

> Refactor supportDataType API:  the supported types of read/write should be 
> consistent
> -
>
> Key: SPARK-26716
> URL: https://issues.apache.org/jira/browse/SPARK-26716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> 1. rename to supportsDataType
> 2. remove parameter isReadPath. The supported types of read/write should be 
> consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26699) Dataset column output discrepancies

2019-01-24 Thread Praveena (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751159#comment-16751159
 ] 

Praveena commented on SPARK-26699:
--

I am trying to understand why it behaves differently in local and cluster 
mode. 

Please let me know the mailing list, so I can reach them. Thanks in advance

> Dataset column output discrepancies 
> 
>
> Key: SPARK-26699
> URL: https://issues.apache.org/jira/browse/SPARK-26699
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output
>Affects Versions: 2.3.2
>Reporter: Praveena
>Priority: Major
>
> Hi,
>  
> When I run my job in local mode (meaning standalone in Eclipse) with the same 
> parquet input files, the output is -
>  
> locations
>  
>  [[[true, [[, phys...
>  [[[true, [[, phys...
>  [[[true, [[, phys...
>  null
>  [[[true, [[, phys...
>  [[[true, [[, phys...
>  [[[true, [[, phys...
>  [[[true, [[, phys...
>  [[[true, [[, phys...
>  [[[true, [[, phys...
>  
> But when I run the same code base with the same input parquet files in YARN 
> cluster mode, my output is as below -
> 
>  locations
>  
>  [*WrappedArray*([tr...
>  [*WrappedArray*([tr...
>  [WrappedArray([tr...
>  null
>  [WrappedArray([tr...
>  [WrappedArray([tr...
>  [WrappedArray([tr...
>  [WrappedArray([tr...
>  [WrappedArray([tr...
>  [WrappedArray([tr...
> It's appending WrappedArray :(
> I am using Apache Spark 2.3.2 and EMR version 5.19.0. What could be the reason 
> for the discrepancies in the output of certain table columns?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26716) Refactor supportDataType API: the supported types of read/write should be consistent

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26716:


Assignee: Apache Spark

> Refactor supportDataType API:  the supported types of read/write should be 
> consistent
> -
>
> Key: SPARK-26716
> URL: https://issues.apache.org/jira/browse/SPARK-26716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> 1. rename to supportsDataType
> 2. remove parameter isReadPath. The supported types of read/write should be 
> consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26716) Refactor supportDataType API: the supported types of read/write should be consistent

2019-01-24 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-26716:
--

 Summary: Refactor supportDataType API:  the supported types of 
read/write should be consistent
 Key: SPARK-26716
 URL: https://issues.apache.org/jira/browse/SPARK-26716
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


1. rename to supportsDataType
2. remove parameter isReadPath. The supported types of read/write should be 
consistent
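
A hedged before/after sketch of the intent (the trait and signatures below are 
assumptions for illustration, not the actual Spark source):

{code}
import org.apache.spark.sql.types.DataType

// Assumed shape of the current API: the isReadPath flag lets read and write disagree.
trait SupportDataTypeBefore {
  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean
}

// Proposed shape: renamed, and a single answer shared by the read and write paths.
trait SupportsDataTypeAfter {
  def supportsDataType(dataType: DataType): Boolean
}
{code}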



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26713) PipedRDD may holds stdin writer and stdout read threads even if the task is finished

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26713:


Assignee: Apache Spark

> PipedRDD may holds stdin writer and stdout read threads even if the task is 
> finished
> 
>
> Key: SPARK-26713
> URL: https://issues.apache.org/jira/browse/SPARK-26713
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.3, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Xianjin YE
>Assignee: Apache Spark
>Priority: Major
>
> During an investigation of an OOM in one internal production job, I found that 
> PipedRDD leaks memory. After some digging, the problem comes down to the fact 
> that PipedRDD doesn't release its stdin writer and stdout reader threads even 
> if the task is finished.
>  
> PipedRDD creates two threads: a stdin writer and a stdout reader. If we are 
> lucky and the task finishes normally, these two threads exit normally. If the 
> subprocess (pipe command) fails, the task will be marked failed, but the stdin 
> writer will keep running until it consumes its parent RDD's iterator. There is 
> even a race condition with ShuffledRDD + PipedRDD: the ShuffleBlockFetchIterator 
> is cleaned up at task completion and hangs the stdin writer thread, which leaks 
> memory. 
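
For readers unfamiliar with the API, a minimal hedged sketch of the pipe usage 
pattern being described (the command and data are illustrative assumptions; this 
is not the internal PipedRDD implementation):

{code}
// Hedged usage sketch of RDD.pipe: it forks an external process per partition and streams
// the partition's records through it. Per the report above, background threads feed the
// process's stdin and drain its output, and those threads are what can outlive a failed task.
import org.apache.spark.sql.SparkSession

object PipedRddExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("piped-rdd").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val input = sc.parallelize(1 to 100000, numSlices = 4).map(_.toString)

    // Count lines per partition with an external command (assumes a Unix-like environment).
    val counted = input.pipe(Seq("wc", "-l"))
    counted.collect().foreach(println)

    spark.stop()
  }
}
{code}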



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26713) PipedRDD may holds stdin writer and stdout read threads even if the task is finished

2019-01-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26713:


Assignee: (was: Apache Spark)

> PipedRDD may holds stdin writer and stdout read threads even if the task is 
> finished
> 
>
> Key: SPARK-26713
> URL: https://issues.apache.org/jira/browse/SPARK-26713
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.3, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Xianjin YE
>Priority: Major
>
> During an investigation of an OOM in one internal production job, I found that 
> PipedRDD leaks memory. After some digging, the problem comes down to the fact 
> that PipedRDD doesn't release its stdin writer and stdout reader threads even 
> if the task is finished.
>  
> PipedRDD creates two threads: a stdin writer and a stdout reader. If we are 
> lucky and the task finishes normally, these two threads exit normally. If the 
> subprocess (pipe command) fails, the task will be marked failed, but the stdin 
> writer will keep running until it consumes its parent RDD's iterator. There is 
> even a race condition with ShuffledRDD + PipedRDD: the ShuffleBlockFetchIterator 
> is cleaned up at task completion and hangs the stdin writer thread, which leaks 
> memory. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26389) temp checkpoint folder at executor should be deleted on graceful shutdown

2019-01-24 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751082#comment-16751082
 ] 

Gabor Somogyi commented on SPARK-26389:
---

I've lowered the priority and will file a PR for this soon.

> temp checkpoint folder at executor should be deleted on graceful shutdown
> -
>
> Key: SPARK-26389
> URL: https://issues.apache.org/jira/browse/SPARK-26389
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Fengyu Cao
>Priority: Minor
>
> {{spark-submit --master mesos:// -conf 
> spark.streaming.stopGracefullyOnShutdown=true  framework>}}
> CTRL-C, framework shutdown
> {{18/12/18 10:27:36 ERROR MicroBatchExecution: Query [id = 
> f512e17a-df88-4414-a5cd-a23550cf1e7f, runId = 
> 24d99723-8d61-48c0-beab-af432f7a19d3] terminated with error 
> org.apache.spark.SparkException: Writing job aborted.}}
> {{/tmp/temporary- on the executor is not deleted due to 
> org.apache.spark.SparkException: Writing job aborted., and this temp 
> checkpoint can't be used for recovery.}}
>  
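
For reference, a hedged sketch (the source, sink and path are illustrative 
assumptions) of the usual way to sidestep the temporary checkpoint entirely: pass 
an explicit, durable checkpointLocation so a restart can recover from it:

{code}
// Hedged sketch: an explicit checkpointLocation avoids relying on /tmp/temporary-* at all.
import org.apache.spark.sql.SparkSession

object ExplicitCheckpointLocation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("explicit-checkpoint").getOrCreate()

    val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    val query = events.writeStream
      .format("console")
      .option("checkpointLocation", "hdfs:///checkpoints/my-query") // assumed durable path
      .start()

    query.awaitTermination()
  }
}
{code}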



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26715) If linux container executor is not set for a GPU cluster GpuResourceHandlerImpl is not initialized and NPE is thrown

2019-01-24 Thread Antal Bálint Steinbach (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antal Bálint Steinbach resolved SPARK-26715.

Resolution: Invalid

YARN issue.

> If linux container executor is not set for a GPU cluster 
> GpuResourceHandlerImpl is not initialized and NPE is thrown
> 
>
> Key: SPARK-26715
> URL: https://issues.apache.org/jira/browse/SPARK-26715
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Antal Bálint Steinbach
>Priority: Major
>
> If the GPU plugin is set for the NodeManager, it is possible to run jobs with GPU.
> But if the Linux container executor is not configured, an NPE is thrown when 
> calling GpuResourcePlugin.getNMResourceInfo.
> Also, there are no warnings in the log if GPU is misconfigured like this. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26389) temp checkpoint folder at executor should be deleted on graceful shutdown

2019-01-24 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-26389:
--
Priority: Minor  (was: Major)

> temp checkpoint folder at executor should be deleted on graceful shutdown
> -
>
> Key: SPARK-26389
> URL: https://issues.apache.org/jira/browse/SPARK-26389
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Fengyu Cao
>Priority: Minor
>
> {{spark-submit --master mesos:// -conf 
> spark.streaming.stopGracefullyOnShutdown=true  framework>}}
> CTRL-C, framework shutdown
> {{18/12/18 10:27:36 ERROR MicroBatchExecution: Query [id = 
> f512e17a-df88-4414-a5cd-a23550cf1e7f, runId = 
> 24d99723-8d61-48c0-beab-af432f7a19d3] terminated with error 
> org.apache.spark.SparkException: Writing job aborted.}}
> {{/tmp/temporary- on the executor is not deleted due to 
> org.apache.spark.SparkException: Writing job aborted., and this temp 
> checkpoint can't be used for recovery.}}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26715) If linux container executor is not set for a GPU cluster GpuResourceHandlerImpl is not initialized and NPE is thrown

2019-01-24 Thread Antal Bálint Steinbach (JIRA)
Antal Bálint Steinbach created SPARK-26715:
--

 Summary: If linux container executor is not set for a GPU cluster 
GpuResourceHandlerImpl is not initialized and NPE is thrown
 Key: SPARK-26715
 URL: https://issues.apache.org/jira/browse/SPARK-26715
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 3.0.0
Reporter: Antal Bálint Steinbach


If the GPU plugin is set for the NodeManager, it is possible to run jobs with GPU.

But if the Linux container executor is not configured, an NPE is thrown when calling 
GpuResourcePlugin.getNMResourceInfo.

Also, there are no warnings in the log if GPU is misconfigured like this. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17333) Make pyspark interface friendly with static analysis

2019-01-24 Thread Maciej Szymkiewicz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751040#comment-16751040
 ] 

Maciej Szymkiewicz edited comment on SPARK-17333 at 1/24/19 11:58 AM:
--

[~Alexander_Gorokhov] Personally I maintain relatively complete set of 
annotations 
([https://github.com/zero323/pyspark-stubs|https://github.com/zero323/pyspark-stubs)])
 and in the past declared that I am happy to donate it and help with merge 
([http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html|http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html)]).
 This topic has been also raised on another occasion 
([http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html|http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html).]).

 

In practice, though, I don't think there is interest in that at the moment, and 
based on my experience there is quite significant maintenance overhead. While 
adding simple annotations is trivial, useful ones can be tricky, and the ecosystem 
is not exactly stable at the moment, so things tend to break or exhibit 
significantly different behavior from tool to tool.


was (Author: zero323):
[~Alexander_Gorokhov] Personally I maintain relatively complete set of 
annotations 
([https://github.com/zero323/pyspark-stubs|https://github.com/zero323/pyspark-stubs)])
 and in the past declared that I am happy to donate it and help with merge 
([http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html|http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html)]).
 This topic has been also raised on another occasion 
([http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html|http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html).])

> Make pyspark interface friendly with static analysis
> 
>
> Key: SPARK-17333
> URL: https://issues.apache.org/jira/browse/SPARK-17333
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Assaf Mendelson
>Priority: Trivial
>
> Static analysis tools, such as those common in IDEs for auto-completion and 
> error marking, tend to have poor results with pyspark.
> This is caused by two separate issues:
> The first is that many elements are created programmatically such as the max 
> function in pyspark.sql.functions.
> The second is that we tend to use pyspark in a functional manner, meaning 
> that we chain many actions (e.g. df.filter().groupby().agg()) and since 
> python has no type information this can become difficult to understand.
> I would suggest changing the interface to improve it. 
> The way I see it we can either change the interface or provide interface 
> enhancements.
> Changing the interface means defining (when possible) all functions directly, 
> i.e. instead of having a __functions__ dictionary in pyspark.sql.functions.py 
> and then generating the functions programmatically by using _create_function, 
> create the function directly. 
> def max(col):
>"""
>docstring
>"""
>_create_function(max,"docstring")
> Second we can add type indications to all functions as defined in pep 484 or 
> pycharm's legacy type hinting 
> (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy).
> So for example max might look like this:
> def max(col):
>"""
>does  a max.
>   :type col: Column
>   :rtype Column
>"""
> This would provide a wide range of support as these types of hints, while old 
> are pretty common.
> A second option is to use PEP 3107 to define interfaces (pyi files)
> in this case we might have a functions.pyi file which would contain something 
> like:
> def max(col: Column) -> Column:
> """
> Aggregate function: returns the maximum value of the expression in a 
> group.
> """
> ...
> This has the advantage of easier to understand types and not touching the 
> code (only supported code) but has the disadvantage of being separately 
> managed (i.e. greater chance of doing a mistake) and the fact that some 
> configuration would be needed in the IDE/static analysis tool instead of 
> working out of the box.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17333) Make pyspark interface friendly with static analysis

2019-01-24 Thread Maciej Szymkiewicz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751040#comment-16751040
 ] 

Maciej Szymkiewicz commented on SPARK-17333:


[~Alexander_Gorokhov] Personally I maintain relatively complete set of 
annotations ([https://github.com/zero323/pyspark-stubs)] and in the past 
declared that I am happy to donate it and help with merge 
([http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html)].
 This topic has been also raised on another occasion 
([http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html)|http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html).]

> Make pyspark interface friendly with static analysis
> 
>
> Key: SPARK-17333
> URL: https://issues.apache.org/jira/browse/SPARK-17333
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Assaf Mendelson
>Priority: Trivial
>
> Static analysis tools, such as those common in IDEs for auto-completion and 
> error marking, tend to have poor results with pyspark.
> This is caused by two separate issues:
> The first is that many elements are created programmatically such as the max 
> function in pyspark.sql.functions.
> The second is that we tend to use pyspark in a functional manner, meaning 
> that we chain many actions (e.g. df.filter().groupby().agg()) and since 
> python has no type information this can become difficult to understand.
> I would suggest changing the interface to improve it. 
> The way I see it we can either change the interface or provide interface 
> enhancements.
> Changing the interface means defining (when possible) all functions directly, 
> i.e. instead of having a __functions__ dictionary in pyspark.sql.functions.py 
> and then generating the functions programmatically by using _create_function, 
> create the function directly. 
> def max(col):
>"""
>docstring
>"""
>_create_function(max,"docstring")
> Second we can add type indications to all functions as defined in pep 484 or 
> pycharm's legacy type hinting 
> (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy).
> So for example max might look like this:
> def max(col):
>"""
>does  a max.
>   :type col: Column
>   :rtype Column
>"""
> This would provide a wide range of support as these types of hints, while old 
> are pretty common.
> A second option is to use PEP 3107 to define interfaces (pyi files)
> in this case we might have a functions.pyi file which would contain something 
> like:
> def max(col: Column) -> Column:
> """
> Aggregate function: returns the maximum value of the expression in a 
> group.
> """
> ...
> This has the advantage of easier to understand types and not touching the 
> code (only supported code) but has the disadvantage of being separately 
> managed (i.e. greater chance of doing a mistake) and the fact that some 
> configuration would be needed in the IDE/static analysis tool instead of 
> working out of the box.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17333) Make pyspark interface friendly with static analysis

2019-01-24 Thread Maciej Szymkiewicz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751040#comment-16751040
 ] 

Maciej Szymkiewicz edited comment on SPARK-17333 at 1/24/19 11:54 AM:
--

[~Alexander_Gorokhov] Personally I maintain relatively complete set of 
annotations 
([https://github.com/zero323/pyspark-stubs|https://github.com/zero323/pyspark-stubs)])
 and in the past declared that I am happy to donate it and help with merge 
([http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html|http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html)]).
 This topic has been also raised on another occasion 
([http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html|http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html).])


was (Author: zero323):
[~Alexander_Gorokhov] Personally I maintain relatively complete set of 
annotations 
([https://github.com/zero323/pyspark-stubs|https://github.com/zero323/pyspark-stubs)])
 and in the past declared that I am happy to donate it and help with merge 
([http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html)].
 This topic has been also raised on another occasion 
([http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html)|http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html).]

> Make pyspark interface friendly with static analysis
> 
>
> Key: SPARK-17333
> URL: https://issues.apache.org/jira/browse/SPARK-17333
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Assaf Mendelson
>Priority: Trivial
>
> Static analysis tools, such as those common in IDEs for auto-completion and 
> error marking, tend to have poor results with pyspark.
> This is caused by two separate issues:
> The first is that many elements are created programmatically such as the max 
> function in pyspark.sql.functions.
> The second is that we tend to use pyspark in a functional manner, meaning 
> that we chain many actions (e.g. df.filter().groupby().agg()) and since 
> python has no type information this can become difficult to understand.
> I would suggest changing the interface to improve it. 
> The way I see it we can either change the interface or provide interface 
> enhancements.
> Changing the interface means defining (when possible) all functions directly, 
> i.e. instead of having a __functions__ dictionary in pyspark.sql.functions.py 
> and then generating the functions programmatically by using _create_function, 
> create the function directly. 
> def max(col):
>"""
>docstring
>"""
>_create_function(max,"docstring")
> Second we can add type indications to all functions as defined in pep 484 or 
> pycharm's legacy type hinting 
> (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy).
> So for example max might look like this:
> def max(col):
>"""
>does  a max.
>   :type col: Column
>   :rtype Column
>"""
> This would provide a wide range of support as these types of hints, while old 
> are pretty common.
> A second option is to use PEP 3107 to define interfaces (pyi files)
> in this case we might have a functions.pyi file which would contain something 
> like:
> def max(col: Column) -> Column:
> """
> Aggregate function: returns the maximum value of the expression in a 
> group.
> """
> ...
> This has the advantage of easier to understand types and not touching the 
> code (only supported code) but has the disadvantage of being separately 
> managed (i.e. greater chance of doing a mistake) and the fact that some 
> configuration would be needed in the IDE/static analysis tool instead of 
> working out of the box.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17333) Make pyspark interface friendly with static analysis

2019-01-24 Thread Maciej Szymkiewicz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751040#comment-16751040
 ] 

Maciej Szymkiewicz edited comment on SPARK-17333 at 1/24/19 11:53 AM:
--

[~Alexander_Gorokhov] Personally I maintain relatively complete set of 
annotations 
([https://github.com/zero323/pyspark-stubs|https://github.com/zero323/pyspark-stubs)])and
 in the past declared that I am happy to donate it and help with merge 
([http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html)].
 This topic has been also raised on another occasion 
([http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html)|http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html).]


was (Author: zero323):
[~Alexander_Gorokhov] Personally I maintain relatively complete set of 
annotations ([https://github.com/zero323/pyspark-stubs)] and in the past 
declared that I am happy to donate it and help with merge 
([http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html)].
 This topic has been also raised on another occasion 
([http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html)|http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html).]

> Make pyspark interface friendly with static analysis
> 
>
> Key: SPARK-17333
> URL: https://issues.apache.org/jira/browse/SPARK-17333
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Assaf Mendelson
>Priority: Trivial
>
> Static analysis tools, such as those common in IDEs for auto-completion and 
> error marking, tend to have poor results with pyspark.
> This is caused by two separate issues:
> The first is that many elements are created programmatically such as the max 
> function in pyspark.sql.functions.
> The second is that we tend to use pyspark in a functional manner, meaning 
> that we chain many actions (e.g. df.filter().groupby().agg()) and since 
> python has no type information this can become difficult to understand.
> I would suggest changing the interface to improve it. 
> The way I see it we can either change the interface or provide interface 
> enhancements.
> Changing the interface means defining (when possible) all functions directly, 
> i.e. instead of having a __functions__ dictionary in pyspark.sql.functions.py 
> and then generating the functions programmatically by using _create_function, 
> create the function directly. 
> def max(col):
>"""
>docstring
>"""
>_create_function(max,"docstring")
> Second we can add type indications to all functions as defined in pep 484 or 
> pycharm's legacy type hinting 
> (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy).
> So for example max might look like this:
> def max(col):
>"""
>does  a max.
>   :type col: Column
>   :rtype Column
>"""
> This would provide a wide range of support as these types of hints, while old 
> are pretty common.
> A second option is to use PEP 3107 to define interfaces (pyi files)
> in this case we might have a functions.pyi file which would contain something 
> like:
> def max(col: Column) -> Column:
> """
> Aggregate function: returns the maximum value of the expression in a 
> group.
> """
> ...
> This has the advantage of easier to understand types and not touching the 
> code (only supported code) but has the disadvantage of being separately 
> managed (i.e. greater chance of doing a mistake) and the fact that some 
> configuration would be needed in the IDE/static analysis tool instead of 
> working out of the box.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17333) Make pyspark interface friendly with static analysis

2019-01-24 Thread Maciej Szymkiewicz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751040#comment-16751040
 ] 

Maciej Szymkiewicz edited comment on SPARK-17333 at 1/24/19 11:53 AM:
--

[~Alexander_Gorokhov] Personally I maintain relatively complete set of 
annotations 
([https://github.com/zero323/pyspark-stubs|https://github.com/zero323/pyspark-stubs)])
 and in the past declared that I am happy to donate it and help with merge 
([http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html)].
 This topic has been also raised on another occasion 
([http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html)|http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html).]


was (Author: zero323):
[~Alexander_Gorokhov] Personally I maintain relatively complete set of 
annotations 
([https://github.com/zero323/pyspark-stubs|https://github.com/zero323/pyspark-stubs)])and
 in the past declared that I am happy to donate it and help with merge 
([http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html)].
 This topic has been also raised on another occasion 
([http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html)|http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html).]

> Make pyspark interface friendly with static analysis
> 
>
> Key: SPARK-17333
> URL: https://issues.apache.org/jira/browse/SPARK-17333
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Assaf Mendelson
>Priority: Trivial
>
> Static analysis tools, such as those common in IDEs for auto-completion and 
> error marking, tend to have poor results with pyspark.
> This is caused by two separate issues:
> The first is that many elements are created programmatically such as the max 
> function in pyspark.sql.functions.
> The second is that we tend to use pyspark in a functional manner, meaning 
> that we chain many actions (e.g. df.filter().groupby().agg()) and since 
> python has no type information this can become difficult to understand.
> I would suggest changing the interface to improve it. 
> The way I see it we can either change the interface or provide interface 
> enhancements.
> Changing the interface means defining (when possible) all functions directly, 
> i.e. instead of having a __functions__ dictionary in pyspark.sql.functions.py 
> and then generating the functions programmatically by using _create_function, 
> create the function directly. 
> def max(col):
>"""
>docstring
>"""
>_create_function(max,"docstring")
> Second we can add type indications to all functions as defined in pep 484 or 
> pycharm's legacy type hinting 
> (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy).
> So for example max might look like this:
> def max(col):
>"""
>does  a max.
>   :type col: Column
>   :rtype Column
>"""
> This would provide a wide range of support as these types of hints, while old 
> are pretty common.
> A second option is to use PEP 3107 to define interfaces (pyi files)
> in this case we might have a functions.pyi file which would contain something 
> like:
> def max(col: Column) -> Column:
> """
> Aggregate function: returns the maximum value of the expression in a 
> group.
> """
> ...
> This has the advantage of easier to understand types and not touching the 
> code (only supported code) but has the disadvantage of being separately 
> managed (i.e. greater chance of doing a mistake) and the fact that some 
> configuration would be needed in the IDE/static analysis tool instead of 
> working out of the box.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25713) Implement copy() for ColumnarArray

2019-01-24 Thread Artsiom Yudovin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751028#comment-16751028
 ] 

Artsiom Yudovin commented on SPARK-25713:
-

[~cloud_fan] 

> Implement copy() for ColumnarArray
> --
>
> Key: SPARK-25713
> URL: https://issues.apache.org/jira/browse/SPARK-25713
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Liwen Sun
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26710) ImageSchemaSuite has some errors when running it in local laptop

2019-01-24 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated SPARK-26710:

Attachment: wx20190124-192...@2x.png
wx20190124-192...@2x.png

> ImageSchemaSuite has some errors when running it in local laptop
> 
>
> Key: SPARK-26710
> URL: https://issues.apache.org/jira/browse/SPARK-26710
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: xubo245
>Priority: Major
> Attachments: wx20190124-192...@2x.png, wx20190124-192...@2x.png
>
>
> ImageSchemaSuite and org.apache.spark.ml.source.image.ImageFileFormatSuite 
> has some errors when running it in local laptop
>  !wx20190124-192...@2x.png!  !wx20190124-192...@2x.png! 
> {code:java}
> execute, tree:
> Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#17L])
>+- *(1) Project
>   +- *(1) Scan ExistingRDD[image#10]
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#17L])
>+- *(1) Project
>   +- *(1) Scan ExistingRDD[image#10]
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:129)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:488)
>   at 
> org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:429)
>   at 
> org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:428)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:472)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:154)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:719)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296)
>   at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:2756)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:2755)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3291)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3287)
>   at org.apache.spark.sql.Dataset.count(Dataset.scala:2755)
>   at 
> org.apache.spark.ml.image.ImageSchemaSuite.$anonfun$new$2(ImageSchemaSuite.scala:53)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:104)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at 

[jira] [Updated] (SPARK-26710) ImageSchemaSuite has some errors when running it in local laptop

2019-01-24 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated SPARK-26710:

Description: 
ImageSchemaSuite and org.apache.spark.ml.source.image.ImageFileFormatSuite has 
some errors when running it in local laptop
 !wx20190124-192...@2x.png!  !wx20190124-192...@2x.png! 

{code:java}

execute, tree:
Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#17L])
   +- *(1) Project
  +- *(1) Scan ExistingRDD[image#10]

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#17L])
   +- *(1) Project
  +- *(1) Scan ExistingRDD[image#10]

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:129)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:488)
at 
org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:429)
at 
org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:428)
at 
org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:472)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:154)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:719)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296)
at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:2756)
at 
org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:2755)
at 
org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3291)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3287)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2755)
at 
org.apache.spark.ml.image.ImageSchemaSuite.$anonfun$new$2(ImageSchemaSuite.scala:53)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:104)
at 
org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
at 
org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
at 
