[jira] [Updated] (SPARK-25708) HAVING without GROUP BY means global aggregate

2018-11-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25708:

Description: 
According to the SQL standard, when a query contains `HAVING`, it indicates an 
aggregate operator. For more details please refer to 
https://blog.jooq.org/2014/12/04/do-you-really-understand-sqls-group-by-and-having-clauses/

However, in Spark SQL parser, we treat HAVING as a normal filter when there is 
no GROUP BY, which breaks SQL semantic and lead to wrong result. This PR fixes 
the parser.

> HAVING without GROUP BY means global aggregate
> --
>
> Key: SPARK-25708
> URL: https://issues.apache.org/jira/browse/SPARK-25708
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness, release-notes
> Fix For: 2.4.0
>
>
> According to the SQL standard, when a query contains `HAVING`, it indicates 
> an aggregate operator. For more details please refer to 
> https://blog.jooq.org/2014/12/04/do-you-really-understand-sqls-group-by-and-having-clauses/
> However, in Spark SQL parser, we treat HAVING as a normal filter when there 
> is no GROUP BY, which breaks SQL semantic and lead to wrong result. This PR 
> fixes the parser.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26182) Cost increases when optimizing scalaUDF

2018-11-29 Thread Jiayi Liao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704347#comment-16704347
 ] 

Jiayi Liao commented on SPARK-26182:


[~fanweiwen] We're doing some optimizations on our own streaming business. I'm 
appreciate if you're able to tell me the reason for this.

> Cost increases when optimizing scalaUDF
> ---
>
> Key: SPARK-26182
> URL: https://issues.apache.org/jira/browse/SPARK-26182
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.4.0
>Reporter: Jiayi Liao
>Priority: Major
>
> Let's assume that we have a udf called splitUDF which outputs a map data.
>  The SQL
> {code:java}
> select
> g['a'], g['b']
> from
>( select splitUDF(x) as g from table) tbl
> {code}
> will be optimized to the same logical plan of
> {code:java}
> select splitUDF(x)['a'], splitUDF(x)['b'] from table
> {code}
> which means that the splitUDF is executed twice instead of once.
> The optimization is from CollapseProject. 
>  I'm not sure whether this is a bug or not. Please tell me if I was wrong 
> about this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26206) Spark structured streaming with kafka integration fails in update mode

2018-11-29 Thread indraneel r (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704277#comment-16704277
 ] 

indraneel r commented on SPARK-26206:
-

Will check on spark-shell but not sure if it will make any difference. Tried 
the code on scala 2.11 as well. Its the same issue

> Spark structured streaming with kafka integration fails in update mode 
> ---
>
> Key: SPARK-26206
> URL: https://issues.apache.org/jira/browse/SPARK-26206
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Operating system : MacOS Mojave
>  spark version : 2.4.0
> spark-sql-kafka-0-10 : 2.4.0
>  kafka version 1.1.1
> scala version : 2.12.7
>Reporter: indraneel r
>Priority: Blocker
>
> Spark structured streaming with kafka integration fails in update mode with 
> compilation exception in code generation. 
>  Here's the code that was executed:
> {code:java}
> // code placeholder
> override def main(args: Array[String]): Unit = {
>   val spark = SparkSession
>     .builder
>     .master("local[*]")
>     .appName("SparkStreamingTest")
>     .getOrCreate()
>  
>   val kafkaParams = Map[String, String](
>    "kafka.bootstrap.servers" -> "localhost:9092",
>    "startingOffsets" -> "earliest",
>    "subscribe" -> "test_events")
>  
>   val schema = Encoders.product[UserEvent].schema
>   val query = spark.readStream.format("kafka")
>     .options(kafkaParams)
>     .load()
>     .selectExpr("CAST(value AS STRING) as message")
>     .select(from_json(col("message"), schema).as("json"))
>     .select("json.*")
>     .groupBy(window(col("event_time"), "10 minutes"))
>     .count()
>     .writeStream
>     .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
>   println(s"batch : ${batchId}")
>   batch.show(false)
>     }
>     .outputMode("update")
>     .start()
>     query.awaitTermination()
> }{code}
> It succeeds for batch 0 but fails for batch 1 with following exception when 
> more data is arrives in the stream.
> {code:java}
> 18/11/28 22:07:08 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 25, Column 18: A method named "putLong" is not declared in any enclosing 
> class nor any supertype, nor through a static import
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)
>     at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8997)
>     at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:5060)
>     at org.codehaus.janino.UnitCompiler.access$9100(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4421)
>     at 
> org.codehaus.janino.UnitCompiler$16.visitMethodInvocation(UnitCompiler.java:4394)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4394)
>     at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:5575)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3781)
>     at org.codehaus.janino.UnitCompiler.access$5900(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3760)
>     at 
> org.codehaus.janino.UnitCompiler$13.visitMethodInvocation(UnitCompiler.java:3732)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:5062)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3732)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2360)
>     at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:215)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1494)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1487)
>     at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2871)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1487)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1567)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3388)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1357)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1330)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:822)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitComp

[jira] [Created] (SPARK-26228) OOM issue encountered when computing Gramian matrix

2018-11-29 Thread Chen Lin (JIRA)
Chen Lin created SPARK-26228:


 Summary: OOM issue encountered when computing Gramian matrix 
 Key: SPARK-26228
 URL: https://issues.apache.org/jira/browse/SPARK-26228
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.3.0
Reporter: Chen Lin


/**
 * Computes the Gramian matrix `A^T A`.
 *
 * @note This cannot be computed on matrices with more than 65535 columns.
 */

As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
supports computing on matrices with no more than 65535 columns.

 

However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
when computing on matrices with 16000 columns.

 

The root casue seems that the TreeAggregate writes a  very long buffer array 
(16000*16000*8) which exceeds jvm limit(2^31 - 1).

 

Does RowMatrix really supports computing on matrices with no more than 65535 
columns?

I doubt that computeGramianMatrix has a very serious performance issue.

Do anyone has done some performance expriments before?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26228) OOM issue encountered when computing Gramian matrix

2018-11-29 Thread Chen Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Lin updated SPARK-26228:
-
Description: 
/**

 * Computes the Gramian matrix `A^T A`.
  *

 * @note This cannot be computed on matrices with more than 65535 columns.
  */

As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
supports computing on matrices with no more than 65535 columns.

However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
when computing on matrices with 16000 columns.

The root casue seems that the TreeAggregate writes a  very long buffer array 
(16000*16000*8) which exceeds jvm limit(2^31 - 1).

Does RowMatrix really supports computing on matrices with no more than 65535 
columns?

I doubt that computeGramianMatrix has a very serious performance issue.

Do anyone has done some performance expriments before?

  was:
/**

 * Computes the Gramian matrix `A^T A`.
  *

 * @note This cannot be computed on matrices with more than 65535 columns.
  */

As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
supports computing on matrices with no more than 65535 columns.

 

However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
when computing on matrices with 16000 columns.

 

The root casue seems that the TreeAggregate writes a  very long buffer array 
(16000*16000*8) which exceeds jvm limit(2^31 - 1).

 

Does RowMatrix really supports computing on matrices with no more than 65535 
columns?

I doubt that computeGramianMatrix has a very serious performance issue.

Do anyone has done some performance expriments before?


> OOM issue encountered when computing Gramian matrix 
> 
>
> Key: SPARK-26228
> URL: https://issues.apache.org/jira/browse/SPARK-26228
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Chen Lin
>Priority: Major
>
> /**
>  * Computes the Gramian matrix `A^T A`.
>   *
>  * @note This cannot be computed on matrices with more than 65535 columns.
>   */
> As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
> supports computing on matrices with no more than 65535 columns.
> However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
> when computing on matrices with 16000 columns.
> The root casue seems that the TreeAggregate writes a  very long buffer array 
> (16000*16000*8) which exceeds jvm limit(2^31 - 1).
> Does RowMatrix really supports computing on matrices with no more than 65535 
> columns?
> I doubt that computeGramianMatrix has a very serious performance issue.
> Do anyone has done some performance expriments before?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26228) OOM issue encountered when computing Gramian matrix

2018-11-29 Thread Chen Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Lin updated SPARK-26228:
-
Description: 
{quote}/**

 * Computes the Gramian matrix `A^T A`.
  *

 * @note This cannot be computed on matrices with more than 65535 columns.
  */
{quote}
As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
supports computing on matrices with no more than 65535 columns.

However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
when computing on matrices with 16000 columns.

The root casue seems that the TreeAggregate writes a  very long buffer array 
(16000*16000*8) which exceeds jvm limit(2^31 - 1).

Does RowMatrix really supports computing on matrices with no more than 65535 
columns?

I doubt that computeGramianMatrix has a very serious performance issue.

Do anyone has done some performance expriments before?

  was:
/**

 * Computes the Gramian matrix `A^T A`.
  *

 * @note This cannot be computed on matrices with more than 65535 columns.
  */

As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
supports computing on matrices with no more than 65535 columns.

However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
when computing on matrices with 16000 columns.

The root casue seems that the TreeAggregate writes a  very long buffer array 
(16000*16000*8) which exceeds jvm limit(2^31 - 1).

Does RowMatrix really supports computing on matrices with no more than 65535 
columns?

I doubt that computeGramianMatrix has a very serious performance issue.

Do anyone has done some performance expriments before?


> OOM issue encountered when computing Gramian matrix 
> 
>
> Key: SPARK-26228
> URL: https://issues.apache.org/jira/browse/SPARK-26228
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Chen Lin
>Priority: Major
>
> {quote}/**
>  * Computes the Gramian matrix `A^T A`.
>   *
>  * @note This cannot be computed on matrices with more than 65535 columns.
>   */
> {quote}
> As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
> supports computing on matrices with no more than 65535 columns.
> However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
> when computing on matrices with 16000 columns.
> The root casue seems that the TreeAggregate writes a  very long buffer array 
> (16000*16000*8) which exceeds jvm limit(2^31 - 1).
> Does RowMatrix really supports computing on matrices with no more than 65535 
> columns?
> I doubt that computeGramianMatrix has a very serious performance issue.
> Do anyone has done some performance expriments before?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26228) OOM issue encountered when computing Gramian matrix

2018-11-29 Thread Chen Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Lin updated SPARK-26228:
-
Description: 
/**

 * Computes the Gramian matrix `A^T A`.
  *

 * @note This cannot be computed on matrices with more than 65535 columns.
  */

As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
supports computing on matrices with no more than 65535 columns.

 

However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
when computing on matrices with 16000 columns.

 

The root casue seems that the TreeAggregate writes a  very long buffer array 
(16000*16000*8) which exceeds jvm limit(2^31 - 1).

 

Does RowMatrix really supports computing on matrices with no more than 65535 
columns?

I doubt that computeGramianMatrix has a very serious performance issue.

Do anyone has done some performance expriments before?

  was:
/**

 * Computes the Gramian matrix `A^T A`.
 *

 *@note This cannot be computed on matrices with more than 65535 columns.
 */

As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
supports computing on matrices with no more than 65535 columns.

 

However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
when computing on matrices with 16000 columns.

 

The root casue seems that the TreeAggregate writes a  very long buffer array 
(16000*16000*8) which exceeds jvm limit(2^31 - 1).

 

Does RowMatrix really supports computing on matrices with no more than 65535 
columns?

I doubt that computeGramianMatrix has a very serious performance issue.

Do anyone has done some performance expriments before?


> OOM issue encountered when computing Gramian matrix 
> 
>
> Key: SPARK-26228
> URL: https://issues.apache.org/jira/browse/SPARK-26228
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Chen Lin
>Priority: Major
>
> /**
>  * Computes the Gramian matrix `A^T A`.
>   *
>  * @note This cannot be computed on matrices with more than 65535 columns.
>   */
> As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
> supports computing on matrices with no more than 65535 columns.
>  
> However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
> when computing on matrices with 16000 columns.
>  
> The root casue seems that the TreeAggregate writes a  very long buffer array 
> (16000*16000*8) which exceeds jvm limit(2^31 - 1).
>  
> Does RowMatrix really supports computing on matrices with no more than 65535 
> columns?
> I doubt that computeGramianMatrix has a very serious performance issue.
> Do anyone has done some performance expriments before?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26228) OOM issue encountered when computing Gramian matrix

2018-11-29 Thread Chen Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Lin updated SPARK-26228:
-
Description: 
/**

 * Computes the Gramian matrix `A^T A`.
 *

 *@note This cannot be computed on matrices with more than 65535 columns.
 */

As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
supports computing on matrices with no more than 65535 columns.

 

However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
when computing on matrices with 16000 columns.

 

The root casue seems that the TreeAggregate writes a  very long buffer array 
(16000*16000*8) which exceeds jvm limit(2^31 - 1).

 

Does RowMatrix really supports computing on matrices with no more than 65535 
columns?

I doubt that computeGramianMatrix has a very serious performance issue.

Do anyone has done some performance expriments before?

  was:
/**
 * Computes the Gramian matrix `A^T A`.
 *
 * @note This cannot be computed on matrices with more than 65535 columns.
 */

As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
supports computing on matrices with no more than 65535 columns.

 

However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
when computing on matrices with 16000 columns.

 

The root casue seems that the TreeAggregate writes a  very long buffer array 
(16000*16000*8) which exceeds jvm limit(2^31 - 1).

 

Does RowMatrix really supports computing on matrices with no more than 65535 
columns?

I doubt that computeGramianMatrix has a very serious performance issue.

Do anyone has done some performance expriments before?


> OOM issue encountered when computing Gramian matrix 
> 
>
> Key: SPARK-26228
> URL: https://issues.apache.org/jira/browse/SPARK-26228
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Chen Lin
>Priority: Major
>
> /**
>  * Computes the Gramian matrix `A^T A`.
>  *
>  *@note This cannot be computed on matrices with more than 65535 columns.
>  */
> As the above annotation of computeGramianMatrix in RowMatrix.scala said, it 
> supports computing on matrices with no more than 65535 columns.
>  
> However, we find that it will throw OOM(Request Array Size Exceeds VM Limit) 
> when computing on matrices with 16000 columns.
>  
> The root casue seems that the TreeAggregate writes a  very long buffer array 
> (16000*16000*8) which exceeds jvm limit(2^31 - 1).
>  
> Does RowMatrix really supports computing on matrices with no more than 65535 
> columns?
> I doubt that computeGramianMatrix has a very serious performance issue.
> Do anyone has done some performance expriments before?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility

2018-11-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26188.
-
   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0

Issue resolved by pull request 23165
[https://github.com/apache/spark/pull/23165]

> Spark 2.4.0 Partitioning behavior breaks backwards compatibility
> 
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Assignee: Gengliang Wang
>Priority: Critical
> Fix For: 3.0.0, 2.4.1
>
>
> My team uses spark to partition and output parquet files to amazon S3. We 
> typically use 256 partitions, from 00 to ff.
> We've observed that in spark 2.3.2 and prior, it reads the partitions as 
> strings by default. However, in spark 2.4.0 and later, the type of each 
> partition is inferred by default, and partitions such as 00 become 0 and 4d 
> become 4.0.
>  Here is a log sample of this behavior from one of our jobs:
>  2.4.0:
> {code:java}
> 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
> range: 0-662, partition values: [0]
> 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
> range: 0-662, partition values: [ef]
> 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
> range: 0-662, partition values: [4a]
> 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
> range: 0-662, partition values: [74]
> 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
> range: 0-662, partition values: [f5]
> 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
> range: 0-662, partition values: [50]
> 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
> range: 0-662, partition values: [70]
> 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
> range: 0-662, partition values: [b9]
> 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
> range: 0-662, partition values: [d2]
> 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, 
> range: 0-662, partition values: [51]
> 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
> range: 0-662, partition values: [84]
> 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
> range: 0-662, partition values: [b5]
> 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
> range: 0-662, partition values: [88]
> 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
> range: 0-662, partition values: [4.0]
> 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
> range: 0-662, partition values: [ac]
> 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
> range: 0-662, partition values: [24]
> 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
> range: 0-662, partition values: [fd]
> 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
> range: 0-662, partition values: [52]
> 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
> range: 0-662, partition values: [ab]
> 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, 
> range: 0-662, partition values: [f8]
> 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-h

[jira] [Assigned] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility

2018-11-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26188:
---

Assignee: Gengliang Wang

> Spark 2.4.0 Partitioning behavior breaks backwards compatibility
> 
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Assignee: Gengliang Wang
>Priority: Critical
> Fix For: 2.4.1, 3.0.0
>
>
> My team uses spark to partition and output parquet files to amazon S3. We 
> typically use 256 partitions, from 00 to ff.
> We've observed that in spark 2.3.2 and prior, it reads the partitions as 
> strings by default. However, in spark 2.4.0 and later, the type of each 
> partition is inferred by default, and partitions such as 00 become 0 and 4d 
> become 4.0.
>  Here is a log sample of this behavior from one of our jobs:
>  2.4.0:
> {code:java}
> 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, 
> range: 0-662, partition values: [0]
> 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, 
> range: 0-662, partition values: [ef]
> 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, 
> range: 0-662, partition values: [4a]
> 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, 
> range: 0-662, partition values: [74]
> 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, 
> range: 0-662, partition values: [f5]
> 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, 
> range: 0-662, partition values: [50]
> 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, 
> range: 0-662, partition values: [70]
> 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, 
> range: 0-662, partition values: [b9]
> 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, 
> range: 0-662, partition values: [d2]
> 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, 
> range: 0-662, partition values: [51]
> 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, 
> range: 0-662, partition values: [84]
> 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, 
> range: 0-662, partition values: [b5]
> 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, 
> range: 0-662, partition values: [88]
> 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, 
> range: 0-662, partition values: [4.0]
> 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, 
> range: 0-662, partition values: [ac]
> 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, 
> range: 0-662, partition values: [24]
> 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, 
> range: 0-662, partition values: [fd]
> 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, 
> range: 0-662, partition values: [52]
> 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, 
> range: 0-662, partition values: [ab]
> 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, 
> range: 0-662, partition values: [f8]
> 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: 
> s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, 
> range: 0-662, partition values: [7a]
> 18/11/27 14:02:50 INFO FileScanRDD: Reading File path: 
> s3a

[jira] [Commented] (SPARK-26227) from_[csv|json] should accept schema_of_[csv|json] in R API

2018-11-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704230#comment-16704230
 ] 

Apache Spark commented on SPARK-26227:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23184

> from_[csv|json] should accept schema_of_[csv|json] in R API
> ---
>
> Key: SPARK-26227
> URL: https://issues.apache.org/jira/browse/SPARK-26227
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> We should make {{from_csv}} and {{from_json}} accept {{structType}}, 
> DDL-formatted string, and {{schema_of_[csv|json]}} as schema so that we can 
> utilise both {{schema_of_json}} and {{schema_of_csv}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26060) Track SparkConf entries and make SET command reject such entries.

2018-11-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26060.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23031
[https://github.com/apache/spark/pull/23031]

> Track SparkConf entries and make SET command reject such entries.
> -
>
> Key: SPARK-26060
> URL: https://issues.apache.org/jira/browse/SPARK-26060
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently the {{SET}} command works without any warnings even if the 
> specified key is for {{SparkConf}} entries and it has no effect because the 
> command does not update {{SparkConf}}, but the behavior might confuse users. 
> We should track {{SparkConf}} entries and make the command reject for such 
> entries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26060) Track SparkConf entries and make SET command reject such entries.

2018-11-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26060:
---

Assignee: Takuya Ueshin

> Track SparkConf entries and make SET command reject such entries.
> -
>
> Key: SPARK-26060
> URL: https://issues.apache.org/jira/browse/SPARK-26060
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently the {{SET}} command works without any warnings even if the 
> specified key is for {{SparkConf}} entries and it has no effect because the 
> command does not update {{SparkConf}}, but the behavior might confuse users. 
> We should track {{SparkConf}} entries and make the command reject for such 
> entries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26227) from_[csv|json] should accept schema_of_[csv|json] in R API

2018-11-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704231#comment-16704231
 ] 

Apache Spark commented on SPARK-26227:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23184

> from_[csv|json] should accept schema_of_[csv|json] in R API
> ---
>
> Key: SPARK-26227
> URL: https://issues.apache.org/jira/browse/SPARK-26227
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> We should make {{from_csv}} and {{from_json}} accept {{structType}}, 
> DDL-formatted string, and {{schema_of_[csv|json]}} as schema so that we can 
> utilise both {{schema_of_json}} and {{schema_of_csv}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26227) from_[csv|json] should accept schema_of_[csv|json] in R API

2018-11-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26227:


Assignee: Apache Spark

> from_[csv|json] should accept schema_of_[csv|json] in R API
> ---
>
> Key: SPARK-26227
> URL: https://issues.apache.org/jira/browse/SPARK-26227
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> We should make {{from_csv}} and {{from_json}} accept {{structType}}, 
> DDL-formatted string, and {{schema_of_[csv|json]}} as schema so that we can 
> utilise both {{schema_of_json}} and {{schema_of_csv}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26227) from_[csv|json] should accept schema_of_[csv|json] in R API

2018-11-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26227:


Assignee: (was: Apache Spark)

> from_[csv|json] should accept schema_of_[csv|json] in R API
> ---
>
> Key: SPARK-26227
> URL: https://issues.apache.org/jira/browse/SPARK-26227
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> We should make {{from_csv}} and {{from_json}} accept {{structType}}, 
> DDL-formatted string, and {{schema_of_[csv|json]}} as schema so that we can 
> utilise both {{schema_of_json}} and {{schema_of_csv}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26227) from_[csv|json] should accept schema_of_[csv|json] in R API

2018-11-29 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-26227:


 Summary: from_[csv|json] should accept schema_of_[csv|json] in R 
API
 Key: SPARK-26227
 URL: https://issues.apache.org/jira/browse/SPARK-26227
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


We should make {{from_csv}} and {{from_json}} accept {{structType}}, 
DDL-formatted string, and {{schema_of_[csv|json]}} as schema so that we can 
utilise both {{schema_of_json}} and {{schema_of_csv}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25446) Add schema_of_json() and schema_of_csv() to R

2018-11-29 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25446.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22939
[https://github.com/apache/spark/pull/22939]

> Add schema_of_json() and schema_of_csv() to R
> -
>
> Key: SPARK-25446
> URL: https://issues.apache.org/jira/browse/SPARK-25446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> The function schem_of_json() and schem_of_csv() are exposed in Scala/Java and 
> Python but not in R. Need to add the function to R too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25501) Kafka delegation token support

2018-11-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25501.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22598
[https://github.com/apache/spark/pull/22598]

> Kafka delegation token support
> --
>
> Key: SPARK-25501
> URL: https://issues.apache.org/jira/browse/SPARK-25501
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
>  Labels: SPIP
> Fix For: 3.0.0
>
>
> In kafka version 1.1 delegation token support is released. As spark updated 
> it's kafka client to 2.0.0 now it's possible to implement delegation token 
> support. Please see description: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25501) Kafka delegation token support

2018-11-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25501:
--

Assignee: Gabor Somogyi

> Kafka delegation token support
> --
>
> Key: SPARK-25501
> URL: https://issues.apache.org/jira/browse/SPARK-25501
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
>  Labels: SPIP
> Fix For: 3.0.0
>
>
> In kafka version 1.1 delegation token support is released. As spark updated 
> it's kafka client to 2.0.0 now it's possible to implement delegation token 
> support. Please see description: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-48+Delegation+token+support+for+Kafka



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization

2018-11-29 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704092#comment-16704092
 ] 

Bryan Cutler commented on SPARK-26200:
--

I think this is a duplicate of 
https://issues.apache.org/jira/browse/SPARK-24915 except the columns are mixed 
up instead of an exception being thrown, could you please confirm 
[~davidlyness]?

> Column values are incorrectly transposed when a field in a PySpark Row 
> requires serialization
> -
>
> Key: SPARK-26200
> URL: https://issues.apache.org/jira/browse/SPARK-26200
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Spark version 2.4.0
> Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
> The same issue is observed when PySpark is run on both macOS 10.13.6 and 
> CentOS 7, so this appears to be a cross-platform issue.
>Reporter: David Lyness
>Priority: Major
>  Labels: correctness
>
> h2. Description of issue
> Whenever a field in a PySpark {{Row}} requires serialization (such as a 
> {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below 
> will assign column values *in alphabetical order*, rather than assigning each 
> column value to its specified columns.
> h3. Code to reproduce:
> {code:java}
> import datetime
> from pyspark.sql import Row
> from pyspark.sql.session import SparkSession
> from pyspark.sql.types import DateType, StringType, StructField, StructType
> spark = SparkSession.builder.getOrCreate()
> schema = StructType([
> StructField("date_column", DateType()),
> StructField("my_b_column", StringType()),
> StructField("my_a_column", StringType()),
> ])
> spark.createDataFrame([Row(
> date_column=datetime.date.today(),
> my_b_column="my_b_value",
> my_a_column="my_a_value"
> )], schema).show()
> {code}
> h3. Expected result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_b_value| my_a_value|
> +---+---+---+{noformat}
> h3. Actual result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_a_value| my_b_value|
> +---+---+---+{noformat}
> (Note that {{my_a_value}} and {{my_b_value}} are transposed.)
> h2. Analysis of issue
> Reviewing [the relevant code on 
> GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622],
>  there are two relevant conditional blocks:
>  
> {code:java}
> if self._needSerializeAnyField:
> # Block 1, does not work correctly
> else:
> # Block 2, works correctly
> {code}
> {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and 
> a dictionary of named columns. In Block 2, there is a conditional that works 
> specifically to serialize a {{Row}} object:
>  
> {code:java}
> elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
> return tuple(obj[n] for n in self.names)
> {code}
> There is no such condition in Block 1, so we fall into this instead:
>  
> {code:java}
> elif isinstance(obj, (tuple, list)):
> return tuple(f.toInternal(v) if c else v
> for f, v, c in zip(self.fields, obj, self._needConversion))
> {code}
> The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will 
> return a different ordering than the schema fields. So we end up with:
> {code:java}
> (date, date, True),
> (b, a, False),
> (a, b, False)
> {code}
> h2. Workarounds
> Correct behaviour is observed if you use a Python {{list}} or {{dict}} 
> instead of PySpark's {{Row}} object:
>  
> {code:java}
> # Using a list works
> spark.createDataFrame([[
> datetime.date.today(),
> "my_b_value",
> "my_a_value"
> ]], schema)
> # Using a dict also works
> spark.createDataFrame([{
> "date_column": datetime.date.today(),
> "my_b_column": "my_b_value",
> "my_a_column": "my_a_value"
> }], schema){code}
> Correct behaviour is also observed if you have no fields that require 
> serialization; in this example, changing {{date_column}} to {{StringType}} 
> avoids the correctness issue.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization

2018-11-29 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704092#comment-16704092
 ] 

Bryan Cutler edited comment on SPARK-26200 at 11/30/18 12:56 AM:
-

I think this is a duplicate of SPARK-24915 except the columns are mixed up 
instead of an exception being thrown, could you please confirm [~davidlyness]?


was (Author: bryanc):
I think this is a duplicate of 
https://issues.apache.org/jira/browse/SPARK-24915 except the columns are mixed 
up instead of an exception being thrown, could you please confirm 
[~davidlyness]?

> Column values are incorrectly transposed when a field in a PySpark Row 
> requires serialization
> -
>
> Key: SPARK-26200
> URL: https://issues.apache.org/jira/browse/SPARK-26200
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Spark version 2.4.0
> Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
> The same issue is observed when PySpark is run on both macOS 10.13.6 and 
> CentOS 7, so this appears to be a cross-platform issue.
>Reporter: David Lyness
>Priority: Major
>  Labels: correctness
>
> h2. Description of issue
> Whenever a field in a PySpark {{Row}} requires serialization (such as a 
> {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below 
> will assign column values *in alphabetical order*, rather than assigning each 
> column value to its specified columns.
> h3. Code to reproduce:
> {code:java}
> import datetime
> from pyspark.sql import Row
> from pyspark.sql.session import SparkSession
> from pyspark.sql.types import DateType, StringType, StructField, StructType
> spark = SparkSession.builder.getOrCreate()
> schema = StructType([
> StructField("date_column", DateType()),
> StructField("my_b_column", StringType()),
> StructField("my_a_column", StringType()),
> ])
> spark.createDataFrame([Row(
> date_column=datetime.date.today(),
> my_b_column="my_b_value",
> my_a_column="my_a_value"
> )], schema).show()
> {code}
> h3. Expected result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_b_value| my_a_value|
> +---+---+---+{noformat}
> h3. Actual result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_a_value| my_b_value|
> +---+---+---+{noformat}
> (Note that {{my_a_value}} and {{my_b_value}} are transposed.)
> h2. Analysis of issue
> Reviewing [the relevant code on 
> GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622],
>  there are two relevant conditional blocks:
>  
> {code:java}
> if self._needSerializeAnyField:
> # Block 1, does not work correctly
> else:
> # Block 2, works correctly
> {code}
> {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and 
> a dictionary of named columns. In Block 2, there is a conditional that works 
> specifically to serialize a {{Row}} object:
>  
> {code:java}
> elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
> return tuple(obj[n] for n in self.names)
> {code}
> There is no such condition in Block 1, so we fall into this instead:
>  
> {code:java}
> elif isinstance(obj, (tuple, list)):
> return tuple(f.toInternal(v) if c else v
> for f, v, c in zip(self.fields, obj, self._needConversion))
> {code}
> The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will 
> return a different ordering than the schema fields. So we end up with:
> {code:java}
> (date, date, True),
> (b, a, False),
> (a, b, False)
> {code}
> h2. Workarounds
> Correct behaviour is observed if you use a Python {{list}} or {{dict}} 
> instead of PySpark's {{Row}} object:
>  
> {code:java}
> # Using a list works
> spark.createDataFrame([[
> datetime.date.today(),
> "my_b_value",
> "my_a_value"
> ]], schema)
> # Using a dict also works
> spark.createDataFrame([{
> "date_column": datetime.date.today(),
> "my_b_column": "my_b_value",
> "my_a_column": "my_a_value"
> }], schema){code}
> Correct behaviour is also observed if you have no fields that require 
> serialization; in this example, changing {{date_column}} to {{StringType}} 
> avoids the correctness issue.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.o

[jira] [Assigned] (SPARK-25977) Parsing decimals from CSV using locale

2018-11-29 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25977:


Assignee: Maxim Gekk

> Parsing decimals from CSV using locale
> --
>
> Key: SPARK-25977
> URL: https://issues.apache.org/jira/browse/SPARK-25977
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Support the locale option to parse decimals from CSV input. Currently CSV 
> parser can handle decimals that contain only dots - '.' which is incorrect 
> format in locales like ru-RU, for example.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25977) Parsing decimals from CSV using locale

2018-11-29 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25977.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22979
[https://github.com/apache/spark/pull/22979]

> Parsing decimals from CSV using locale
> --
>
> Key: SPARK-25977
> URL: https://issues.apache.org/jira/browse/SPARK-25977
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Support the locale option to parse decimals from CSV input. Currently CSV 
> parser can handle decimals that contain only dots - '.' which is incorrect 
> format in locales like ru-RU, for example.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26226) Update query tracker to report timeline for phases, rather than duration

2018-11-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704010#comment-16704010
 ] 

Apache Spark commented on SPARK-26226:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23183

> Update query tracker to report timeline for phases, rather than duration
> 
>
> Key: SPARK-26226
> URL: https://issues.apache.org/jira/browse/SPARK-26226
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> It'd be more useful to report start and end time for each phrase, rather than 
> only a single duration. This way we can look at timelines and figure out if 
> there is any unaccounted time.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26226) Update query tracker to report timeline for phases, rather than duration

2018-11-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26226:


Assignee: Apache Spark  (was: Reynold Xin)

> Update query tracker to report timeline for phases, rather than duration
> 
>
> Key: SPARK-26226
> URL: https://issues.apache.org/jira/browse/SPARK-26226
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Major
>
> It'd be more useful to report start and end time for each phrase, rather than 
> only a single duration. This way we can look at timelines and figure out if 
> there is any unaccounted time.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26226) Update query tracker to report timeline for phases, rather than duration

2018-11-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704011#comment-16704011
 ] 

Apache Spark commented on SPARK-26226:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23183

> Update query tracker to report timeline for phases, rather than duration
> 
>
> Key: SPARK-26226
> URL: https://issues.apache.org/jira/browse/SPARK-26226
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> It'd be more useful to report start and end time for each phrase, rather than 
> only a single duration. This way we can look at timelines and figure out if 
> there is any unaccounted time.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26226) Update query tracker to report timeline for phases, rather than duration

2018-11-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26226:


Assignee: Reynold Xin  (was: Apache Spark)

> Update query tracker to report timeline for phases, rather than duration
> 
>
> Key: SPARK-26226
> URL: https://issues.apache.org/jira/browse/SPARK-26226
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> It'd be more useful to report start and end time for each phrase, rather than 
> only a single duration. This way we can look at timelines and figure out if 
> there is any unaccounted time.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26226) Update query tracker to report timeline for phases, rather than duration

2018-11-29 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-26226:
---

Assignee: Reynold Xin

> Update query tracker to report timeline for phases, rather than duration
> 
>
> Key: SPARK-26226
> URL: https://issues.apache.org/jira/browse/SPARK-26226
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> It'd be more useful to report start and end time for each phrase, rather than 
> only a single duration. This way we can look at timelines and figure out if 
> there is any unaccounted time.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26226) Update query tracker to report timeline for phases, rather than duration

2018-11-29 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-26226:
---

 Summary: Update query tracker to report timeline for phases, 
rather than duration
 Key: SPARK-26226
 URL: https://issues.apache.org/jira/browse/SPARK-26226
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Reynold Xin


It'd be more useful to report start and end time for each phrase, rather than 
only a single duration. This way we can look at timelines and figure out if 
there is any unaccounted time.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26221) Improve Spark SQL instrumentation and metrics

2018-11-29 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26221:

Description: 
This is an umbrella ticket for various small improvements for better metrics 
and instrumentation. Some thoughts:

 

Differentiate query plan that’s writing data out, vs returning data to the 
driver
 * I.e. ETL & report generation vs interactive analysis
 * This is related to the data sink item below. We need to make sure from the 
query plan we can tell what a query is doing





Data sink: Have an operator for data sink, with metrics that can tell us:
 * Write time
 * Number of records written
 * Size of output written
 * Number of partitions modified
 * Metastore update time

 * Also track number of records for collect / limit





Scan
 * Track file listing time (start and end so we can construct timeline, not 
just duration)
 * Track metastore operation time

 * Track IO decoding time for row-based input sources; Need to make sure 
overhead is low





Shuffle
 * Track read time and write time

 * Decide if we can measure serialization and deserialization





Client fetch time
 * Sometimes a query take long to run because it is blocked on the client 
fetching result (e.g. using a result iterator). Record the time blocked on 
client so we can remove it in measuring query execution time.





Make it easy to correlate queries with jobs, stages, and tasks belonging to a 
single query





Better logging:
 * Enable logging the query execution id and TID in executor logs, and query 
execution id in driver logs.

  was:This is an umbrella ticket for various small improvements for better 
metrics and instrumentation.


> Improve Spark SQL instrumentation and metrics
> -
>
> Key: SPARK-26221
> URL: https://issues.apache.org/jira/browse/SPARK-26221
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> This is an umbrella ticket for various small improvements for better metrics 
> and instrumentation. Some thoughts:
>  
> Differentiate query plan that’s writing data out, vs returning data to the 
> driver
>  * I.e. ETL & report generation vs interactive analysis
>  * This is related to the data sink item below. We need to make sure from the 
> query plan we can tell what a query is doing
> Data sink: Have an operator for data sink, with metrics that can tell us:
>  * Write time
>  * Number of records written
>  * Size of output written
>  * Number of partitions modified
>  * Metastore update time
>  * Also track number of records for collect / limit
> Scan
>  * Track file listing time (start and end so we can construct timeline, not 
> just duration)
>  * Track metastore operation time
>  * Track IO decoding time for row-based input sources; Need to make sure 
> overhead is low
> Shuffle
>  * Track read time and write time
>  * Decide if we can measure serialization and deserialization
> Client fetch time
>  * Sometimes a query take long to run because it is blocked on the client 
> fetching result (e.g. using a result iterator). Record the time blocked on 
> client so we can remove it in measuring query execution time.
> Make it easy to correlate queries with jobs, stages, and tasks belonging to a 
> single query
> Better logging:
>  * Enable logging the query execution id and TID in executor logs, and query 
> execution id in driver logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26221) Improve Spark SQL instrumentation and metrics

2018-11-29 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26221:

Description: 
This is an umbrella ticket for various small improvements for better metrics 
and instrumentation. Some thoughts:

 

Differentiate query plan that’s writing data out, vs returning data to the 
driver
 * I.e. ETL & report generation vs interactive analysis
 * This is related to the data sink item below. We need to make sure from the 
query plan we can tell what a query is doing

Data sink: Have an operator for data sink, with metrics that can tell us:
 * Write time
 * Number of records written
 * Size of output written
 * Number of partitions modified
 * Metastore update time
 * Also track number of records for collect / limit

Scan
 * Track file listing time (start and end so we can construct timeline, not 
just duration)
 * Track metastore operation time
 * Track IO decoding time for row-based input sources; Need to make sure 
overhead is low

Shuffle
 * Track read time and write time
 * Decide if we can measure serialization and deserialization

Client fetch time
 * Sometimes a query take long to run because it is blocked on the client 
fetching result (e.g. using a result iterator). Record the time blocked on 
client so we can remove it in measuring query execution time.

Make it easy to correlate queries with jobs, stages, and tasks belonging to a 
single query

Better logging:
 * Enable logging the query execution id and TID in executor logs, and query 
execution id in driver logs.

  was:
This is an umbrella ticket for various small improvements for better metrics 
and instrumentation. Some thoughts:

 

Differentiate query plan that’s writing data out, vs returning data to the 
driver
 * I.e. ETL & report generation vs interactive analysis
 * This is related to the data sink item below. We need to make sure from the 
query plan we can tell what a query is doing





Data sink: Have an operator for data sink, with metrics that can tell us:
 * Write time
 * Number of records written
 * Size of output written
 * Number of partitions modified
 * Metastore update time

 * Also track number of records for collect / limit





Scan
 * Track file listing time (start and end so we can construct timeline, not 
just duration)
 * Track metastore operation time

 * Track IO decoding time for row-based input sources; Need to make sure 
overhead is low





Shuffle
 * Track read time and write time

 * Decide if we can measure serialization and deserialization





Client fetch time
 * Sometimes a query take long to run because it is blocked on the client 
fetching result (e.g. using a result iterator). Record the time blocked on 
client so we can remove it in measuring query execution time.





Make it easy to correlate queries with jobs, stages, and tasks belonging to a 
single query





Better logging:
 * Enable logging the query execution id and TID in executor logs, and query 
execution id in driver logs.


> Improve Spark SQL instrumentation and metrics
> -
>
> Key: SPARK-26221
> URL: https://issues.apache.org/jira/browse/SPARK-26221
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> This is an umbrella ticket for various small improvements for better metrics 
> and instrumentation. Some thoughts:
>  
> Differentiate query plan that’s writing data out, vs returning data to the 
> driver
>  * I.e. ETL & report generation vs interactive analysis
>  * This is related to the data sink item below. We need to make sure from the 
> query plan we can tell what a query is doing
> Data sink: Have an operator for data sink, with metrics that can tell us:
>  * Write time
>  * Number of records written
>  * Size of output written
>  * Number of partitions modified
>  * Metastore update time
>  * Also track number of records for collect / limit
> Scan
>  * Track file listing time (start and end so we can construct timeline, not 
> just duration)
>  * Track metastore operation time
>  * Track IO decoding time for row-based input sources; Need to make sure 
> overhead is low
> Shuffle
>  * Track read time and write time
>  * Decide if we can measure serialization and deserialization
> Client fetch time
>  * Sometimes a query take long to run because it is blocked on the client 
> fetching result (e.g. using a result iterator). Record the time blocked on 
> client so we can remove it in measuring query execution time.
> Make it easy to correlate queries with jobs, stages, and tasks belonging to a 
> single query
> Better logging:
>  * Enable logging the query execution id and TID in executor logs, and query 
> execution id in driver logs.



--
This message was sent by At

[jira] [Updated] (SPARK-26129) Instrumentation for query planning time

2018-11-29 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26129:

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-26221

> Instrumentation for query planning time
> ---
>
> Key: SPARK-26129
> URL: https://issues.apache.org/jira/browse/SPARK-26129
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
> Fix For: 3.0.0
>
>
> We currently don't have good visibility into query planning time (analysis vs 
> optimization vs physical planning). This patch adds a simple utility to track 
> the runtime of various rules and various planning phases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26221) Improve Spark SQL instrumentation and metrics

2018-11-29 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26221:

Description: 
This is an umbrella ticket for various small improvements for better metrics 
and instrumentation. Some thoughts:

 

Differentiate query plan that’s writing data out, vs returning data to the 
driver
 * I.e. ETL & report generation vs interactive analysis
 * This is related to the data sink item below. We need to make sure from the 
query plan we can tell what a query is doing

Data sink: Have an operator for data sink, with metrics that can tell us:
 * Write time
 * Number of records written
 * Size of output written
 * Number of partitions modified
 * Metastore update time
 * Also track number of records for collect / limit

Scan
 * Track file listing time (start and end so we can construct timeline, not 
just duration)
 * Track metastore operation time
 * Track IO decoding time for row-based input sources; Need to make sure 
overhead is low

Shuffle
 * Track read time and write time
 * Decide if we can measure serialization and deserialization

Client fetch time
 * Sometimes a query take long to run because it is blocked on the client 
fetching result (e.g. using a result iterator). Record the time blocked on 
client so we can remove it in measuring query execution time.

Make it easy to correlate queries with jobs, stages, and tasks belonging to a 
single query, e.g. dump execution id in task logs?

Better logging:
 * Enable logging the query execution id and TID in executor logs, and query 
execution id in driver logs.

  was:
This is an umbrella ticket for various small improvements for better metrics 
and instrumentation. Some thoughts:

 

Differentiate query plan that’s writing data out, vs returning data to the 
driver
 * I.e. ETL & report generation vs interactive analysis
 * This is related to the data sink item below. We need to make sure from the 
query plan we can tell what a query is doing

Data sink: Have an operator for data sink, with metrics that can tell us:
 * Write time
 * Number of records written
 * Size of output written
 * Number of partitions modified
 * Metastore update time
 * Also track number of records for collect / limit

Scan
 * Track file listing time (start and end so we can construct timeline, not 
just duration)
 * Track metastore operation time
 * Track IO decoding time for row-based input sources; Need to make sure 
overhead is low

Shuffle
 * Track read time and write time
 * Decide if we can measure serialization and deserialization

Client fetch time
 * Sometimes a query take long to run because it is blocked on the client 
fetching result (e.g. using a result iterator). Record the time blocked on 
client so we can remove it in measuring query execution time.

Make it easy to correlate queries with jobs, stages, and tasks belonging to a 
single query

Better logging:
 * Enable logging the query execution id and TID in executor logs, and query 
execution id in driver logs.


> Improve Spark SQL instrumentation and metrics
> -
>
> Key: SPARK-26221
> URL: https://issues.apache.org/jira/browse/SPARK-26221
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> This is an umbrella ticket for various small improvements for better metrics 
> and instrumentation. Some thoughts:
>  
> Differentiate query plan that’s writing data out, vs returning data to the 
> driver
>  * I.e. ETL & report generation vs interactive analysis
>  * This is related to the data sink item below. We need to make sure from the 
> query plan we can tell what a query is doing
> Data sink: Have an operator for data sink, with metrics that can tell us:
>  * Write time
>  * Number of records written
>  * Size of output written
>  * Number of partitions modified
>  * Metastore update time
>  * Also track number of records for collect / limit
> Scan
>  * Track file listing time (start and end so we can construct timeline, not 
> just duration)
>  * Track metastore operation time
>  * Track IO decoding time for row-based input sources; Need to make sure 
> overhead is low
> Shuffle
>  * Track read time and write time
>  * Decide if we can measure serialization and deserialization
> Client fetch time
>  * Sometimes a query take long to run because it is blocked on the client 
> fetching result (e.g. using a result iterator). Record the time blocked on 
> client so we can remove it in measuring query execution time.
> Make it easy to correlate queries with jobs, stages, and tasks belonging to a 
> single query, e.g. dump execution id in task logs?
> Better logging:
>  * Enable logging the query execution id and TID in executor logs, and query 
> execution id 

[jira] [Updated] (SPARK-26223) Scan: track metastore operation time

2018-11-29 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26223:

Description: 
The Scan node should report how much time it spent in metastore operations. 
Similar to file listing, would be great to also report start and end time for 
constructing a timeline.

 

  was:The Scan node should report how much time it spent in metastore 
operations.


> Scan: track metastore operation time
> 
>
> Key: SPARK-26223
> URL: https://issues.apache.org/jira/browse/SPARK-26223
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Priority: Major
>
> The Scan node should report how much time it spent in metastore operations. 
> Similar to file listing, would be great to also report start and end time for 
> constructing a timeline.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26225) Scan: track decoding time for row-based data sources

2018-11-29 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-26225:
---

 Summary: Scan: track decoding time for row-based data sources
 Key: SPARK-26225
 URL: https://issues.apache.org/jira/browse/SPARK-26225
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Reynold Xin


Scan node should report decoding time for each record, if it is not too much 
overhead.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26209) Allow for dataframe bucketization without Hive

2018-11-29 Thread Sam hendley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703824#comment-16703824
 ] 

Sam hendley commented on SPARK-26209:
-

Seems like dataframe needs a side-channel for the 'invariants' of the dataframe 
like it's bucketing state or how it was partitioned. It could be used for lots 
of other optimizations like bucketing and windowing. Relying on an external 
system to hold that data makes the whole system less cohesive. There are a lot 
of cool things that could happen if operations mutated that invariant state as 
they perform operations. I am guessing that the invariants already exists 
'implicitly' in the DAG graph but managing that state explictly and being able 
to ser/der that state would help make some of these complex optimization like 
bucketization easier to apply. 

For this specific case the side-channel would just contain the 'bucketed' and 
'sorted field' that are stored in Hive. As soon as we do some operation that 
shuffles this data to other partitions or otherwise makes it non-bucketable we 
would clear this state. When we called bucketBy/sortBy etc it would readd the 
correct metadata.

It seems like we could use things like the FileMetaData.key_value_metadata 
fields to store this metadata. Could add this same functionality to Parquet 
Dataframes?

> Allow for dataframe bucketization without Hive
> --
>
> Key: SPARK-26209
> URL: https://issues.apache.org/jira/browse/SPARK-26209
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API, SQL
>Affects Versions: 2.4.0
>Reporter: Walt Elder
>Priority: Minor
>
> As a DataFrame author, I can elect to bucketize my output without involving 
> Hive or HMS, so that my hive-less environment can benefit from this 
> query-optimization technique. 
>  
> https://issues.apache.org/jira/browse/SPARK-19256?focusedCommentId=16345397&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16345397
>  identifies this as a shortcoming with the umbrella feature in provided via 
> SPARK-19256.
>  
> In short, relying on Hive to store metadata *precludes* environments which 
> don't have/use hive from making use of bucketization features. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26224) Results in stackOverFlowError when trying to add 3000 new columns using withColumn function of dataframe.

2018-11-29 Thread Dorjee Tsering (JIRA)
Dorjee Tsering created SPARK-26224:
--

 Summary: Results in stackOverFlowError when trying to add 3000 new 
columns using withColumn function of dataframe.
 Key: SPARK-26224
 URL: https://issues.apache.org/jira/browse/SPARK-26224
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
 Environment: On macbook, used Intellij editor. Ran the above sample 
code as unit test.
Reporter: Dorjee Tsering


Reproduction step:

Run this sample code on your laptop. I am trying to add 3000 new columns to a 
base dataframe with 1 column.

 

 
{code:java}
import spark.implicits._

val newColumnsToBeAdded : Seq[StructField] = for (i <- 1 to 3000) yield new 
StructField("field_" + i, DataTypes.LongType)

val baseDataFrame: DataFrame = Seq((1)).toDF("employee_id")

val result = newColumnsToBeAdded.foldLeft(baseDataFrame)((df, newColumn) => 
df.withColumn(newColumn.name, lit(0)))

result.show(false)
 
{code}
Ends up with following stacktrace:

java.lang.StackOverflowError
 at 
scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
 at 
scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
 at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
 at scala.collection.immutable.List.map(List.scala:296)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25905) BlockManager should expose getRemoteManagedBuffer to avoid creating bytebuffers

2018-11-29 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-25905:


Assignee: Wing Yew Poon

> BlockManager should expose getRemoteManagedBuffer to avoid creating 
> bytebuffers
> ---
>
> Key: SPARK-25905
> URL: https://issues.apache.org/jira/browse/SPARK-25905
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Wing Yew Poon
>Priority: Major
>  Labels: memory-analysis
> Fix For: 3.0.0
>
>
> The block manager currently only lets you get a handle on remote data as a 
> {{ChunkedByteBuffer}}.  But with remote reads of cached data, you really only 
> need an input stream view of the data, which is already available with the 
> {{ManagedBuffer}} which is fetched.  By forcing conversion to a 
> {{ChunkedByteBuffer}}, we end up using more memory than necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25905) BlockManager should expose getRemoteManagedBuffer to avoid creating bytebuffers

2018-11-29 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-25905.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23058
[https://github.com/apache/spark/pull/23058]

> BlockManager should expose getRemoteManagedBuffer to avoid creating 
> bytebuffers
> ---
>
> Key: SPARK-25905
> URL: https://issues.apache.org/jira/browse/SPARK-25905
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
> Fix For: 3.0.0
>
>
> The block manager currently only lets you get a handle on remote data as a 
> {{ChunkedByteBuffer}}.  But with remote reads of cached data, you really only 
> need an input stream view of the data, which is already available with the 
> {{ManagedBuffer}} which is fetched.  By forcing conversion to a 
> {{ChunkedByteBuffer}}, we end up using more memory than necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26223) Scan: track metastore operation time

2018-11-29 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-26223:
---

 Summary: Scan: track metastore operation time
 Key: SPARK-26223
 URL: https://issues.apache.org/jira/browse/SPARK-26223
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Reynold Xin


The Scan node should report how much time it spent in metastore operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26222) Scan: track file listing time

2018-11-29 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-26222:
---

 Summary: Scan: track file listing time
 Key: SPARK-26222
 URL: https://issues.apache.org/jira/browse/SPARK-26222
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Reynold Xin


We should track file listing time and add it to the scan node's SQL metric, so 
we have visibility how much is spent in file listing. It'd be useful to track 
not just duration, but also start and end time so we can construct a timeline.

This requires a little bit design to define what file listing time means, when 
we are reading from cache, vs not cache.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26221) Improve Spark SQL instrumentation and metrics

2018-11-29 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26221:

Description: This is an umbrella ticket for various small improvements for 
better metrics and instrumentation.  (was: This creates an umbrella ticket for 
various small improvements for better metrics and instrumentation.

 

 )

> Improve Spark SQL instrumentation and metrics
> -
>
> Key: SPARK-26221
> URL: https://issues.apache.org/jira/browse/SPARK-26221
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> This is an umbrella ticket for various small improvements for better metrics 
> and instrumentation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26221) Improve Spark SQL instrumentation and metrics

2018-11-29 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-26221:
---

 Summary: Improve Spark SQL instrumentation and metrics
 Key: SPARK-26221
 URL: https://issues.apache.org/jira/browse/SPARK-26221
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 2.4.0
Reporter: Reynold Xin
Assignee: Reynold Xin


This creates an umbrella ticket for various small improvements for better 
metrics and instrumentation.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26219) Executor summary is not getting updated for failure jobs in history server UI

2018-11-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26219:


Assignee: (was: Apache Spark)

> Executor summary is not getting updated for failure jobs in history server UI
> -
>
> Key: SPARK-26219
> URL: https://issues.apache.org/jira/browse/SPARK-26219
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shahid
>Priority: Major
> Attachments: Screenshot from 2018-11-29 22-13-34.png, Screenshot from 
> 2018-11-29 22-13-44.png
>
>
> Test step to reproduce:
> {code:java}
> bin/spark-shell --master yarn --conf spark.executor.instances=3
> sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() 
> {code}
> 1)Open the application from History UI 
> 2) Go to the executor tab
> From History UI:
> !Screenshot from 2018-11-29 22-13-34.png! 
> From Live UI:
>  !Screenshot from 2018-11-29 22-13-44.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26100) [History server ]Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-29 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703693#comment-16703693
 ] 

shahid commented on SPARK-26100:


Sorry, I have wrongly linked the JIRA

> [History server ]Jobs table and Aggregate metrics table are showing lesser 
> number of tasks 
> ---
>
> Key: SPARK-26100
> URL: https://issues.apache.org/jira/browse/SPARK-26100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: shahid
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
> Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from 
> 2018-11-17 16-55-09.png
>
>
> Test step to reproduce:
> 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}
> 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() }}
>  
> 3) Open Application from the history server UI
> Jobs table and Aggregated metrics are showing lesser number of tasks.
>  !Screenshot from 2018-11-17 16-55-09.png! 
>  
>   !Screenshot from 2018-11-17 16-54-42.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26166) CrossValidator.fit() bug,training and validation dataset may overlap

2018-11-29 Thread Xinyong Tian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinyong Tian updated SPARK-26166:
-
Description: 
In the code pyspark.ml.tuning.CrossValidator.fit(), after adding random column

df = dataset.select("*", rand(seed).alias(randCol))

Should add

df.checkpoint()

If  df is  not checkpointed, it will be recomputed each time when train and 
validation dataframe need to be created. The order of rows in df,which 
rand(seed)  is dependent on, is not deterministic . Thus each time random 
column value could be different for a specific row even with seed. Note , 
checkpoint() can not be replaced with cached(), because when a node fails, 
cached table need be  recomputed, thus random number could be different.

This might especially  be a problem when input 'dataset' dataframe is resulted 
from a query including 'where' clause. see below.

[https://dzone.com/articles/non-deterministic-order-for-select-with-limit]

 

 

 

  was:
In the code pyspark.ml.tuning.CrossValidator.fit(), after adding random column

df = dataset.select("*", rand(seed).alias(randCol))

Should add

df.checkpoint()

If  df is  not checkpointed, it will be recomputed each time when train and 
validation dataframe need to be created. The order of rows in df,which 
rand(seed)  is dependent on, is not deterministic . Thus each time random 
column value could be different for a specific row even with seed. Note , 
checkpoint() can not be replaced with cached(), because when a node fails, 
cached table might stilled be  recomputed, thus random number could be 
different.

This might especially  be a problem when input 'dataset' dataframe is resulted 
from a query including 'where' clause. see below.

[https://dzone.com/articles/non-deterministic-order-for-select-with-limit]

 

 

 


> CrossValidator.fit() bug,training and validation dataset may overlap
> 
>
> Key: SPARK-26166
> URL: https://issues.apache.org/jira/browse/SPARK-26166
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Xinyong Tian
>Priority: Major
>
> In the code pyspark.ml.tuning.CrossValidator.fit(), after adding random column
> df = dataset.select("*", rand(seed).alias(randCol))
> Should add
> df.checkpoint()
> If  df is  not checkpointed, it will be recomputed each time when train and 
> validation dataframe need to be created. The order of rows in df,which 
> rand(seed)  is dependent on, is not deterministic . Thus each time random 
> column value could be different for a specific row even with seed. Note , 
> checkpoint() can not be replaced with cached(), because when a node fails, 
> cached table need be  recomputed, thus random number could be different.
> This might especially  be a problem when input 'dataset' dataframe is 
> resulted from a query including 'where' clause. see below.
> [https://dzone.com/articles/non-deterministic-order-for-select-with-limit]
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26166) CrossValidator.fit() bug,training and validation dataset may overlap

2018-11-29 Thread Xinyong Tian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinyong Tian updated SPARK-26166:
-
Description: 
In the code pyspark.ml.tuning.CrossValidator.fit(), after adding random column

df = dataset.select("*", rand(seed).alias(randCol))

Should add

df.checkpoint()

If  df is  not checkpointed, it will be recomputed each time when train and 
validation dataframe need to be created. The order of rows in df,which 
rand(seed)  is dependent on, is not deterministic . Thus each time random 
column value could be different for a specific row even with seed. Note , 
checkpoint() can not be replaced with cached(), because when a node fails, 
cached table might stilled be  recomputed, thus random number could be 
different.

This might especially  be a problem when input 'dataset' dataframe is resulted 
from a query including 'where' clause. see below.

[https://dzone.com/articles/non-deterministic-order-for-select-with-limit]

 

 

 

  was:
In the code pyspark.ml.tuning.CrossValidator.fit(), after adding random column

df = dataset.select("*", rand(seed).alias(randCol))

Should add

df.cache()

If  df not cached, it will be reselect each time when train and validation 
dataframe need to be created. The order of rows in df,which rand(seed)  is 
dependent on, is not deterministic . Thus each time random column value could 
be different for a specific row even with seed.

This might especially  be a problem when input 'dataset' dataframe is resulted 
from a query including 'where' clause. see below.

https://dzone.com/articles/non-deterministic-order-for-select-with-limit

 


> CrossValidator.fit() bug,training and validation dataset may overlap
> 
>
> Key: SPARK-26166
> URL: https://issues.apache.org/jira/browse/SPARK-26166
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Xinyong Tian
>Priority: Major
>
> In the code pyspark.ml.tuning.CrossValidator.fit(), after adding random column
> df = dataset.select("*", rand(seed).alias(randCol))
> Should add
> df.checkpoint()
> If  df is  not checkpointed, it will be recomputed each time when train and 
> validation dataframe need to be created. The order of rows in df,which 
> rand(seed)  is dependent on, is not deterministic . Thus each time random 
> column value could be different for a specific row even with seed. Note , 
> checkpoint() can not be replaced with cached(), because when a node fails, 
> cached table might stilled be  recomputed, thus random number could be 
> different.
> This might especially  be a problem when input 'dataset' dataframe is 
> resulted from a query including 'where' clause. see below.
> [https://dzone.com/articles/non-deterministic-order-for-select-with-limit]
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26177) Automated formatting for Scala code

2018-11-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703700#comment-16703700
 ] 

Apache Spark commented on SPARK-26177:
--

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/23182

> Automated formatting for Scala code
> ---
>
> Key: SPARK-26177
> URL: https://issues.apache.org/jira/browse/SPARK-26177
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
>Priority: Major
> Fix For: 3.0.0
>
>
> Provide an easy way for contributors to run scalafmt only on files that 
> differ from git master
> See discussion at
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Automated-formatting-td25791.html]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26219) Executor summary is not getting updated for failure jobs in history server UI

2018-11-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703694#comment-16703694
 ] 

Apache Spark commented on SPARK-26219:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23181

> Executor summary is not getting updated for failure jobs in history server UI
> -
>
> Key: SPARK-26219
> URL: https://issues.apache.org/jira/browse/SPARK-26219
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shahid
>Priority: Major
> Attachments: Screenshot from 2018-11-29 22-13-34.png, Screenshot from 
> 2018-11-29 22-13-44.png
>
>
> Test step to reproduce:
> {code:java}
> bin/spark-shell --master yarn --conf spark.executor.instances=3
> sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() 
> {code}
> 1)Open the application from History UI 
> 2) Go to the executor tab
> From History UI:
> !Screenshot from 2018-11-29 22-13-34.png! 
> From Live UI:
>  !Screenshot from 2018-11-29 22-13-44.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26219) Executor summary is not getting updated for failure jobs in history server UI

2018-11-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703696#comment-16703696
 ] 

Apache Spark commented on SPARK-26219:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23181

> Executor summary is not getting updated for failure jobs in history server UI
> -
>
> Key: SPARK-26219
> URL: https://issues.apache.org/jira/browse/SPARK-26219
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shahid
>Priority: Major
> Attachments: Screenshot from 2018-11-29 22-13-34.png, Screenshot from 
> 2018-11-29 22-13-44.png
>
>
> Test step to reproduce:
> {code:java}
> bin/spark-shell --master yarn --conf spark.executor.instances=3
> sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() 
> {code}
> 1)Open the application from History UI 
> 2) Go to the executor tab
> From History UI:
> !Screenshot from 2018-11-29 22-13-34.png! 
> From Live UI:
>  !Screenshot from 2018-11-29 22-13-44.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26219) Executor summary is not getting updated for failure jobs in history server UI

2018-11-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26219:


Assignee: Apache Spark

> Executor summary is not getting updated for failure jobs in history server UI
> -
>
> Key: SPARK-26219
> URL: https://issues.apache.org/jira/browse/SPARK-26219
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shahid
>Assignee: Apache Spark
>Priority: Major
> Attachments: Screenshot from 2018-11-29 22-13-34.png, Screenshot from 
> 2018-11-29 22-13-44.png
>
>
> Test step to reproduce:
> {code:java}
> bin/spark-shell --master yarn --conf spark.executor.instances=3
> sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() 
> {code}
> 1)Open the application from History UI 
> 2) Go to the executor tab
> From History UI:
> !Screenshot from 2018-11-29 22-13-34.png! 
> From Live UI:
>  !Screenshot from 2018-11-29 22-13-44.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26158) Enhance the accuracy of covariance in RowMatrix for DenseVector

2018-11-29 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26158:
-

Assignee: Liang Li

> Enhance the accuracy of covariance in RowMatrix for DenseVector
> ---
>
> Key: SPARK-26158
> URL: https://issues.apache.org/jira/browse/SPARK-26158
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: Liang Li
>Assignee: Liang Li
>Priority: Minor
> Fix For: 3.0.0
>
>
> Compare Spark computeCovariance function in RowMatrix for DenseVector and 
> Numpy's function cov,
> *Find two problem, below is the result:*
> *1)The Spark function computeCovariance in RowMatrix is not accuracy*
> input data
> 1.0,2.0,3.0,4.0,5.0
> 2.0,3.0,1.0,2.0,6.0
> Numpy function cov result:
> [[2.5   1.75]
>  [ 1.75  3.7 ]]
> RowMatrix function computeCovariance result:
> 2.5   1.75  
> 1.75  3.701
>  
> 2)For some input case, the result is not good
> generate input data by below logic
> data1 = np.random.normal(loc=10, scale=0.09, size=1000)
> data2 = np.random.normal(loc=20, scale=0.02,size=1000)
>  
> Numpy function cov result:
> [[  8.10536442e-11  -4.35439574e-15]
> [ -4.35439574e-15   3.99928264e-12]]
>  
> RowMatrix function computeCovariance result:
> -0.0027484893798828125  0.001491546630859375 
> 0.001491546630859375    8.087158203125E-4



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26158) Enhance the accuracy of covariance in RowMatrix for DenseVector

2018-11-29 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26158.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23126
[https://github.com/apache/spark/pull/23126]

> Enhance the accuracy of covariance in RowMatrix for DenseVector
> ---
>
> Key: SPARK-26158
> URL: https://issues.apache.org/jira/browse/SPARK-26158
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: Liang Li
>Assignee: Liang Li
>Priority: Minor
> Fix For: 3.0.0
>
>
> Compare Spark computeCovariance function in RowMatrix for DenseVector and 
> Numpy's function cov,
> *Find two problem, below is the result:*
> *1)The Spark function computeCovariance in RowMatrix is not accuracy*
> input data
> 1.0,2.0,3.0,4.0,5.0
> 2.0,3.0,1.0,2.0,6.0
> Numpy function cov result:
> [[2.5   1.75]
>  [ 1.75  3.7 ]]
> RowMatrix function computeCovariance result:
> 2.5   1.75  
> 1.75  3.701
>  
> 2)For some input case, the result is not good
> generate input data by below logic
> data1 = np.random.normal(loc=10, scale=0.09, size=1000)
> data2 = np.random.normal(loc=20, scale=0.02,size=1000)
>  
> Numpy function cov result:
> [[  8.10536442e-11  -4.35439574e-15]
> [ -4.35439574e-15   3.99928264e-12]]
>  
> RowMatrix function computeCovariance result:
> -0.0027484893798828125  0.001491546630859375 
> 0.001491546630859375    8.087158203125E-4



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26183) ConcurrentModificationException when using Spark collectionAccumulator

2018-11-29 Thread Thincrs (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703667#comment-16703667
 ] 

Thincrs commented on SPARK-26183:
-

testing thincrs

> ConcurrentModificationException when using Spark collectionAccumulator
> --
>
> Key: SPARK-26183
> URL: https://issues.apache.org/jira/browse/SPARK-26183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rob Dawson
>Priority: Major
>
> I'm trying to run a Spark-based application on an Azure HDInsight on-demand 
> cluster, and am seeing lots of SparkExceptions (caused by 
> ConcurrentModificationExceptions) being logged. The application runs without 
> these errors when I start a local Spark instance.
> I've seen reports of [similar errors when using 
> accumulators|https://issues.apache.org/jira/browse/SPARK-17463] and my code 
> is indeed using a CollectionAccumulator, however I have placed synchronized 
> blocks everywhere I use it, and it makes no difference. The accumulator code 
> looks like this:
> {code:scala}
> class MySparkClass(sc : SparkContext) {
> val myAccumulator = sc.collectionAccumulator[MyRecord]
> override def add(record: MyRecord) = {
> synchronized {
> myAccumulator.add(record)
> }
> }
> override def endOfBatch() = {
> synchronized {
> myAccumulator.value.asScala.foreach((record: MyRecord) => {
> processIt(record)
> })
> }
> }
> }{code}
> The exceptions don't cause the application to fail, however when the code 
> tries to read values out of the accumulator it is empty and {{processIt}} is 
> never called.
> {code}
> 18/11/26 11:04:37 WARN Executor: Issue communicating with driver in 
> heartbeater
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
> at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:770)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
> at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Metho

[jira] [Updated] (SPARK-26220) Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite

2018-11-29 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-26220:
--
Description: 
There was a kafka-clients version update lately and I've seen a test failure 
like this (rarely comes):
{code:java}
KafkaRelationSuite:
- explicit earliest to latest offsets
- default starting and ending offsets
- explicit offsets
- reuse same dataframe in query
- test late binding start offsets
- bad batch query options
- read Kafka transactional messages: read_committed
- read Kafka transactional messages: read_uncommitted
KafkaSourceStressSuite:
- stress test with multiple topics and partitions *** FAILED ***
  == Results ==
  !== Correct Answer - 49 ==   == Spark Answer - 45 ==
   struct   struct
   [10][10]
   [11][11]
   [12][12]
   [13][13]
   [14][14]
   [15][15]
   [16][16]
   [17][17]
   [18][18]
   [19][19]
   [1] [1]
   [20][20]
   [21][21]
   [22][22]
   [23][23]
   [24][24]
   [25][25]
   [26][26]
  ![27][2]
  ![28][31]
  ![29][32]
  ![2] [33]
  ![30][34]
  ![31][35]
  ![32][36]
  ![33][37]
  ![34][38]
  ![35][39]
  ![36][3]
  ![37][40]
  ![38][41]
  ![39][42]
  ![3] [43]
  ![40][44]
  ![41][45]
  ![42][46]
  ![43][47]
  ![44][48]
  ![45][49]
  ![46][4]
  ![47][5]
  ![48][6]
  ![49][7]
  ![4] [8]
  ![5] [9]
  ![6]
  ![7]
  ![8]
  ![9]


  == Progress ==
 AssertOnQuery(, )
 AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), 
data = Range 0 until 2, message = )
 CheckAnswer: [1],[2]
 StopStream
 
StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@43e8e4a3,Map(),null)
 AddKafkaData(topics = Set(stress1, stress2, stress4, stress5), data = 
Range 2 until 7, message = Delete topic stress3)
 AddKafkaData(topics = Set(stress1, stress2, stress4, stress5), data = 
Range 7 until 16, message = Delete topic stress1)
 CheckAnswer: 
[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15],[16]
 StopStream
 AddKafkaData(topics = Set(stress1, stress2, stress4, stress5), data = 
Range 16 until 23, message = )
 AddKafkaData(topics = Set(stress1, stress2, stress4, stress5), data = 
Range 23 until 24, message = )
 AddKafkaData(topics = Set(stress1, stress2, stress4, stress5), data = 
Range 24 until 25, message = Add partition)
 
StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@1625a1be,Map(),null)
 AddKafkaData(topics = Set(stress1, stress2, stress5), data = Range 25 
until 26, message = Delete topic stress4)
 AddKafkaData(topics = Set(stress1, stress2, stress5), data = Range 26 
until 34, message = Add partition)
 AddKafkaData(topics = Set(stress1, stress5), data = Range 34 until 37, 
message = Delete topic stress2)
 AddKafkaData(topics = Set(stress1, stress5), data = Range 37 until 46, 
message = )
 AddKafkaData(topics = Set(stress1, stress5), data = Range 46 until 49, 
message = Add partition)
  => CheckAnswer: 
[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15],[16],[17],[18],[19],[20],[21],[22],[23],[24],[25],[26],[27],[28],[29],[30],[31],[32],[33],[34],[35],[36],[37],[38],[39],[40],[41],[42],[43],[44],[45],[46],[47],[48],[49]
 AddKafkaData(topics = Set(stress1, stress5), data = Range 49 until 55, 
message = Add partition)
 AddKafkaData(topics = Set(stress1, stress5), data = Range 55 until 58, 
message = Add partition)
 AddKafkaData(topics = Set(stress1, stress5), data = Range 58 until 62, 
message = Delete topic stress1)
 CheckAnswer: 
[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15],[16],[17],[18],[19],[20],[21],[22],[23],[24],[25],[26],[27],[28],[29],[30],[31],[32],[33],[34],[35],[36],[37],[38],[39],[40],[41],[42],[43],[44],[45],[46],[47],[48],[49],[50],[51],[52],[53],[54],[55],[56],[57],[58],[59],[60],[61],[62]
 StopStream
 AddKafkaData(topi

[jira] [Commented] (SPARK-26220) Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite

2018-11-29 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703644#comment-16703644
 ] 

Gabor Somogyi commented on SPARK-26220:
---

[~dongjoon] Don't know whether it's related to kafka version bump but seen this 
issue lately.
Maybe the problem is old but I've had bad luck because there is heavy random 
inside.


> Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite
> 
>
> Key: SPARK-26220
> URL: https://issues.apache.org/jira/browse/SPARK-26220
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> There was a kafka-clients version update lately and I've seen a test failure 
> like this (rarely comes):
> {code:java}
> KafkaRelationSuite:
> - explicit earliest to latest offsets
> - default starting and ending offsets
> - explicit offsets
> - reuse same dataframe in query
> - test late binding start offsets
> - bad batch query options
> - read Kafka transactional messages: read_committed
> - read Kafka transactional messages: read_uncommitted
> KafkaSourceStressSuite:
> - stress test with multiple topics and partitions *** FAILED ***
>   == Results ==
>   !== Correct Answer - 49 ==   == Spark Answer - 45 ==
>struct   struct
>[10][10]
>[11][11]
>[12][12]
>[13][13]
>[14][14]
>[15][15]
>[16][16]
>[17][17]
>[18][18]
>[19][19]
>[1] [1]
>[20][20]
>[21][21]
>[22][22]
>[23][23]
>[24][24]
>[25][25]
>[26][26]
>   ![27][2]
>   ![28][31]
>   ![29][32]
>   ![2] [33]
>   ![30][34]
>   ![31][35]
>   ![32][36]
>   ![33][37]
>   ![34][38]
>   ![35][39]
>   ![36][3]
>   ![37][40]
>   ![38][41]
>   ![39][42]
>   ![3] [43]
>   ![40][44]
>   ![41][45]
>   ![42][46]
>   ![43][47]
>   ![44][48]
>   ![45][49]
>   ![46][4]
>   ![47][5]
>   ![48][6]
>   ![49][7]
>   ![4] [8]
>   ![5] [9]
>   ![6]
>   ![7]
>   ![8]
>   ![9]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26220) Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite

2018-11-29 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-26220:
--
Issue Type: Bug  (was: New Feature)

> Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite
> 
>
> Key: SPARK-26220
> URL: https://issues.apache.org/jira/browse/SPARK-26220
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> There was a kafka-clients version update lately and I've seen a test failure 
> like this (rarely comes):
> {code:java}
> KafkaRelationSuite:
> - explicit earliest to latest offsets
> - default starting and ending offsets
> - explicit offsets
> - reuse same dataframe in query
> - test late binding start offsets
> - bad batch query options
> - read Kafka transactional messages: read_committed
> - read Kafka transactional messages: read_uncommitted
> KafkaSourceStressSuite:
> - stress test with multiple topics and partitions *** FAILED ***
>   == Results ==
>   !== Correct Answer - 49 ==   == Spark Answer - 45 ==
>struct   struct
>[10][10]
>[11][11]
>[12][12]
>[13][13]
>[14][14]
>[15][15]
>[16][16]
>[17][17]
>[18][18]
>[19][19]
>[1] [1]
>[20][20]
>[21][21]
>[22][22]
>[23][23]
>[24][24]
>[25][25]
>[26][26]
>   ![27][2]
>   ![28][31]
>   ![29][32]
>   ![2] [33]
>   ![30][34]
>   ![31][35]
>   ![32][36]
>   ![33][37]
>   ![34][38]
>   ![35][39]
>   ![36][3]
>   ![37][40]
>   ![38][41]
>   ![39][42]
>   ![3] [43]
>   ![40][44]
>   ![41][45]
>   ![42][46]
>   ![43][47]
>   ![44][48]
>   ![45][49]
>   ![46][4]
>   ![47][5]
>   ![48][6]
>   ![49][7]
>   ![4] [8]
>   ![5] [9]
>   ![6]
>   ![7]
>   ![8]
>   ![9]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26220) Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite

2018-11-29 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-26220:
-

 Summary: Flaky Test: 
org.apache.spark.sql.kafka010.KafkaSourceStressSuite
 Key: SPARK-26220
 URL: https://issues.apache.org/jira/browse/SPARK-26220
 Project: Spark
  Issue Type: New Feature
  Components: Tests
Affects Versions: 3.0.0
Reporter: Gabor Somogyi


There was a kafka-clients version update lately and I've seen a test failure 
like this (rarely comes):
{code:java}
KafkaRelationSuite:
- explicit earliest to latest offsets
- default starting and ending offsets
- explicit offsets
- reuse same dataframe in query
- test late binding start offsets
- bad batch query options
- read Kafka transactional messages: read_committed
- read Kafka transactional messages: read_uncommitted
KafkaSourceStressSuite:
- stress test with multiple topics and partitions *** FAILED ***
  == Results ==
  !== Correct Answer - 49 ==   == Spark Answer - 45 ==
   struct   struct
   [10][10]
   [11][11]
   [12][12]
   [13][13]
   [14][14]
   [15][15]
   [16][16]
   [17][17]
   [18][18]
   [19][19]
   [1] [1]
   [20][20]
   [21][21]
   [22][22]
   [23][23]
   [24][24]
   [25][25]
   [26][26]
  ![27][2]
  ![28][31]
  ![29][32]
  ![2] [33]
  ![30][34]
  ![31][35]
  ![32][36]
  ![33][37]
  ![34][38]
  ![35][39]
  ![36][3]
  ![37][40]
  ![38][41]
  ![39][42]
  ![3] [43]
  ![40][44]
  ![41][45]
  ![42][46]
  ![43][47]
  ![44][48]
  ![45][49]
  ![46][4]
  ![47][5]
  ![48][6]
  ![49][7]
  ![4] [8]
  ![5] [9]
  ![6]
  ![7]
  ![8]
  ![9]
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26100) [History server ]Jobs table and Aggregate metrics table are showing lesser number of tasks

2018-11-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703631#comment-16703631
 ] 

Apache Spark commented on SPARK-26100:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23181

> [History server ]Jobs table and Aggregate metrics table are showing lesser 
> number of tasks 
> ---
>
> Key: SPARK-26100
> URL: https://issues.apache.org/jira/browse/SPARK-26100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: shahid
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
> Attachments: Screenshot from 2018-11-17 16-54-42.png, Screenshot from 
> 2018-11-17 16-55-09.png
>
>
> Test step to reproduce:
> 1) {{bin/spark-shell --master yarn --conf spark.executor.instances=3}}
> 2)\{{sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() }}
>  
> 3) Open Application from the history server UI
> Jobs table and Aggregated metrics are showing lesser number of tasks.
>  !Screenshot from 2018-11-17 16-55-09.png! 
>  
>   !Screenshot from 2018-11-17 16-54-42.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26219) Executor summary is not getting updated for failure jobs in history server UI

2018-11-29 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26219:
---
Attachment: Screenshot from 2018-11-29 22-13-44.png

> Executor summary is not getting updated for failure jobs in history server UI
> -
>
> Key: SPARK-26219
> URL: https://issues.apache.org/jira/browse/SPARK-26219
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shahid
>Priority: Major
> Attachments: Screenshot from 2018-11-29 22-13-34.png, Screenshot from 
> 2018-11-29 22-13-44.png
>
>
> Test step to reproduce:
> {code:java}
> bin/spark-shell --master yarn --conf spark.executor.instances=3
> sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() 
> {code}
> 1)Open the application from History UI
> 2) Go to the executor tab



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26219) Executor summary is not getting updated for failure jobs in history server UI

2018-11-29 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26219:
---
Description: 
Test step to reproduce:

{code:java}
bin/spark-shell --master yarn --conf spark.executor.instances=3
sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
executor")}.collect() 
{code}

1)Open the application from History UI 
2) Go to the executor tab

>From History UI:
!Screenshot from 2018-11-29 22-13-34.png! 
>From Live UI:
 !Screenshot from 2018-11-29 22-13-44.png! 


  was:
Test step to reproduce:

{code:java}
bin/spark-shell --master yarn --conf spark.executor.instances=3
sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
executor")}.collect() 
{code}

1)Open the application from History UI
2) Go to the executor tab




> Executor summary is not getting updated for failure jobs in history server UI
> -
>
> Key: SPARK-26219
> URL: https://issues.apache.org/jira/browse/SPARK-26219
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shahid
>Priority: Major
> Attachments: Screenshot from 2018-11-29 22-13-34.png, Screenshot from 
> 2018-11-29 22-13-44.png
>
>
> Test step to reproduce:
> {code:java}
> bin/spark-shell --master yarn --conf spark.executor.instances=3
> sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() 
> {code}
> 1)Open the application from History UI 
> 2) Go to the executor tab
> From History UI:
> !Screenshot from 2018-11-29 22-13-34.png! 
> From Live UI:
>  !Screenshot from 2018-11-29 22-13-44.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26219) Executor summary is not getting updated for failure jobs in history server UI

2018-11-29 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26219:
---
Attachment: Screenshot from 2018-11-29 22-13-34.png

> Executor summary is not getting updated for failure jobs in history server UI
> -
>
> Key: SPARK-26219
> URL: https://issues.apache.org/jira/browse/SPARK-26219
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shahid
>Priority: Major
> Attachments: Screenshot from 2018-11-29 22-13-34.png, Screenshot from 
> 2018-11-29 22-13-44.png
>
>
> Test step to reproduce:
> {code:java}
> bin/spark-shell --master yarn --conf spark.executor.instances=3
> sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() 
> {code}
> 1)Open the application from History UI
> 2) Go to the executor tab



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26219) Executor summary is not getting updated for failure jobs in history server UI

2018-11-29 Thread shahid (JIRA)
shahid created SPARK-26219:
--

 Summary: Executor summary is not getting updated for failure jobs 
in history server UI
 Key: SPARK-26219
 URL: https://issues.apache.org/jira/browse/SPARK-26219
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0, 2.3.2
Reporter: shahid
 Attachments: Screenshot from 2018-11-29 22-13-34.png, Screenshot from 
2018-11-29 22-13-44.png

Test step to reproduce:

{code:java}
bin/spark-shell --master yarn --conf spark.executor.instances=3
sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
executor")}.collect() 
{code}

1)Open the application from History UI
2) Go to the executor tab





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26219) Executor summary is not getting updated for failure jobs in history server UI

2018-11-29 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703619#comment-16703619
 ] 

shahid commented on SPARK-26219:


I will raise a PR

> Executor summary is not getting updated for failure jobs in history server UI
> -
>
> Key: SPARK-26219
> URL: https://issues.apache.org/jira/browse/SPARK-26219
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shahid
>Priority: Major
> Attachments: Screenshot from 2018-11-29 22-13-34.png, Screenshot from 
> 2018-11-29 22-13-44.png
>
>
> Test step to reproduce:
> {code:java}
> bin/spark-shell --master yarn --conf spark.executor.instances=3
> sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() 
> {code}
> 1)Open the application from History UI
> 2) Go to the executor tab



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-26183) ConcurrentModificationException when using Spark collectionAccumulator

2018-11-29 Thread Jairo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jairo updated SPARK-26183:
--
Comment: was deleted

(was: testing thincrs)

> ConcurrentModificationException when using Spark collectionAccumulator
> --
>
> Key: SPARK-26183
> URL: https://issues.apache.org/jira/browse/SPARK-26183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rob Dawson
>Priority: Major
>
> I'm trying to run a Spark-based application on an Azure HDInsight on-demand 
> cluster, and am seeing lots of SparkExceptions (caused by 
> ConcurrentModificationExceptions) being logged. The application runs without 
> these errors when I start a local Spark instance.
> I've seen reports of [similar errors when using 
> accumulators|https://issues.apache.org/jira/browse/SPARK-17463] and my code 
> is indeed using a CollectionAccumulator, however I have placed synchronized 
> blocks everywhere I use it, and it makes no difference. The accumulator code 
> looks like this:
> {code:scala}
> class MySparkClass(sc : SparkContext) {
> val myAccumulator = sc.collectionAccumulator[MyRecord]
> override def add(record: MyRecord) = {
> synchronized {
> myAccumulator.add(record)
> }
> }
> override def endOfBatch() = {
> synchronized {
> myAccumulator.value.asScala.foreach((record: MyRecord) => {
> processIt(record)
> })
> }
> }
> }{code}
> The exceptions don't cause the application to fail, however when the code 
> tries to read values out of the accumulator it is empty and {{processIt}} is 
> never called.
> {code}
> 18/11/26 11:04:37 WARN Executor: Issue communicating with driver in 
> heartbeater
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
> at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:770)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
> at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
>  

[jira] [Issue Comment Deleted] (SPARK-26183) ConcurrentModificationException when using Spark collectionAccumulator

2018-11-29 Thread Jairo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jairo updated SPARK-26183:
--
Comment: was deleted

(was: .../)

> ConcurrentModificationException when using Spark collectionAccumulator
> --
>
> Key: SPARK-26183
> URL: https://issues.apache.org/jira/browse/SPARK-26183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rob Dawson
>Priority: Major
>
> I'm trying to run a Spark-based application on an Azure HDInsight on-demand 
> cluster, and am seeing lots of SparkExceptions (caused by 
> ConcurrentModificationExceptions) being logged. The application runs without 
> these errors when I start a local Spark instance.
> I've seen reports of [similar errors when using 
> accumulators|https://issues.apache.org/jira/browse/SPARK-17463] and my code 
> is indeed using a CollectionAccumulator, however I have placed synchronized 
> blocks everywhere I use it, and it makes no difference. The accumulator code 
> looks like this:
> {code:scala}
> class MySparkClass(sc : SparkContext) {
> val myAccumulator = sc.collectionAccumulator[MyRecord]
> override def add(record: MyRecord) = {
> synchronized {
> myAccumulator.add(record)
> }
> }
> override def endOfBatch() = {
> synchronized {
> myAccumulator.value.asScala.foreach((record: MyRecord) => {
> processIt(record)
> })
> }
> }
> }{code}
> The exceptions don't cause the application to fail, however when the code 
> tries to read values out of the accumulator it is empty and {{processIt}} is 
> never called.
> {code}
> 18/11/26 11:04:37 WARN Executor: Issue communicating with driver in 
> heartbeater
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
> at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:770)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
> at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 

[jira] [Issue Comment Deleted] (SPARK-26183) ConcurrentModificationException when using Spark collectionAccumulator

2018-11-29 Thread Jairo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jairo updated SPARK-26183:
--
Comment: was deleted

(was: ..///)

> ConcurrentModificationException when using Spark collectionAccumulator
> --
>
> Key: SPARK-26183
> URL: https://issues.apache.org/jira/browse/SPARK-26183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rob Dawson
>Priority: Major
>
> I'm trying to run a Spark-based application on an Azure HDInsight on-demand 
> cluster, and am seeing lots of SparkExceptions (caused by 
> ConcurrentModificationExceptions) being logged. The application runs without 
> these errors when I start a local Spark instance.
> I've seen reports of [similar errors when using 
> accumulators|https://issues.apache.org/jira/browse/SPARK-17463] and my code 
> is indeed using a CollectionAccumulator, however I have placed synchronized 
> blocks everywhere I use it, and it makes no difference. The accumulator code 
> looks like this:
> {code:scala}
> class MySparkClass(sc : SparkContext) {
> val myAccumulator = sc.collectionAccumulator[MyRecord]
> override def add(record: MyRecord) = {
> synchronized {
> myAccumulator.add(record)
> }
> }
> override def endOfBatch() = {
> synchronized {
> myAccumulator.value.asScala.foreach((record: MyRecord) => {
> processIt(record)
> })
> }
> }
> }{code}
> The exceptions don't cause the application to fail, however when the code 
> tries to read values out of the accumulator it is empty and {{processIt}} is 
> never called.
> {code}
> 18/11/26 11:04:37 WARN Executor: Issue communicating with driver in 
> heartbeater
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
> at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:770)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
> at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 

[jira] [Issue Comment Deleted] (SPARK-26183) ConcurrentModificationException when using Spark collectionAccumulator

2018-11-29 Thread Jairo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jairo updated SPARK-26183:
--
Comment: was deleted

(was: .)

> ConcurrentModificationException when using Spark collectionAccumulator
> --
>
> Key: SPARK-26183
> URL: https://issues.apache.org/jira/browse/SPARK-26183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rob Dawson
>Priority: Major
>
> I'm trying to run a Spark-based application on an Azure HDInsight on-demand 
> cluster, and am seeing lots of SparkExceptions (caused by 
> ConcurrentModificationExceptions) being logged. The application runs without 
> these errors when I start a local Spark instance.
> I've seen reports of [similar errors when using 
> accumulators|https://issues.apache.org/jira/browse/SPARK-17463] and my code 
> is indeed using a CollectionAccumulator, however I have placed synchronized 
> blocks everywhere I use it, and it makes no difference. The accumulator code 
> looks like this:
> {code:scala}
> class MySparkClass(sc : SparkContext) {
> val myAccumulator = sc.collectionAccumulator[MyRecord]
> override def add(record: MyRecord) = {
> synchronized {
> myAccumulator.add(record)
> }
> }
> override def endOfBatch() = {
> synchronized {
> myAccumulator.value.asScala.foreach((record: MyRecord) => {
> processIt(record)
> })
> }
> }
> }{code}
> The exceptions don't cause the application to fail, however when the code 
> tries to read values out of the accumulator it is empty and {{processIt}} is 
> never called.
> {code}
> 18/11/26 11:04:37 WARN Executor: Issue communicating with driver in 
> heartbeater
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
> at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:770)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
> at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> j

[jira] [Issue Comment Deleted] (SPARK-26183) ConcurrentModificationException when using Spark collectionAccumulator

2018-11-29 Thread Jairo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jairo updated SPARK-26183:
--
Comment: was deleted

(was: .)

> ConcurrentModificationException when using Spark collectionAccumulator
> --
>
> Key: SPARK-26183
> URL: https://issues.apache.org/jira/browse/SPARK-26183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rob Dawson
>Priority: Major
>
> I'm trying to run a Spark-based application on an Azure HDInsight on-demand 
> cluster, and am seeing lots of SparkExceptions (caused by 
> ConcurrentModificationExceptions) being logged. The application runs without 
> these errors when I start a local Spark instance.
> I've seen reports of [similar errors when using 
> accumulators|https://issues.apache.org/jira/browse/SPARK-17463] and my code 
> is indeed using a CollectionAccumulator, however I have placed synchronized 
> blocks everywhere I use it, and it makes no difference. The accumulator code 
> looks like this:
> {code:scala}
> class MySparkClass(sc : SparkContext) {
> val myAccumulator = sc.collectionAccumulator[MyRecord]
> override def add(record: MyRecord) = {
> synchronized {
> myAccumulator.add(record)
> }
> }
> override def endOfBatch() = {
> synchronized {
> myAccumulator.value.asScala.foreach((record: MyRecord) => {
> processIt(record)
> })
> }
> }
> }{code}
> The exceptions don't cause the application to fail, however when the code 
> tries to read values out of the accumulator it is empty and {{processIt}} is 
> never called.
> {code}
> 18/11/26 11:04:37 WARN Executor: Issue communicating with driver in 
> heartbeater
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
> at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:770)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
> at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> j

[jira] [Issue Comment Deleted] (SPARK-26183) ConcurrentModificationException when using Spark collectionAccumulator

2018-11-29 Thread Jairo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jairo updated SPARK-26183:
--
Comment: was deleted

(was: .)

> ConcurrentModificationException when using Spark collectionAccumulator
> --
>
> Key: SPARK-26183
> URL: https://issues.apache.org/jira/browse/SPARK-26183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rob Dawson
>Priority: Major
>
> I'm trying to run a Spark-based application on an Azure HDInsight on-demand 
> cluster, and am seeing lots of SparkExceptions (caused by 
> ConcurrentModificationExceptions) being logged. The application runs without 
> these errors when I start a local Spark instance.
> I've seen reports of [similar errors when using 
> accumulators|https://issues.apache.org/jira/browse/SPARK-17463] and my code 
> is indeed using a CollectionAccumulator, however I have placed synchronized 
> blocks everywhere I use it, and it makes no difference. The accumulator code 
> looks like this:
> {code:scala}
> class MySparkClass(sc : SparkContext) {
> val myAccumulator = sc.collectionAccumulator[MyRecord]
> override def add(record: MyRecord) = {
> synchronized {
> myAccumulator.add(record)
> }
> }
> override def endOfBatch() = {
> synchronized {
> myAccumulator.value.asScala.foreach((record: MyRecord) => {
> processIt(record)
> })
> }
> }
> }{code}
> The exceptions don't cause the application to fail, however when the code 
> tries to read values out of the accumulator it is empty and {{processIt}} is 
> never called.
> {code}
> 18/11/26 11:04:37 WARN Executor: Issue communicating with driver in 
> heartbeater
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
> at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:770)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
> at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> j

[jira] [Commented] (SPARK-26183) ConcurrentModificationException when using Spark collectionAccumulator

2018-11-29 Thread Jairo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703596#comment-16703596
 ] 

Jairo commented on SPARK-26183:
---

testing thincrs

> ConcurrentModificationException when using Spark collectionAccumulator
> --
>
> Key: SPARK-26183
> URL: https://issues.apache.org/jira/browse/SPARK-26183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rob Dawson
>Priority: Major
>
> I'm trying to run a Spark-based application on an Azure HDInsight on-demand 
> cluster, and am seeing lots of SparkExceptions (caused by 
> ConcurrentModificationExceptions) being logged. The application runs without 
> these errors when I start a local Spark instance.
> I've seen reports of [similar errors when using 
> accumulators|https://issues.apache.org/jira/browse/SPARK-17463] and my code 
> is indeed using a CollectionAccumulator, however I have placed synchronized 
> blocks everywhere I use it, and it makes no difference. The accumulator code 
> looks like this:
> {code:scala}
> class MySparkClass(sc : SparkContext) {
> val myAccumulator = sc.collectionAccumulator[MyRecord]
> override def add(record: MyRecord) = {
> synchronized {
> myAccumulator.add(record)
> }
> }
> override def endOfBatch() = {
> synchronized {
> myAccumulator.value.asScala.foreach((record: MyRecord) => {
> processIt(record)
> })
> }
> }
> }{code}
> The exceptions don't cause the application to fail, however when the code 
> tries to read values out of the accumulator it is empty and {{processIt}} is 
> never called.
> {code}
> 18/11/26 11:04:37 WARN Executor: Issue communicating with driver in 
> heartbeater
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
> at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:770)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
> at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.in

[jira] [Commented] (SPARK-26183) ConcurrentModificationException when using Spark collectionAccumulator

2018-11-29 Thread Jairo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703593#comment-16703593
 ] 

Jairo commented on SPARK-26183:
---

..///

> ConcurrentModificationException when using Spark collectionAccumulator
> --
>
> Key: SPARK-26183
> URL: https://issues.apache.org/jira/browse/SPARK-26183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rob Dawson
>Priority: Major
>
> I'm trying to run a Spark-based application on an Azure HDInsight on-demand 
> cluster, and am seeing lots of SparkExceptions (caused by 
> ConcurrentModificationExceptions) being logged. The application runs without 
> these errors when I start a local Spark instance.
> I've seen reports of [similar errors when using 
> accumulators|https://issues.apache.org/jira/browse/SPARK-17463] and my code 
> is indeed using a CollectionAccumulator, however I have placed synchronized 
> blocks everywhere I use it, and it makes no difference. The accumulator code 
> looks like this:
> {code:scala}
> class MySparkClass(sc : SparkContext) {
> val myAccumulator = sc.collectionAccumulator[MyRecord]
> override def add(record: MyRecord) = {
> synchronized {
> myAccumulator.add(record)
> }
> }
> override def endOfBatch() = {
> synchronized {
> myAccumulator.value.asScala.foreach((record: MyRecord) => {
> processIt(record)
> })
> }
> }
> }{code}
> The exceptions don't cause the application to fail, however when the code 
> tries to read values out of the accumulator it is empty and {{processIt}} is 
> never called.
> {code}
> 18/11/26 11:04:37 WARN Executor: Issue communicating with driver in 
> heartbeater
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
> at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:770)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
> at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Metho

[jira] [Assigned] (SPARK-26015) Include a USER directive in project provided Spark Dockerfiles

2018-11-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26015:
--

Assignee: Rob Vesse

> Include a USER directive in project provided Spark Dockerfiles
> --
>
> Key: SPARK-26015
> URL: https://issues.apache.org/jira/browse/SPARK-26015
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Rob Vesse
>Assignee: Rob Vesse
>Priority: Major
> Fix For: 3.0.0
>
>
> The current Dockerfiles provided by the project for running on Kubernetes do 
> not include a [USER 
> directive|https://docs.docker.com/engine/reference/builder/#user] which means 
> that they default to running as {{root}}.  This may lead to unsuspecting 
> users running their Spark jobs with unexpected levels of privilege.
> The project should follow Docker/K8S best practises by including {{USER}} 
> directives in the Dockerfiles.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26015) Include a USER directive in project provided Spark Dockerfiles

2018-11-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26015.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23017
[https://github.com/apache/spark/pull/23017]

> Include a USER directive in project provided Spark Dockerfiles
> --
>
> Key: SPARK-26015
> URL: https://issues.apache.org/jira/browse/SPARK-26015
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Rob Vesse
>Assignee: Rob Vesse
>Priority: Major
> Fix For: 3.0.0
>
>
> The current Dockerfiles provided by the project for running on Kubernetes do 
> not include a [USER 
> directive|https://docs.docker.com/engine/reference/builder/#user] which means 
> that they default to running as {{root}}.  This may lead to unsuspecting 
> users running their Spark jobs with unexpected levels of privilege.
> The project should follow Docker/K8S best practises by including {{USER}} 
> directives in the Dockerfiles.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24498) Add JDK compiler for runtime codegen

2018-11-29 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki resolved SPARK-24498.
--
Resolution: Won't Do

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, JDK compiler can generate smaller bytecode and take less time 
> in compilation compared to Janino. However, in some cases, Janino is better. 
> We should support both for our runtime codegen. Janino will be still our 
> default runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26184) Last updated time is not getting updated in the History Server UI

2018-11-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26184:
--

Assignee: shahid

> Last updated time is not getting updated in the History Server UI
> -
>
> Key: SPARK-26184
> URL: https://issues.apache.org/jira/browse/SPARK-26184
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: shahid
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
> Attachments: Screenshot from 2018-11-27 23-20-11.png, Screenshot from 
> 2018-11-27 23-22-38.png
>
>
> For inprogress application, last updated time is not getting updated.
>  !Screenshot from 2018-11-27 23-20-11.png! 
>  !Screenshot from 2018-11-27 23-22-38.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26184) Last updated time is not getting updated in the History Server UI

2018-11-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26184.

   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0

Issue resolved by pull request 23158
[https://github.com/apache/spark/pull/23158]

> Last updated time is not getting updated in the History Server UI
> -
>
> Key: SPARK-26184
> URL: https://issues.apache.org/jira/browse/SPARK-26184
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: shahid
>Priority: Major
> Fix For: 3.0.0, 2.4.1
>
> Attachments: Screenshot from 2018-11-27 23-20-11.png, Screenshot from 
> 2018-11-27 23-22-38.png
>
>
> For inprogress application, last updated time is not getting updated.
>  !Screenshot from 2018-11-27 23-20-11.png! 
>  !Screenshot from 2018-11-27 23-22-38.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26186) In progress applications with last updated time is lesser than the cleaning interval are getting removed during cleaning logs

2018-11-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26186.

   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0

Issue resolved by pull request 23158
[https://github.com/apache/spark/pull/23158]

> In progress applications with last updated time is lesser than the cleaning 
> interval are getting removed during cleaning logs
> -
>
> Key: SPARK-26186
> URL: https://issues.apache.org/jira/browse/SPARK-26186
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: shahid
>Assignee: shahid
>Priority: Major
> Fix For: 3.0.0, 2.4.1
>
>
> Inporgress applications with last updated time is withing the cleaning 
> interval are getting deleted.
>  
> Added a UT to test the scenario.
> {code:java}
> test("should not clean inprogress application with lastUpdated time less the 
> maxTime") {
> val firstFileModifiedTime = TimeUnit.DAYS.toMillis(1)
> val secondFileModifiedTime = TimeUnit.DAYS.toMillis(6)
> val maxAge = TimeUnit.DAYS.toMillis(7)
> val clock = new ManualClock(0)
> val provider = new FsHistoryProvider(
>   createTestConf().set("spark.history.fs.cleaner.maxAge", 
> s"${maxAge}ms"), clock)
> val log = newLogFile("inProgressApp1", None, inProgress = true)
> writeFile(log, true, None,
>   SparkListenerApplicationStart(
> "inProgressApp1", Some("inProgressApp1"), 3L, "test", 
> Some("attempt1"))
> )
> clock.setTime(firstFileModifiedTime)
> provider.checkForLogs()
> writeFile(log, true, None,
>   SparkListenerApplicationStart(
> "inProgressApp1", Some("inProgressApp1"), 3L, "test", 
> Some("attempt1")),
>   SparkListenerJobStart(0, 1L, Nil, null)
> )
> clock.setTime(secondFileModifiedTime)
> provider.checkForLogs()
> clock.setTime(TimeUnit.DAYS.toMillis(10))
> writeFile(log, true, None,
>   SparkListenerApplicationStart(
> "inProgressApp1", Some("inProgressApp1"), 3L, "test", 
> Some("attempt1")),
>   SparkListenerJobStart(0, 1L, Nil, null),
>   SparkListenerJobEnd(0, 1L, JobSucceeded)
> )
> provider.checkForLogs()
> // This should not trigger any cleanup
> updateAndCheck(provider) { list =>
>   list.size should be(1)
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26186) In progress applications with last updated time is lesser than the cleaning interval are getting removed during cleaning logs

2018-11-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26186:
--

Assignee: shahid

> In progress applications with last updated time is lesser than the cleaning 
> interval are getting removed during cleaning logs
> -
>
> Key: SPARK-26186
> URL: https://issues.apache.org/jira/browse/SPARK-26186
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: shahid
>Assignee: shahid
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> Inporgress applications with last updated time is withing the cleaning 
> interval are getting deleted.
>  
> Added a UT to test the scenario.
> {code:java}
> test("should not clean inprogress application with lastUpdated time less the 
> maxTime") {
> val firstFileModifiedTime = TimeUnit.DAYS.toMillis(1)
> val secondFileModifiedTime = TimeUnit.DAYS.toMillis(6)
> val maxAge = TimeUnit.DAYS.toMillis(7)
> val clock = new ManualClock(0)
> val provider = new FsHistoryProvider(
>   createTestConf().set("spark.history.fs.cleaner.maxAge", 
> s"${maxAge}ms"), clock)
> val log = newLogFile("inProgressApp1", None, inProgress = true)
> writeFile(log, true, None,
>   SparkListenerApplicationStart(
> "inProgressApp1", Some("inProgressApp1"), 3L, "test", 
> Some("attempt1"))
> )
> clock.setTime(firstFileModifiedTime)
> provider.checkForLogs()
> writeFile(log, true, None,
>   SparkListenerApplicationStart(
> "inProgressApp1", Some("inProgressApp1"), 3L, "test", 
> Some("attempt1")),
>   SparkListenerJobStart(0, 1L, Nil, null)
> )
> clock.setTime(secondFileModifiedTime)
> provider.checkForLogs()
> clock.setTime(TimeUnit.DAYS.toMillis(10))
> writeFile(log, true, None,
>   SparkListenerApplicationStart(
> "inProgressApp1", Some("inProgressApp1"), 3L, "test", 
> Some("attempt1")),
>   SparkListenerJobStart(0, 1L, Nil, null),
>   SparkListenerJobEnd(0, 1L, JobSucceeded)
> )
> provider.checkForLogs()
> // This should not trigger any cleanup
> updateAndCheck(provider) { list =>
>   list.size should be(1)
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-11-29 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703556#comment-16703556
 ] 

Kazuaki Ishizaki commented on SPARK-24498:
--

I see. Let us close for now.
We may need some strategy to choose better java bytecode compiler at runtime.

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, JDK compiler can generate smaller bytecode and take less time 
> in compilation compared to Janino. However, in some cases, Janino is better. 
> We should support both for our runtime codegen. Janino will be still our 
> default runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26183) ConcurrentModificationException when using Spark collectionAccumulator

2018-11-29 Thread Jairo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703534#comment-16703534
 ] 

Jairo commented on SPARK-26183:
---

.../

> ConcurrentModificationException when using Spark collectionAccumulator
> --
>
> Key: SPARK-26183
> URL: https://issues.apache.org/jira/browse/SPARK-26183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rob Dawson
>Priority: Major
>
> I'm trying to run a Spark-based application on an Azure HDInsight on-demand 
> cluster, and am seeing lots of SparkExceptions (caused by 
> ConcurrentModificationExceptions) being logged. The application runs without 
> these errors when I start a local Spark instance.
> I've seen reports of [similar errors when using 
> accumulators|https://issues.apache.org/jira/browse/SPARK-17463] and my code 
> is indeed using a CollectionAccumulator, however I have placed synchronized 
> blocks everywhere I use it, and it makes no difference. The accumulator code 
> looks like this:
> {code:scala}
> class MySparkClass(sc : SparkContext) {
> val myAccumulator = sc.collectionAccumulator[MyRecord]
> override def add(record: MyRecord) = {
> synchronized {
> myAccumulator.add(record)
> }
> }
> override def endOfBatch() = {
> synchronized {
> myAccumulator.value.asScala.foreach((record: MyRecord) => {
> processIt(record)
> })
> }
> }
> }{code}
> The exceptions don't cause the application to fail, however when the code 
> tries to read values out of the accumulator it is empty and {{processIt}} is 
> never called.
> {code}
> 18/11/26 11:04:37 WARN Executor: Issue communicating with driver in 
> heartbeater
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
> at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:770)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
> at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method

[jira] [Updated] (SPARK-26081) Do not write empty files by text datasources

2018-11-29 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26081:
--
Docs Text: In Spark 3.0, when empty partitions are written to a CSV, JSON 
or text data source, they no longer produce an empty file, but instead produce 
no file at all.
   Labels: release-notes  (was: )

> Do not write empty files by text datasources
> 
>
> Key: SPARK-26081
> URL: https://issues.apache.org/jira/browse/SPARK-26081
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Text based datasources like CSV, JSON and Text produces empty files for empty 
> partitions. This introduces additional overhead while opening and reading 
> such files back. In current implementation of OutputWriter, the output stream 
> are created eagerly even no records are written to the stream. So, creation 
> can be postponed up to the first write.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26081) Do not write empty files by text datasources

2018-11-29 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26081.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23052
[https://github.com/apache/spark/pull/23052]

> Do not write empty files by text datasources
> 
>
> Key: SPARK-26081
> URL: https://issues.apache.org/jira/browse/SPARK-26081
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Text based datasources like CSV, JSON and Text produces empty files for empty 
> partitions. This introduces additional overhead while opening and reading 
> such files back. In current implementation of OutputWriter, the output stream 
> are created eagerly even no records are written to the stream. So, creation 
> can be postponed up to the first write.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26081) Do not write empty files by text datasources

2018-11-29 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26081:
-

Assignee: Maxim Gekk

> Do not write empty files by text datasources
> 
>
> Key: SPARK-26081
> URL: https://issues.apache.org/jira/browse/SPARK-26081
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Text based datasources like CSV, JSON and Text produces empty files for empty 
> partitions. This introduces additional overhead while opening and reading 
> such files back. In current implementation of OutputWriter, the output stream 
> are created eagerly even no records are written to the stream. So, creation 
> can be postponed up to the first write.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26180) Add a withCreateTempDir function to the SparkCore test case

2018-11-29 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26180:
--
Priority: Minor  (was: Major)

> Add a withCreateTempDir function to the SparkCore test case
> ---
>
> Key: SPARK-26180
> URL: https://issues.apache.org/jira/browse/SPARK-26180
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Priority: Minor
>
> Currently, the common withTempDir function is used in Spark SQL test cases. 
> To handle val dir = Utils. createTempDir () and Utils. deleteRecursively 
> (dir). Unfortunately, the withTempDir function cannot be used in the 
> SparkCore test case. This PR adds a common withCreateTempDir function to 
> clean up SparkCore test cases. thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26183) ConcurrentModificationException when using Spark collectionAccumulator

2018-11-29 Thread Jairo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703444#comment-16703444
 ] 

Jairo commented on SPARK-26183:
---

.

> ConcurrentModificationException when using Spark collectionAccumulator
> --
>
> Key: SPARK-26183
> URL: https://issues.apache.org/jira/browse/SPARK-26183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rob Dawson
>Priority: Major
>
> I'm trying to run a Spark-based application on an Azure HDInsight on-demand 
> cluster, and am seeing lots of SparkExceptions (caused by 
> ConcurrentModificationExceptions) being logged. The application runs without 
> these errors when I start a local Spark instance.
> I've seen reports of [similar errors when using 
> accumulators|https://issues.apache.org/jira/browse/SPARK-17463] and my code 
> is indeed using a CollectionAccumulator, however I have placed synchronized 
> blocks everywhere I use it, and it makes no difference. The accumulator code 
> looks like this:
> {code:scala}
> class MySparkClass(sc : SparkContext) {
> val myAccumulator = sc.collectionAccumulator[MyRecord]
> override def add(record: MyRecord) = {
> synchronized {
> myAccumulator.add(record)
> }
> }
> override def endOfBatch() = {
> synchronized {
> myAccumulator.value.asScala.foreach((record: MyRecord) => {
> processIt(record)
> })
> }
> }
> }{code}
> The exceptions don't cause the application to fail, however when the code 
> tries to read values out of the accumulator it is empty and {{processIt}} is 
> never called.
> {code}
> 18/11/26 11:04:37 WARN Executor: Issue communicating with driver in 
> heartbeater
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
> at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:770)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
> at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.ja

[jira] [Commented] (SPARK-26209) Allow for dataframe bucketization without Hive

2018-11-29 Thread Walt Elder (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703324#comment-16703324
 ] 

Walt Elder commented on SPARK-26209:


I've attempted to get on the dev@ boards about this, to no avail

> Allow for dataframe bucketization without Hive
> --
>
> Key: SPARK-26209
> URL: https://issues.apache.org/jira/browse/SPARK-26209
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API, SQL
>Affects Versions: 2.4.0
>Reporter: Walt Elder
>Priority: Minor
>
> As a DataFrame author, I can elect to bucketize my output without involving 
> Hive or HMS, so that my hive-less environment can benefit from this 
> query-optimization technique. 
>  
> https://issues.apache.org/jira/browse/SPARK-19256?focusedCommentId=16345397&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16345397
>  identifies this as a shortcoming with the umbrella feature in provided via 
> SPARK-19256.
>  
> In short, relying on Hive to store metadata *precludes* environments which 
> don't have/use hive from making use of bucketization features. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26211) Fix InSet for binary, and struct and array with null.

2018-11-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26211.
-
   Resolution: Fixed
 Assignee: Takuya Ueshin
Fix Version/s: 3.0.0
   2.4.1
   2.3.3

> Fix InSet for binary, and struct and array with null.
> -
>
> Key: SPARK-26211
> URL: https://issues.apache.org/jira/browse/SPARK-26211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> Currently {{InSet}} doesn't work properly for binary type, or struct and 
> array type with null value in the set.
>  Because, as for binary type, the {{HashSet}} doesn't work properly for 
> {{Array[Byte]}}, and as for struct and array type with null value in the set, 
> the {{ordering}} will throw a {{NPE}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26177) Automated formatting for Scala code

2018-11-29 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26177.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23148
[https://github.com/apache/spark/pull/23148]

> Automated formatting for Scala code
> ---
>
> Key: SPARK-26177
> URL: https://issues.apache.org/jira/browse/SPARK-26177
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
>Priority: Major
> Fix For: 3.0.0
>
>
> Provide an easy way for contributors to run scalafmt only on files that 
> differ from git master
> See discussion at
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Automated-formatting-td25791.html]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26177) Automated formatting for Scala code

2018-11-29 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26177:
-

Assignee: Cody Koeninger

> Automated formatting for Scala code
> ---
>
> Key: SPARK-26177
> URL: https://issues.apache.org/jira/browse/SPARK-26177
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
>Priority: Major
>
> Provide an easy way for contributors to run scalafmt only on files that 
> differ from git master
> See discussion at
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Automated-formatting-td25791.html]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26218) Throw exception on overflow for integers

2018-11-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703266#comment-16703266
 ] 

Apache Spark commented on SPARK-26218:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21599

> Throw exception on overflow for integers
> 
>
> Key: SPARK-26218
> URL: https://issues.apache.org/jira/browse/SPARK-26218
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Major
>
> SPARK-24598 just updated the documentation in order to state that our 
> addition is a Java style one and not a SQL style. But in order to follow the 
> SQL standard we should instead throw an exception if an overflow occurs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26218) Throw exception on overflow for integers

2018-11-29 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-26218:
---

 Summary: Throw exception on overflow for integers
 Key: SPARK-26218
 URL: https://issues.apache.org/jira/browse/SPARK-26218
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Marco Gaido


SPARK-24598 just updated the documentation in order to state that our addition 
is a Java style one and not a SQL style. But in order to follow the SQL 
standard we should instead throw an exception if an overflow occurs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26215) define reserved keywords after SQL standard

2018-11-29 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-26215:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-26217

> define reserved keywords after SQL standard
> ---
>
> Key: SPARK-26215
> URL: https://issues.apache.org/jira/browse/SPARK-26215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> There are 2 kinds of SQL keywords: reserved and non-reserved. Reserved 
> keywords can't be used as identifiers.
> In Spark SQL, we are too tolerant about non-reserved keywors. A lot of 
> keywords are non-reserved and sometimes it cause ambiguity (IIRC we hit a 
> problem when improving the INTERVAL syntax).
> I think it will be better to just follow other databases or SQL standard to 
> define reserved keywords, so that we don't need to think very hard about how 
> to avoid ambiguity.
> For reference: https://www.postgresql.org/docs/8.1/sql-keywords-appendix.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26218) Throw exception on overflow for integers

2018-11-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703270#comment-16703270
 ] 

Apache Spark commented on SPARK-26218:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21599

> Throw exception on overflow for integers
> 
>
> Key: SPARK-26218
> URL: https://issues.apache.org/jira/browse/SPARK-26218
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Major
>
> SPARK-24598 just updated the documentation in order to state that our 
> addition is a Java style one and not a SQL style. But in order to follow the 
> SQL standard we should instead throw an exception if an overflow occurs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26218) Throw exception on overflow for integers

2018-11-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26218:


Assignee: (was: Apache Spark)

> Throw exception on overflow for integers
> 
>
> Key: SPARK-26218
> URL: https://issues.apache.org/jira/browse/SPARK-26218
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Major
>
> SPARK-24598 just updated the documentation in order to state that our 
> addition is a Java style one and not a SQL style. But in order to follow the 
> SQL standard we should instead throw an exception if an overflow occurs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26218) Throw exception on overflow for integers

2018-11-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26218:


Assignee: Apache Spark

> Throw exception on overflow for integers
> 
>
> Key: SPARK-26218
> URL: https://issues.apache.org/jira/browse/SPARK-26218
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-24598 just updated the documentation in order to state that our 
> addition is a Java style one and not a SQL style. But in order to follow the 
> SQL standard we should instead throw an exception if an overflow occurs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23179) Support option to throw exception if overflow occurs

2018-11-29 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-23179:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-26217

> Support option to throw exception if overflow occurs
> 
>
> Key: SPARK-23179
> URL: https://issues.apache.org/jira/browse/SPARK-23179
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Major
>
> SQL ANSI 2011 states that in case of overflow during arithmetic operations, 
> an exception should be thrown. This is what most of the SQL DBs do (eg. 
> SQLServer, DB2). Hive currently returns NULL (as Spark does) but HIVE-18291 
> is open to be SQL compliant.
> I propose to have a config option which allows to decide whether Spark should 
> behave according to SQL standards or in the current way (ie. returning NULL).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26217) Compliance to SQL standard (SQL:2011)

2018-11-29 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-26217:
---

 Summary: Compliance to SQL standard (SQL:2011)
 Key: SPARK-26217
 URL: https://issues.apache.org/jira/browse/SPARK-26217
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.0.0
Reporter: Marco Gaido


This is an umbrella JIRA following the discussion in SPARK-26215. Its aim is to 
include all the items which need to be tracked in order to make spark reach 
full compliance to the standard (or at least a better compliance).

In the title I specified SQL:2011 in order to define which standard version 
should be followed when there are differences among different versions. We can 
discuss if we should target a different version instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26163) Parsing decimals from JSON using locale

2018-11-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26163.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23132
[https://github.com/apache/spark/pull/23132]

> Parsing decimals from JSON using locale
> ---
>
> Key: SPARK-26163
> URL: https://issues.apache.org/jira/browse/SPARK-26163
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the decimal type can be inferring from a JSON field value only if 
> it is recognised by JacksonParser and does not depend on locale. Also 
> decimals in locale specific format cannot be parsed. The ticket aims to 
> support parsing/converting/inferring of decimals from JSON inputs using 
> locale.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



  1   2   >