[jira] [Assigned] (SPARK-23447) Cleanup codegen template for Literal

2018-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23447:


Assignee: Apache Spark

> Cleanup codegen template for Literal
> 
>
> Key: SPARK-23447
> URL: https://issues.apache.org/jira/browse/SPARK-23447
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Kris Mok
>Assignee: Apache Spark
>Priority: Major
>
> Ideally, the codegen templates for {{Literal}} should emit literals in the 
> {{isNull}} and {{value}} fields of {{ExprCode}} so that they can be 
> effectively inlined into their use sites.
> But currently there are a couple of paths where {{Literal.doGenCode()}} 
> returns an {{ExprCode}} that has a non-trivial {{code}} field, and all of those 
> are actually unnecessary.
> We can make a simple refactoring to make sure all codegen templates for 
> {{Literal}} return empty {{code}} and simple literal/constant expressions in 
> {{isNull}} and {{value}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23446) Explicitly check supported types in toPandas

2018-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366686#comment-16366686
 ] 

Apache Spark commented on SPARK-23446:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/20625

> Explicitly check supported types in toPandas
> 
>
> Key: SPARK-23446
> URL: https://issues.apache.org/jira/browse/SPARK-23446
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See:
> {code}
> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> df = spark.createDataFrame([[bytearray("a")]])
> df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> df.toPandas()
> {code}
> {code}
>  _1
> 0  [97]
>   _1
> 0  a
> {code}
> We didn't finish support for binary type in the Arrow conversion on the Python 
> side. We should disallow it.
> The same thing applies to nested timestamps. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23446) Explicitly check supported types in toPandas

2018-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23446:


Assignee: (was: Apache Spark)

> Explicitly check supported types in toPandas
> 
>
> Key: SPARK-23446
> URL: https://issues.apache.org/jira/browse/SPARK-23446
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See:
> {code}
> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> df = spark.createDataFrame([[bytearray("a")]])
> df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> df.toPandas()
> {code}
> {code}
>  _1
> 0  [97]
>   _1
> 0  a
> {code}
> We didn't finish support for binary type in the Arrow conversion on the Python 
> side. We should disallow it.
> The same thing applies to nested timestamps. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23446) Explicitly check supported types in toPandas

2018-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23446:


Assignee: Apache Spark

> Explicitly check supported types in toPandas
> 
>
> Key: SPARK-23446
> URL: https://issues.apache.org/jira/browse/SPARK-23446
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> See:
> {code}
> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> df = spark.createDataFrame([[bytearray("a")]])
> df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> df.toPandas()
> {code}
> {code}
>  _1
> 0  [97]
>   _1
> 0  a
> {code}
> We didn't finish support for binary type in the Arrow conversion on the Python 
> side. We should disallow it.
> The same thing applies to nested timestamps. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23446) Explicitly check supported types in toPandas

2018-02-15 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23446:
-
Summary: Explicitly check supported types in toPandas  (was: Explicitly 
specify supported types in toPandas)

> Explicitly check supported types in toPandas
> 
>
> Key: SPARK-23446
> URL: https://issues.apache.org/jira/browse/SPARK-23446
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See:
> {code}
> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> df = spark.createDataFrame([[bytearray("a")]])
> df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> df.toPandas()
> {code}
> {code}
>  _1
> 0  [97]
>   _1
> 0  a
> {code}
> We didn't finish support for binary type in the Arrow conversion on the Python 
> side. We should disallow it.
> The same thing applies to nested timestamps. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23447) Cleanup codegen template for Literal

2018-02-15 Thread Kris Mok (JIRA)
Kris Mok created SPARK-23447:


 Summary: Cleanup codegen template for Literal
 Key: SPARK-23447
 URL: https://issues.apache.org/jira/browse/SPARK-23447
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.1, 2.2.0
Reporter: Kris Mok


Ideally, the codegen templates for {{Literal}} should emit literals in the 
{{isNull}} and {{value}} fields of {{ExprCode}} so that they can be effectively 
inlined into their use sites.
But currently there are a couple of paths where {{Literal.doGenCode()}} returns 
an {{ExprCode}} that has a non-trivial {{code}} field, and all of those are 
actually unnecessary.

We can make a simple refactoring to make sure all codegen templates for 
{{Literal}} return empty {{code}} and simple literal/constant expressions in 
{{isNull}} and {{value}}.
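
For illustration, here is a minimal sketch of the intended shape, using a simplified stand-in for {{ExprCode}} (the real class lives in Spark's codegen package; this is not its actual API): a literal contributes no generated statements, only inline constant expressions.

{code}
// Simplified stand-in for Spark's ExprCode; illustration only, not the real API.
case class ExprCode(code: String, isNull: String, value: String)

// A non-null int literal: empty code, constant isNull/value that callers can inline.
def genIntLiteral(v: Int): ExprCode =
  ExprCode(code = "", isNull = "false", value = v.toString)

// A null literal of a given Java type: still no statements, just constants.
def genNullLiteral(javaType: String): ExprCode =
  ExprCode(code = "", isNull = "true", value = s"(($javaType) null)")
{code}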



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23446) Explicitly specify supported types in toPandas

2018-02-15 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-23446:


 Summary: Explicitly specify supported types in toPandas
 Key: SPARK-23446
 URL: https://issues.apache.org/jira/browse/SPARK-23446
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon


See:

{code}
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
df = spark.createDataFrame([[bytearray("a")]])
df.toPandas()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df.toPandas()
{code}


{code}
 _1
0  [97]
  _1
0  a
{code}

We didn't finish support for binary type in the Arrow conversion on the Python 
side. We should disallow it.
The same thing applies to nested timestamps. 
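
A rough sketch of the kind of pre-flight check this implies is below; the helper name and the exact error type are assumptions, not the actual PySpark implementation.

{code}
# Hypothetical pre-flight check (illustration only): reject types that the
# Arrow-based toPandas() path does not handle correctly yet.
from pyspark.sql.types import ArrayType, BinaryType, TimestampType


def _check_arrow_supported(schema):
    """Raise if the schema contains types the Arrow path does not support."""
    for field in schema:
        t = field.dataType
        if isinstance(t, BinaryType):
            raise TypeError("BinaryType is not supported with Arrow-based toPandas()")
        if isinstance(t, ArrayType) and isinstance(t.elementType, TimestampType):
            raise TypeError("nested TimestampType is not supported with Arrow-based toPandas()")
{code}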




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-15 Thread Maxim Gekk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366547#comment-16366547
 ] 

Maxim Gekk commented on SPARK-23410:


[~sameerag] It is not a blocker anymore. I unset the blocker flag.

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read JSON files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the Jackson library. We need to give users back the 
> ability to read JSON files in a specified charset and/or to detect the charset 
> automatically, as before.
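
A workaround sketch for the behavior described above (illustrative only; the path is a placeholder): decode each non-UTF-8 file outside the JSON reader and pass the decoded lines to {{spark.read.json}}.

{code}
// Workaround sketch, not a Spark API proposal: decode UTF-16 JSON files manually,
// then hand the decoded lines to the JSON reader as a Dataset[String].
import java.nio.charset.StandardCharsets

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val decodedLines = spark.sparkContext
  .binaryFiles("/path/to/utf16WithBOM.json")   // hypothetical path to the attached file
  .flatMap { case (_, stream) =>
    // Charset "UTF-16" handles the byte order mark during decoding.
    new String(stream.toArray(), StandardCharsets.UTF_16).split("\n")
  }

val df = spark.read.json(spark.createDataset(decodedLines))
{code}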



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-15 Thread Maxim Gekk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-23410:
---
Priority: Major  (was: Blocker)

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read JSON files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the Jackson library. We need to give users back the 
> ability to read JSON files in a specified charset and/or to detect the charset 
> automatically, as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-15 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366537#comment-16366537
 ] 

Sameer Agarwal commented on SPARK-23410:


[~maxgekk] [~smilegator] any ETA on this? As [~hyukjin.kwon] points out, given 
that https://github.com/apache/spark/pull/20302 is reverted, should we still 
block RC4 on this?

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Blocker
> Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read JSON files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the Jackson library. We need to give users back the 
> ability to read JSON files in a specified charset and/or to detect the charset 
> automatically, as before.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23445) ColumnStat refactoring

2018-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23445:


Assignee: (was: Apache Spark)

> ColumnStat refactoring
> --
>
> Key: SPARK-23445
> URL: https://issues.apache.org/jira/browse/SPARK-23445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Refactor ColumnStat to be more flexible.
>  * Split {{ColumnStat}} and {{CatalogColumnStat}} just like 
> {{CatalogStatistics}} is split from {{Statistics}}. This detaches how the 
> statistics are stored from how they are processed in the query plan. 
> {{CatalogColumnStat}} keeps {{min}} and {{max}} as {{String}}, making it not 
> depend on dataType information.
>  * For {{CatalogColumnStat}}, parse column names from property names in the 
> metastore ({{KEY_VERSION}} property), not from metastore schema. This allows 
> the catalog to read stats into {{CatalogColumnStat}}s even if the schema 
> itself is not in the metastore.
>  * Make all fields optional. {{min}}, {{max}} and {{histogram}} for columns 
> were optional already. Having them all optional is more consistent, and gives 
> flexibility to e.g. drop some of the fields through transformations if they 
> are difficult / impossible to calculate.
> The added flexibility will make it possible to have alternative 
> implementations for stats, and separates stats collection from stats and 
> estimation processing in plans.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23445) ColumnStat refactoring

2018-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366530#comment-16366530
 ] 

Apache Spark commented on SPARK-23445:
--

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/20624

> ColumnStat refactoring
> --
>
> Key: SPARK-23445
> URL: https://issues.apache.org/jira/browse/SPARK-23445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Refactor ColumnStat to be more flexible.
>  * Split {{ColumnStat}} and {{CatalogColumnStat}} just like 
> {{CatalogStatistics}} is split from {{Statistics}}. This detaches how the 
> statistics are stored from how they are processed in the query plan. 
> {{CatalogColumnStat}} keeps {{min}} and {{max}} as {{String}}, making it not 
> depend on dataType information.
>  * For {{CatalogColumnStat}}, parse column names from property names in the 
> metastore ({{KEY_VERSION}} property), not from metastore schema. This allows 
> the catalog to read stats into {{CatalogColumnStat}}s even if the schema 
> itself is not in the metastore.
>  * Make all fields optional. {{min}}, {{max}} and {{histogram}} for columns 
> were optional already. Having them all optional is more consistent, and gives 
> flexibility to e.g. drop some of the fields through transformations if they 
> are difficult / impossible to calculate.
> The added flexibility will make it possible to have alternative 
> implementations for stats, and separates stats collection from stats and 
> estimation processing in plans.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23445) ColumnStat refactoring

2018-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23445:


Assignee: Apache Spark

> ColumnStat refactoring
> --
>
> Key: SPARK-23445
> URL: https://issues.apache.org/jira/browse/SPARK-23445
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Juliusz Sompolski
>Assignee: Apache Spark
>Priority: Major
>
> Refactor ColumnStat to be more flexible.
>  * Split {{ColumnStat}} and {{CatalogColumnStat}} just like 
> {{CatalogStatistics}} is split from {{Statistics}}. This detaches how the 
> statistics are stored from how they are processed in the query plan. 
> {{CatalogColumnStat}} keeps {{min}} and {{max}} as {{String}}, making it not 
> depend on dataType information.
>  * For {{CatalogColumnStat}}, parse column names from property names in the 
> metastore ({{KEY_VERSION}} property), not from metastore schema. This allows 
> the catalog to read stats into {{CatalogColumnStat}}s even if the schema 
> itself is not in the metastore.
>  * Make all fields optional. {{min}}, {{max}} and {{histogram}} for columns 
> were optional already. Having them all optional is more consistent, and gives 
> flexibility to e.g. drop some of the fields through transformations if they 
> are difficult / impossible to calculate.
> The added flexibility will make it possible to have alternative 
> implementations for stats, and separates stats collection from stats and 
> estimation processing in plans.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23445) ColumnStat refactoring

2018-02-15 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-23445:
-

 Summary: ColumnStat refactoring
 Key: SPARK-23445
 URL: https://issues.apache.org/jira/browse/SPARK-23445
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Juliusz Sompolski


Refactor ColumnStat to be more flexible.
 * Split {{ColumnStat}} and {{CatalogColumnStat}} just like 
{{CatalogStatistics}} is split from {{Statistics}}. This detaches how the 
statistics are stored from how they are processed in the query plan. 
{{CatalogColumnStat}} keeps {{min}} and {{max}} as {{String}}, making it not 
depend on dataType information.
 * For {{CatalogColumnStat}}, parse column names from property names in the 
metastore ({{KEY_VERSION}} property), not from metastore schema. This allows 
the catalog to read stats into {{CatalogColumnStat}}s even if the schema itself 
is not in the metastore.
 * Make all fields optional. {{min}}, {{max}} and {{histogram}} for columns 
were optional already. Having them all optional is more consistent, and gives 
flexibility to e.g. drop some of the fields through transformations if they are 
difficult / impossible to calculate.

The added flexibility will make it possible to have alternative implementations 
for stats, and separates stats collection from stats and estimation processing 
in plans.
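
As a rough sketch of the split described above (field names here are assumptions, not the final API), the catalog-side representation might look like:

{code}
// Illustrative only: catalog-side column statistics with min/max kept as strings
// and every field optional, so stats can be stored without dataType information.
case class CatalogColumnStat(
    distinctCount: Option[BigInt] = None,
    min: Option[String] = None,        // interpreted with the column's dataType at planning time
    max: Option[String] = None,
    nullCount: Option[BigInt] = None,
    avgLen: Option[Long] = None,
    maxLen: Option[Long] = None,
    histogram: Option[String] = None)  // serialized histogram, if collected
{code}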



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage

2018-02-15 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366421#comment-16366421
 ] 

Shixiong Zhu commented on SPARK-23433:
--

cc [~irashid]

> java.lang.IllegalStateException: more than one active taskSet for stage
> ---
>
> Key: SPARK-23433
> URL: https://issues.apache.org/jira/browse/SPARK-23433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shixiong Zhu
>Priority: Major
>
> This following error thrown by DAGScheduler stopped the cluster:
> {code}
> 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: 
> DAGSchedulerEventProcessLoop failed; shutting down SparkContext
> java.lang.IllegalStateException: more than one active taskSet for stage 
> 7580621: 7580621.2,7580621.1
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage

2018-02-15 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366417#comment-16366417
 ] 

Shixiong Zhu commented on SPARK-23433:
--

{code}
18/02/11 13:22:20 INFO TaskSetManager: Finished task 17.0 in stage 7580621.1 
(TID 65577139) in 303870 ms on 10.0.246.111 (executor 24) (18/19)
18/02/11 13:22:20 INFO DAGScheduler: ShuffleMapStage 7580621 (start at 
command-2841337:340) finished in 303.880 s
18/02/11 13:22:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 7580621 
(start at command-2841337:340) because some of its tasks had failed: 2, 15, 27, 
28, 41
18/02/11 13:22:27 INFO DAGScheduler: Submitting ShuffleMapStage 7580621 
(MapPartitionsRDD[2660062] at start at command-2841337:340), which has no 
missing parents
18/02/11 13:22:27 INFO DAGScheduler: Submitting 5 missing tasks from 
ShuffleMapStage 7580621 (MapPartitionsRDD[2660062] at start at 
command-2841337:340) (first 15 tasks are for partitions Vector(2, 15, 27, 28, 
41))
18/02/11 13:22:27 INFO TaskSchedulerImpl: Adding task set 7580621.2 with 5 tasks
18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: 
DAGSchedulerEventProcessLoop failed; shutting down SparkContext
java.lang.IllegalStateException: more than one active taskSet for stage 
7580621: 7580621.2,7580621.1
at 
org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229)
at 
org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
18/02/11 13:22:27 INFO TaskSchedulerImpl: Cancelling stage 7580621
18/02/11 13:22:27 INFO TaskSchedulerImpl: Cancelling stage 7580621
18/02/11 13:22:27 INFO TaskSchedulerImpl: Stage 7580621 was cancelled
18/02/11 13:22:27 INFO DAGScheduler: ShuffleMapStage 7580621 (start at 
command-2841337:340) failed in 0.057 s due to Job aborted due to stage failure: 
Stage 7580621 cancelled
org.apache.spark.SparkException: Job aborted due to stage failure: Stage 
7580621 cancelled
18/02/11 13:22:27 WARN TaskSetManager: Lost task 18.0 in stage 7580621.1 (TID 
65577140, 10.0.144.170, executor 16): TaskKilled (Stage cancelled)
{code}

According to the above logs, I think the issue is in this line: 
https://github.com/apache/spark/blob/1dc2c1d5e85c5f404f470aeb44c1f3c22786bdea/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1281

"Task 18.0 in stage 7580621.0" finished and updated 
"shuffleStage.pendingPartitions" when "Task 18.0 in stage 7580621.1" was still 
running. Hence, when 18 of 19 tasks finished in "stage 7580621.1", this 
condition 
(https://github.com/apache/spark/blob/1dc2c1d5e85c5f404f470aeb44c1f3c22786bdea/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1284)
 would be true and trigger "stage 7580621.2".



> java.lang.IllegalStateException: more than one active taskSet for stage
> ---
>
> Key: SPARK-23433
> URL: https://issues.apache.org/jira/browse/SPARK-23433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shixiong Zhu
>Priority: Major
>
> This following error thrown by DAGScheduler stopped the cluster:
> {code}
> 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: 
> DAGSchedulerEventProcessLoop failed; shutting down SparkContext
> java.lang.IllegalStateException: more than one active taskSet for stage 
> 7580621: 7580621.2,7580621.1
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage

[jira] [Commented] (SPARK-23368) Avoid unnecessary Exchange or Sort after projection

2018-02-15 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366374#comment-16366374
 ] 

Maryann Xue commented on SPARK-23368:
-

[~cloud_fan], [~smilegator], Could you please help review this PR? Thanks in 
advance!

> Avoid unnecessary Exchange or Sort after projection
> ---
>
> Key: SPARK-23368
> URL: https://issues.apache.org/jira/browse/SPARK-23368
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Priority: Minor
>
> After a column-renaming projection, the ProjectExec's outputOrdering and 
> outputPartitioning should reflect the projected (renamed) columns as well. For example,
> {code:java}
> SELECT b1
> FROM (
> SELECT a a1, b b1
> FROM testData2
> ORDER BY a
> )
> ORDER BY a1{code}
> The inner query is ordered on a1 as well. If we had a rule to eliminate a Sort 
> over an already-sorted result, then together with this fix the order-by in the 
> outer query could have been optimized out.
>  
> Similarly, the below query
> {code:java}
> SELECT *
> FROM (
> SELECT t1.a a1, t2.a a2, t1.b b1, t2.b b2
> FROM testData2 t1
> LEFT JOIN testData2 t2
> ON t1.a = t2.a
> )
> JOIN testData2 t3
> ON a1 = t3.a{code}
> is equivalent to
> {code:java}
> SELECT *
> FROM testData2 t1
> LEFT JOIN testData2 t2
> ON t1.a = t2.a
> JOIN testData2 t3
> ON t1.a = t3.a{code}
> , so the unnecessary sorting and hash-partitioning that have been optimized 
> out for the second query should also be eliminated in the first query.
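
One way to picture the idea described above is a helper that rewrites a child's output ordering through the projection's rename map, so an ordering on {{a}} survives as an ordering on {{a1}}. This is a simplified sketch against Catalyst's expression classes, not the actual patch.

{code}
// Simplified sketch, not the actual implementation: carry output ordering through
// simple column renames (an Alias over an AttributeReference) in a projection.
import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeReference, NamedExpression, SortOrder}

def orderingAfterProject(
    childOrdering: Seq[SortOrder],
    projectList: Seq[NamedExpression]): Seq[SortOrder] = {
  // Map each child attribute to the attribute it is renamed to by the projection.
  val aliasMap: Map[Attribute, Attribute] = projectList.collect {
    case a @ Alias(ref: AttributeReference, _) => (ref: Attribute) -> a.toAttribute
  }.toMap
  childOrdering.flatMap { order =>
    order.child match {
      case attr: Attribute if aliasMap.contains(attr) =>
        Some(order.copy(child = aliasMap(attr)))
      case _ => None // drop orderings we cannot express over the projected output
    }
  }
}
{code}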



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23444) would like to be able to cancel jobs cleanly

2018-02-15 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23444:
---

 Summary: would like to be able to cancel jobs cleanly
 Key: SPARK-23444
 URL: https://issues.apache.org/jira/browse/SPARK-23444
 Project: Spark
  Issue Type: Wish
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Jose Torres


In Structured Streaming, we often need to cancel a Spark job in order to close 
the stream. SparkContext does not (as far as I can tell) provide a runJob 
handle which cleanly signals when a job was cancelled; it simply throws a 
generic SparkException. So we're forced to awkwardly parse this SparkException 
in order to determine whether the job failed because of a cancellation (which 
we expect and want to swallow) or another error (which we want to propagate).
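
A rough sketch of the awkward workaround described above follows; the message check is an assumption, since Spark does not expose a dedicated cancellation signal here.

{code}
// Illustration of the string-matching workaround, not a Spark API: decide whether a
// failed job was cancelled (expected when closing the stream) or genuinely failed.
import org.apache.spark.SparkException

def failedBecauseCancelled(e: SparkException): Boolean =
  Option(e.getMessage).exists(_.toLowerCase.contains("cancelled"))

def runSwallowingCancellation(runJob: => Unit): Unit =
  try {
    runJob
  } catch {
    case e: SparkException if failedBecauseCancelled(e) => () // expected while closing the stream
    case e: SparkException => throw e                         // genuine failure: propagate
  }
{code}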



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-02-15 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366277#comment-16366277
 ] 

Bago Amirbekian commented on SPARK-23265:
-

What's the status of this? Will this be a change in behaviour?

> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}) an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param for 
> {{numBuckets}} when transforming multiple columns, since that is then applied 
> to all columns.
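
For example, the acceptable mixed usage mentioned above would look roughly like this (a sketch against the 2.3 multi-column API; column names are illustrative):

{code}
// Multi-column QuantileDiscretizer where the single-column numBuckets
// param is applied to every input column.
import org.apache.spark.ml.feature.QuantileDiscretizer

val discretizer = new QuantileDiscretizer()
  .setInputCols(Array("hour", "clicks"))
  .setOutputCols(Array("hourBucket", "clicksBucket"))
  .setNumBuckets(5) // single-column param, applied to all columns
{code}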



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22913) Hive Partition Pruning, Fractional and Timestamp types

2018-02-15 Thread Ameen Tayyebi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ameen Tayyebi resolved SPARK-22913.
---
Resolution: Won't Fix

Resolving in favor of native Glue integration. These advanced predicates can't 
be supported because the version of Hive embedded in Spark does not support 
them.

> Hive Partition Pruning, Fractional and Timestamp types
> --
>
> Key: SPARK-22913
> URL: https://issues.apache.org/jira/browse/SPARK-22913
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ameen Tayyebi
>Priority: Major
>
> Spark currently pushes the predicates it has in the SQL query to Hive 
> Metastore. This only applies to predicates that are placed on top of 
> partitioning columns. As more and more hive metastore implementations come 
> around, this is an important optimization to allow data to be prefiltered to 
> only relevant partitions. Consider the following example:
> Table:
> create external table data (key string, quantity long)
> partitioned by (processing-date timestamp)
> Query:
> select * from data where processing-date = '2017-10-23 00:00:00'
> Currently, no filters will be pushed to the Hive metastore for the above 
> query. The reason is that the code that computes the predicates to be sent to 
> the Hive metastore only deals with integral and string column types. It 
> doesn't know how to handle fractional and timestamp columns.
> I have tables in my metastore (AWS Glue) with millions of partitions of type 
> timestamp and double. In my specific case, it takes Spark's master node about 
> 6.5 minutes to download all partitions for the table, and then filter the 
> partitions client-side. The actual processing time of my query is only 6 
> seconds. In other words, without partition pruning, I'm looking at 6.5 
> minutes of processing and with partition pruning, I'm looking at 6 seconds 
> only.
> I have a fix for this developed locally that I'll provide shortly as a pull 
> request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23443) Spark with Glue as external catalog

2018-02-15 Thread Ameen Tayyebi (JIRA)
Ameen Tayyebi created SPARK-23443:
-

 Summary: Spark with Glue as external catalog
 Key: SPARK-23443
 URL: https://issues.apache.org/jira/browse/SPARK-23443
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Ameen Tayyebi


AWS Glue Catalog is an external Hive metastore backed by a web service. It 
allows permanent storage of catalog data for BigData use cases.

To find out more information about AWS Glue, please consult:
 * AWS Glue - [https://aws.amazon.com/glue/]
 * Using Glue as a Metastore catalog for Spark - 
[https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html]

Today, the integration of Glue and Spark is through the Hive layer. Glue 
implements the IMetaStore interface of Hive and for installations of Spark that 
contain Hive, Glue can be used as the metastore.

The feature set that Glue supports does not align 1-1 with the set of features 
that the latest version of Spark supports. For example, the Glue interface supports 
more advanced partition pruning than the latest version of Hive embedded in 
Spark does.

To enable a more natural integration with Spark and to allow leveraging the latest 
features of Glue without being coupled to Hive, a direct integration through 
Spark's own Catalog API is proposed. This JIRA tracks that work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work

2018-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366252#comment-16366252
 ] 

Apache Spark commented on SPARK-23413:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/20623

> Sorting tasks by Host / Executor ID on the Stage page does not work
> ---
>
> Key: SPARK-23413
> URL: https://issues.apache.org/jira/browse/SPARK-23413
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Blocker
> Fix For: 2.3.0
>
>
> Sorting tasks by Host / Executor ID throws exceptions: 
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Executor ID at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> {code}
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Host at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
> {code}
>   !image-2018-02-13-16-50-32-600.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work

2018-02-15 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-23413:
-
Affects Version/s: (was: 2.4.0)

> Sorting tasks by Host / Executor ID on the Stage page does not work
> ---
>
> Key: SPARK-23413
> URL: https://issues.apache.org/jira/browse/SPARK-23413
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Blocker
> Fix For: 2.3.0
>
>
> Sorting tasks by Host / Executor ID throws exceptions: 
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Executor ID at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> {code}
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Host at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
> {code}
>   !image-2018-02-13-16-50-32-600.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work

2018-02-15 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366199#comment-16366199
 ] 

Imran Rashid commented on SPARK-23413:
--

This was fixed by https://github.com/apache/spark/pull/20601 in master, and 
https://github.com/apache/spark/pull/20623 in branch-2.3 (because of a mistake 
I made while merging)

> Sorting tasks by Host / Executor ID on the Stage page does not work
> ---
>
> Key: SPARK-23413
> URL: https://issues.apache.org/jira/browse/SPARK-23413
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Blocker
> Fix For: 2.3.0
>
>
> Sorting tasks by Host / Executor ID throws exceptions: 
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Executor ID at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> {code}
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Host at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
> {code}
>   !image-2018-02-13-16-50-32-600.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work

2018-02-15 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-23413.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20623
[https://github.com/apache/spark/pull/20623]

> Sorting tasks by Host / Executor ID on the Stage page does not work
> ---
>
> Key: SPARK-23413
> URL: https://issues.apache.org/jira/browse/SPARK-23413
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Blocker
> Fix For: 2.3.0
>
>
> Sorting tasks by Host / Executor ID throws exceptions: 
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Executor ID at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> {code}
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Host at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
> {code}
>   !image-2018-02-13-16-50-32-600.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work

2018-02-15 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-23413:


Assignee: Attila Zsolt Piros

> Sorting tasks by Host / Executor ID on the Stage page does not work
> ---
>
> Key: SPARK-23413
> URL: https://issues.apache.org/jira/browse/SPARK-23413
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Blocker
>
> Sorting tasks by Host / Executor ID throws exceptions: 
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Executor ID at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> {code}
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Host at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
> {code}
>   !image-2018-02-13-16-50-32-600.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable

2018-02-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-23173:
-
Labels: release-notes  (was: )

> from_json can produce nulls for fields which are marked as non-nullable
> ---
>
> Key: SPARK-23173
> URL: https://issues.apache.org/jira/browse/SPARK-23173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Herman van Hovell
>Priority: Major
>  Labels: release-notes
>
> The {{from_json}} function uses a schema to convert a string into a Spark SQL 
> struct. This schema can contain non-nullable fields. The underlying 
> {{JsonToStructs}} expression does not check if a resulting struct respects 
> the nullability of the schema. This leads to very weird problems in consuming 
> expressions. In our case parquet writing would produce an illegal parquet 
> file.
> There are roughly two solutions here:
>  # Assume that each field in schema passed to {{from_json}} is nullable, and 
> ignore the nullability information set in the passed schema.
>  # Validate the object during runtime, and fail execution if the data is null 
> where we are not expecting this.
> I currently am slightly in favor of option 1, since this is the more 
> performant option and a lot easier to do.
> WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]
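
A minimal reproduction sketch of the concern (illustrative; assumes an active SparkSession {{spark}} and {{import spark.implicits._}}):

{code}
// A non-nullable field in the from_json schema, fed JSON that omits it: the parsed
// struct still reports a null, even though the schema says null is impossible.
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}

val schema = new StructType().add("a", IntegerType, nullable = false)

val parsed = Seq("""{"b": 1}""").toDF("json")
  .select(from_json($"json", schema).as("parsed"))
parsed.show() // parsed.a is null despite the non-nullable schema
{code}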



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-15 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366162#comment-16366162
 ] 

Igor Berman edited comment on SPARK-23423 at 2/15/18 7:30 PM:
--

[~skonto] I'll look at it on Sunday, probably. I'm not losing agents.


was (Author: igor.berman):
[~skonto] I'll look at it on Sunday, probably

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
>
> Hi,
> Mesos version: 1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation enabled and the maximum number of 
> executors limited by spark.dynamicAllocation.maxExecutors.
> Suppose we have a long-running driver with a cyclic pattern of resource 
> consumption (with some idle time in between); due to dynamic allocation it 
> receives offers and then releases them after the current chunk of work is processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit, and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum, where slaves 
> holds all slaves ever "met", i.e. both active and killed (see the comment at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, the number of taskIDs should be updated by statusUpdate, 
> but suppose that update is lost (indeed, I don't see any 'is now 
> TASK_KILLED' log lines), so this count of executors might be wrong.
>  
> I've created a test that "reproduces" this behavior; not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: don't set maxExecutors when dynamic allocation is on.
>  
> Please advise.
> Igor
> Tagging you, friends, since you were the last to touch this piece of code and 
> can probably advise ([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-15 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366162#comment-16366162
 ] 

Igor Berman commented on SPARK-23423:
-

[~skonto] I'll look at it on Sunday, probably

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
>
> Hi,
> Mesos version: 1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation enabled and the maximum number of 
> executors limited by spark.dynamicAllocation.maxExecutors.
> Suppose we have a long-running driver with a cyclic pattern of resource 
> consumption (with some idle time in between); due to dynamic allocation it 
> receives offers and then releases them after the current chunk of work is processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit, and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum, where slaves 
> holds all slaves ever "met", i.e. both active and killed (see the comment at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, the number of taskIDs should be updated by statusUpdate, 
> but suppose that update is lost (indeed, I don't see any 'is now 
> TASK_KILLED' log lines), so this count of executors might be wrong.
>  
> I've created a test that "reproduces" this behavior; not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: don't set maxExecutors when dynamic allocation is on.
>  
> Please advise.
> Igor
> Tagging you, friends, since you were the last to touch this piece of code and 
> can probably advise ([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23377.
---
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.1

Resolved in master and branch-2.3 with 
https://github.com/apache/spark/pull/20594

> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Assignee: Liang-Chi Hsieh
>Priority: Blocker
> Fix For: 2.3.1, 2.4.0
>
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the 
> default value on write -> read, which causes it to throw an error on 
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17)
> {code}
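
For builds that don't yet include the fix, a hedged workaround sketch: rebuild 
the transformer from the loaded instance, copying only the multi-column params. 
This assumes the standard getSplitsArray/getInputCols/getOutputCols getters are 
available; `path` and `data` refer to the example above.

{code:java}
// Workaround sketch, not the official fix: reconstruct a fresh Bucketizer so
// that no spurious single-column params (inputCol/outputCol) are carried over
// from persistence.
import org.apache.spark.ml.feature.Bucketizer

val loaded = Bucketizer.read.load(path)
val rebuilt = new Bucketizer(loaded.uid)
  .setSplitsArray(loaded.getSplitsArray)
  .setInputCols(loaded.getInputCols)
  .setOutputCols(loaded.getOutputCols)

rebuilt.transform(data)   // no exclusive-params check failure expected here
{code}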



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23430) Cannot sort "Executor ID" or "Host" columns in the task table

2018-02-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-23430.
--
Resolution: Duplicate

> Cannot sort "Executor ID" or "Host" columns in the task table
> -
>
> Key: SPARK-23430
> URL: https://issues.apache.org/jira/browse/SPARK-23430
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
>  Labels: regression
>
> Click the "Executor ID" or "Host" header in the task table and it will fail:
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Executor ID
>   at org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1009)
>   at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:686)
>   at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61)
>   at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96)
>   at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:700)
>   at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293)
>   at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282)
>   at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98)
>   at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98)
>   at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.eclipse.jetty.server.Server.handle(Server.java:534)
>   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
>   at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>   at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>   at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>   at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>   at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases

2018-02-15 Thread Pranav Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pranav Rao updated SPARK-23442:
---
Description: 
Through the DataFrameWriter[T] interface I have created an external Hive table 
with 5000 (horizontal) partitions and 50 buckets in each partition. Overall the 
dataset is 600GB and the provider is Parquet.

Now this works great when joining with a similarly bucketed dataset - it's able 
to avoid a shuffle. 

But any action on this DataFrame (from _spark.table("tablename")_) works with 
only 50 RDD partitions. This is happening because of 
[createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala].
 So the 600GB dataset is only read through 50 tasks, which makes this 
partitioning + bucketing scheme not useful.

I cannot expose the base directory of the parquet folder for reading the 
dataset, because the partition locations don't follow a (basePath + partSpec) 
format.

Meanwhile, are there workarounds to use higher parallelism while reading such a 
table? 
 Let me know if I can help in any way.

  was:
Through the DataFrameWriter[T] interface I have created a external HIVE table 
with 5000 (horizontal) partitions and 50 buckets in each partition. Overall the 
dataset is 600GB and the provider is Parquet.

Now this works great when joining with a similarly bucketed dataset - it's able 
to avoid a shuffle. 

But any action on this Dataframe(from _spark.table("tablename")_), works with 
only 50 RDD partitions. This is happening because of 
[createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala].
 So the 600GB dataset is only read through 50 tasks, which makes this 
partitioning + bucketing scheme not useful at all.

I cannot expose the base directory of the parquet folder for reading the 
dataset, because the partition locations don't follow a (basePath + partSpec) 
format.

Meanwhile, are there workarounds to use higher parallelism while reading such a 
table? 
Let me know if I can help in any way.


> Reading from partitioned and bucketed table uses only bucketSpec.numBuckets 
> partitions in all cases
> ---
>
> Key: SPARK-23442
> URL: https://issues.apache.org/jira/browse/SPARK-23442
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Pranav Rao
>Priority: Major
>
> Through the DataFrameWriter[T] interface I have created an external Hive table 
> with 5000 (horizontal) partitions and 50 buckets in each partition. Overall 
> the dataset is 600GB and the provider is Parquet.
> Now this works great when joining with a similarly bucketed dataset - it's 
> able to avoid a shuffle. 
> But any action on this DataFrame (from _spark.table("tablename")_) works with 
> only 50 RDD partitions. This is happening because of 
> [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala].
>  So the 600GB dataset is only read through 50 tasks, which makes this 
> partitioning + bucketing scheme not useful.
> I cannot expose the base directory of the parquet folder for reading the 
> dataset, because the partition locations don't follow a (basePath + partSpec) 
> format.
> Meanwhile, are there workarounds to use higher parallelism while reading such 
> a table? 
>  Let me know if I can help in any way.
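
On the workaround question: until the scan itself can produce more than 
bucketSpec.numBuckets partitions, one hedged option is to widen parallelism 
right after the read. It reintroduces a shuffle, so it gives up the join 
benefit of bucketing for that job; the table name, column name, and target 
parallelism below are placeholders.

{code:java}
// Sketch only: trade bucketing's shuffle avoidance for read/compute parallelism.
val df = spark.table("tablename")      // scanned with 50 tasks (one per bucket)
val widened = df.repartition(1000)     // subsequent stages run with 1000 tasks
widened.groupBy("some_column").count().show()
{code}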



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases

2018-02-15 Thread Pranav Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pranav Rao updated SPARK-23442:
---
Description: 
Through the DataFrameWriter[T] interface I have created an external Hive table 
with 5000 (horizontal) partitions and 50 buckets in each partition. Overall the 
dataset is 600GB and the provider is Parquet.

Now this works great when joining with a similarly bucketed dataset - it's able 
to avoid a shuffle. 

But any action on this DataFrame (from _spark.table("tablename")_) works with 
only 50 RDD partitions. This is happening because of 
[createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala].
 So the 600GB dataset is only read through 50 tasks, which makes this 
partitioning + bucketing scheme not useful at all.

I cannot expose the base directory of the parquet folder for reading the 
dataset, because the partition locations don't follow a (basePath + partSpec) 
format.

Meanwhile, are there workarounds to use higher parallelism while reading such a 
table? 
Let me know if I can help in any way.

  was:
Through the DataFrameWriter[T] interface I have created a external HIVE table 
with 5000 (horizontal) partitions and 50 buckets in each partition. Overall the 
dataset is 600GB and the provider is Parquet.

Now this works great when joining with a similarly bucketed dataset - it's able 
to avoid a shuffle. 

But any action on this Dataframe(from _spark.table("tablename")_), works with 
only 50 RDD partitions. This is happening because of 
[createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala].
 So the 600GB dataset is only read through 50 tasks, which makes this 
partitioning + bucketing scheme not useful at all.

I cannot expose the base directory of the parquet folder for reading the 
dataset, because the partition locations don't follow a (basePath + partSpec) 
format.

Meanwhile, are there workarounds to use higher parallelism while reading such a 
table? Let me know if we


> Reading from partitioned and bucketed table uses only bucketSpec.numBuckets 
> partitions in all cases
> ---
>
> Key: SPARK-23442
> URL: https://issues.apache.org/jira/browse/SPARK-23442
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Pranav Rao
>Priority: Major
>
> Through the DataFrameWriter[T] interface I have created an external Hive table 
> with 5000 (horizontal) partitions and 50 buckets in each partition. Overall 
> the dataset is 600GB and the provider is Parquet.
> Now this works great when joining with a similarly bucketed dataset - it's 
> able to avoid a shuffle. 
> But any action on this DataFrame (from _spark.table("tablename")_) works with 
> only 50 RDD partitions. This is happening because of 
> [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala].
>  So the 600GB dataset is only read through 50 tasks, which makes this 
> partitioning + bucketing scheme not useful at all.
> I cannot expose the base directory of the parquet folder for reading the 
> dataset, because the partition locations don't follow a (basePath + partSpec) 
> format.
> Meanwhile, are there workarounds to use higher parallelism while reading such 
> a table? 
> Let me know if I can help in any way.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases

2018-02-15 Thread Pranav Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pranav Rao updated SPARK-23442:
---
Environment: (was: spark.sql("SET spark.default.parallelism=1000") 


{{spark.sql("set spark.sql.shuffle.partitions=500") }}

{{spark.sql("set spark.sql.files.maxPartitionBytes=134217728")}}

{{-}}

{{$ hdfs getconf -confKey mapreduce.input.fileinputformat.split.minsize}}
0 

$ hdfs getconf -confKey dfs.blocksize
 134217728 

$ hdfs getconf -confKey mapreduce.job.maps
 32)

> Reading from partitioned and bucketed table uses only bucketSpec.numBuckets 
> partitions in all cases
> ---
>
> Key: SPARK-23442
> URL: https://issues.apache.org/jira/browse/SPARK-23442
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Pranav Rao
>Priority: Major
>
> Through the DataFrameWriter[T] interface I have created an external Hive table 
> with 5000 (horizontal) partitions and 50 buckets in each partition. Overall 
> the dataset is 600GB and the provider is Parquet.
> Now this works great when joining with a similarly bucketed dataset - it's 
> able to avoid a shuffle. 
> But any action on this DataFrame (from _spark.table("tablename")_) works with 
> only 50 RDD partitions. This is happening because of 
> [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala].
>  So the 600GB dataset is only read through 50 tasks, which makes this 
> partitioning + bucketing scheme not useful at all.
> I cannot expose the base directory of the parquet folder for reading the 
> dataset, because the partition locations don't follow a (basePath + partSpec) 
> format.
> Meanwhile, are there workarounds to use higher parallelism while reading such 
> a table? Let me know if we



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases

2018-02-15 Thread Pranav Rao (JIRA)
Pranav Rao created SPARK-23442:
--

 Summary: Reading from partitioned and bucketed table uses only 
bucketSpec.numBuckets partitions in all cases
 Key: SPARK-23442
 URL: https://issues.apache.org/jira/browse/SPARK-23442
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.2.1
 Environment: spark.sql("SET spark.default.parallelism=1000") 

{{spark.sql("set spark.sql.shuffle.partitions=500") }}

{{spark.sql("set spark.sql.files.maxPartitionBytes=134217728")}}

{{-}}

{{$ hdfs getconf -confKey mapreduce.input.fileinputformat.split.minsize}}
0 

$ hdfs getconf -confKey dfs.blocksize
 134217728 

$ hdfs getconf -confKey mapreduce.job.maps
 32
Reporter: Pranav Rao


Through the DataFrameWriter[T] interface I have created an external Hive table 
with 5000 (horizontal) partitions and 50 buckets in each partition. Overall the 
dataset is 600GB and the provider is Parquet.

Now this works great when joining with a similarly bucketed dataset - it's able 
to avoid a shuffle. 

But any action on this DataFrame (from _spark.table("tablename")_) works with 
only 50 RDD partitions. This is happening because of 
[createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala].
 So the 600GB dataset is only read through 50 tasks, which makes this 
partitioning + bucketing scheme not useful at all.

I cannot expose the base directory of the parquet folder for reading the 
dataset, because the partition locations don't follow a (basePath + partSpec) 
format.

Meanwhile, are there workarounds to use higher parallelism while reading such a 
table? Let me know if we



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-23377:
-

Assignee: Liang-Chi Hsieh

> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Assignee: Liang-Chi Hsieh
>Priority: Blocker
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the 
> default value on write -> read, which causes it to throw an error on 
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner

2018-02-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365991#comment-16365991
 ] 

Piotr Kołaczkowski commented on SPARK-14540:


Any progress on this? Are you planning to finalize this by the time Scala 2.13 
is stable?

> Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
> 
>
> Key: SPARK-14540
> URL: https://issues.apache.org/jira/browse/SPARK-14540
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>Priority: Major
>
> Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running 
> ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures:
> {code}
> [info] - toplevel return statements in closures are identified at cleaning 
> time *** FAILED *** (32 milliseconds)
> [info]   Expected exception 
> org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no 
> exception was thrown. (ClosureCleanerSuite.scala:57)
> {code}
> and
> {code}
> [info] - user provided closures are actually cleaned *** FAILED *** (56 
> milliseconds)
> [info]   Expected ReturnStatementInClosureException, but got 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not 
> serializable: java.io.NotSerializableException: java.lang.Object
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class 
> org.apache.spark.util.TestUserClosuresActuallyCleaned$, 
> functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I,
>  implementation=invokeStatic 
> org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I,
>  instantiatedMethodType=(I)I, numCaptured=1])
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, 
> functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  
> instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  numCaptured=1])
> [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: 
> "f", type: "interface scala.Function3")
> [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", 
> MapPartitionsRDD[2] at apply at Transformer.scala:22)
> [info]- field (class "scala.Tuple2", name: "_1", type: "class 
> java.lang.Object")
> [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at 
> apply at 
> Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)).
> [info]   This means the closure provided by user is not actually cleaned. 
> (ClosureCleanerSuite.scala:78)
> {code}
> We'll need to figure out a closure cleaning strategy which works for 2.12 
> lambdas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23441) Remove interrupts from ContinuousExecution

2018-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23441:


Assignee: (was: Apache Spark)

> Remove interrupts from ContinuousExecution
> --
>
> Key: SPARK-23441
> URL: https://issues.apache.org/jira/browse/SPARK-23441
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
>
> The reason StreamExecution interrupts the query execution thread is that, for 
> the microbatch case, nontrivial work goes on in that thread to construct a 
> batch. In ContinuousExecution, this doesn't apply. Once the state is flipped 
> from ACTIVE and the underlying job is cancelled, the query execution thread 
> will immediately go to cleanup. So we don't need to call 
> queryExecutionThread.interrupt() at all there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23441) Remove interrupts from ContinuousExecution

2018-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365984#comment-16365984
 ] 

Apache Spark commented on SPARK-23441:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/20622

> Remove interrupts from ContinuousExecution
> --
>
> Key: SPARK-23441
> URL: https://issues.apache.org/jira/browse/SPARK-23441
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
>
> The reason StreamExecution interrupts the query execution thread is that, for 
> the microbatch case, nontrivial work goes on in that thread to construct a 
> batch. In ContinuousExecution, this doesn't apply. Once the state is flipped 
> from ACTIVE and the underlying job is cancelled, the query execution thread 
> will immediately go to cleanup. So we don't need to call 
> queryExecutionThread.interrupt() at all there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23441) Remove interrupts from ContinuousExecution

2018-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23441:


Assignee: Apache Spark

> Remove interrupts from ContinuousExecution
> --
>
> Key: SPARK-23441
> URL: https://issues.apache.org/jira/browse/SPARK-23441
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Minor
>
> The reason StreamExecution interrupts the query execution thread is that, for 
> the microbatch case, nontrivial work goes on in that thread to construct a 
> batch. In ContinuousExecution, this doesn't apply. Once the state is flipped 
> from ACTIVE and the underlying job is cancelled, the query execution thread 
> will immediately go to cleanup. So we don't need to call 
> queryExecutionThread.interrupt() at all there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23441) Remove interrupts from ContinuousExecution

2018-02-15 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23441:
---

 Summary: Remove interrupts from ContinuousExecution
 Key: SPARK-23441
 URL: https://issues.apache.org/jira/browse/SPARK-23441
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres


The reason StreamExecution interrupts the query execution thread is that, for 
the microbatch case, nontrivial work goes on in that thread to construct a 
batch. In ContinuousExecution, this doesn't apply. Once the state is flipped 
from ACTIVE and the underlying job is cancelled, the query execution thread 
will immediately go to cleanup. So we don't need to call 
queryExecutionThread.interrupt() at all there.
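
A hedged illustration of the point above (not ContinuousExecution's actual 
implementation; the class, state values, and runId handling are placeholders): 
stop by flipping the state and cancelling the job group, and let the query 
execution thread reach cleanup on its own instead of interrupting it.

{code:java}
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.SparkContext

object ToyState extends Enumeration { val ACTIVE, TERMINATED = Value }

// Placeholder continuous query wrapper.
class ToyContinuousQuery(sc: SparkContext, runId: String) {
  private val state = new AtomicReference(ToyState.ACTIVE)

  def isActive: Boolean = state.get() == ToyState.ACTIVE

  def stop(): Unit = {
    // Flip the state first so the run loop treats the job cancellation as a
    // clean stop rather than an error ...
    state.set(ToyState.TERMINATED)
    // ... then cancel the underlying continuous job. The query execution
    // thread falls out of its loop and proceeds to cleanup without interrupt().
    sc.cancelJobGroup(runId)
  }
}
{code}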



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23440) Clean up StreamExecution interrupts

2018-02-15 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23440:
---

 Summary: Clean up StreamExecution interrupts
 Key: SPARK-23440
 URL: https://issues.apache.org/jira/browse/SPARK-23440
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres


StreamExecution currently heavily leverages interrupt() to stop the query 
execution thread. But the query execution thread is sometimes in the middle of 
a context that will wrap or convert the InterruptedException, so we maintain a 
whitelist of exceptions that we think indicate an exception caused by stop 
rather than an error condition. This is awkward and probably fragile.
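
For concreteness, a hedged sketch of the kind of check this describes 
(illustrative only, not the actual StreamExecution whitelist): walk the cause 
chain of a caught throwable and treat interrupt-shaped exceptions as a clean 
stop, but only when a stop was actually requested.

{code:java}
import java.nio.channels.ClosedByInterruptException

// Returns true if the throwable (or anything in its cause chain) looks like
// the thread interrupt surfacing in a wrapped or converted form.
def looksLikeInterrupt(t: Throwable): Boolean = t match {
  case null => false
  case _: InterruptedException | _: ClosedByInterruptException => true
  case other => looksLikeInterrupt(other.getCause)
}

// Usage sketch: only swallow the exception when a stop was actually requested.
def handleFailure(stopRequested: Boolean, e: Throwable): Unit =
  if (stopRequested && looksLikeInterrupt(e)) () /* clean shutdown */ else throw e
{code}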



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23436) Incorrect Date column Inference in partition discovery

2018-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23436:


Assignee: Apache Spark

> Incorrect Date column Inference in partition discovery
> --
>
> Key: SPARK-23436
> URL: https://issues.apache.org/jira/browse/SPARK-23436
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Apoorva Sareen
>Assignee: Apache Spark
>Priority: Major
>
> If a partition column contains a partial date/timestamp,
>     for example: 2018-01-01-23 
> where the value is truncated to the hour, then the data type of the 
> partitioning column is automatically inferred as date; however, the values 
> are loaded as null. 
> Here is example code to reproduce this behaviour
>  
>  
> {code:java}
> val data = Seq(("1", "2018-01", "2018-01-01-04", "test")).toDF("id", 
> "date_month", "data_hour", "data")  
> data.write.partitionBy("id","date_month","data_hour").parquet("output/test")
> val input = spark.read.parquet("output/test")  
> input.printSchema()
> input.show()
> ## Result ###
> root
> |-- data: string (nullable = true)
> |-- id: integer (nullable = true)
> |-- date_month: string (nullable = true)
> |-- data_hour: date (nullable = true)
> ++---+--+-+
> |data| id|date_month|data_hour|
> ++---+--+-+
> |test|  1|   2018-01|     null|
> ++---+--+-+{code}
>  
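
A hedged workaround sketch for the reproduction above: turn off partition 
column type inference so that values like 2018-01-01-04 stay strings instead of 
being coerced to (null) dates. This assumes the standard 
spark.sql.sources.partitionColumnTypeInference.enabled setting; the path reuses 
the example above.

{code:java}
// Sketch: read the same output with partition type inference disabled.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
val inputAsStrings = spark.read.parquet("output/test")
inputAsStrings.printSchema()   // data_hour (and id, date_month) now come back as strings
inputAsStrings.show()
{code}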



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23436) Incorrect Date column Inference in partition discovery

2018-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23436:


Assignee: (was: Apache Spark)

> Incorrect Date column Inference in partition discovery
> --
>
> Key: SPARK-23436
> URL: https://issues.apache.org/jira/browse/SPARK-23436
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Apoorva Sareen
>Priority: Major
>
> If a partition column contains a partial date/timestamp,
>     for example: 2018-01-01-23 
> where the value is truncated to the hour, then the data type of the 
> partitioning column is automatically inferred as date; however, the values 
> are loaded as null. 
> Here is example code to reproduce this behaviour
>  
>  
> {code:java}
> val data = Seq(("1", "2018-01", "2018-01-01-04", "test")).toDF("id", 
> "date_month", "data_hour", "data")  
> data.write.partitionBy("id","date_month","data_hour").parquet("output/test")
> val input = spark.read.parquet("output/test")  
> input.printSchema()
> input.show()
> ## Result ###
> root
> |-- data: string (nullable = true)
> |-- id: integer (nullable = true)
> |-- date_month: string (nullable = true)
> |-- data_hour: date (nullable = true)
> ++---+--+-+
> |data| id|date_month|data_hour|
> ++---+--+-+
> |test|  1|   2018-01|     null|
> ++---+--+-+{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23436) Incorrect Date column Inference in partition discovery

2018-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365959#comment-16365959
 ] 

Apache Spark commented on SPARK-23436:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20621

> Incorrect Date column Inference in partition discovery
> --
>
> Key: SPARK-23436
> URL: https://issues.apache.org/jira/browse/SPARK-23436
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Apoorva Sareen
>Priority: Major
>
> If a partition column contains a partial date/timestamp,
>     for example: 2018-01-01-23 
> where the value is truncated to the hour, then the data type of the 
> partitioning column is automatically inferred as date; however, the values 
> are loaded as null. 
> Here is example code to reproduce this behaviour
>  
>  
> {code:java}
> val data = Seq(("1", "2018-01", "2018-01-01-04", "test")).toDF("id", 
> "date_month", "data_hour", "data")  
> data.write.partitionBy("id","date_month","data_hour").parquet("output/test")
> val input = spark.read.parquet("output/test")  
> input.printSchema()
> input.show()
> ## Result ###
> root
> |-- data: string (nullable = true)
> |-- id: integer (nullable = true)
> |-- date_month: string (nullable = true)
> |-- data_hour: date (nullable = true)
> ++---+--+-+
> |data| id|date_month|data_hour|
> ++---+--+-+
> |test|  1|   2018-01|     null|
> ++---+--+-+{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky

2018-02-15 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365931#comment-16365931
 ] 

Kazuaki Ishizaki commented on SPARK-23415:
--

I realized that there is an issue in this test case. I think this test case 
expects to trigger memory allocation in the {{grow()}} method four times. 
However, since the allocated memory is reused, the test performs memory 
allocation only once, with {{roundToWord(ARRAY_MAX / 2)}}; no memory allocation 
occurs for the other sizes.

If I update this test to perform four memory allocations with a {{4g}} heap 
size, an exception occurs.
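
A minimal, self-contained sketch of the reuse effect described above (not the 
real BufferHolder API; ToyHolder is hypothetical): once the buffer has grown to 
the largest requested size, later grow() calls that fit into the existing 
capacity never allocate again.

{code:java}
// Hypothetical holder that mimics buffer reuse: it only allocates when the
// requested size exceeds its current capacity.
class ToyHolder(var capacity: Long) {
  var allocations = 0
  def grow(needed: Long): Unit = {
    if (needed > capacity) {
      capacity = needed      // stand-in for allocating a larger backing array
      allocations += 1
    }
  }
}

val h = new ToyHolder(capacity = 0L)
// The first request allocates; the smaller follow-up requests are served by reuse.
Seq(1000000L, 500000L, 250000L, 125000L).foreach(h.grow)
println(h.allocations)   // prints 1, not 4
{code}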

> BufferHolderSparkSubmitSuite is flaky
> -
>
> Key: SPARK-23415
> URL: https://issues.apache.org/jira/browse/SPARK-23415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> The test suite fails due to 60-second timeout sometimes.
> {code}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> {code}
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky

2018-02-15 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365931#comment-16365931
 ] 

Kazuaki Ishizaki edited comment on SPARK-23415 at 2/15/18 5:02 PM:
---

I realized that there is an issue in this test case. I think this test case 
expects to trigger memory allocation in the {{grow()}} method four times. 
However, since the allocated memory is reused, the test performs memory 
allocation only once, with {{roundToWord(ARRAY_MAX / 2)}}; no memory allocation 
occurs for the other sizes.

If I update this test to perform four memory allocations with a {{4g}} heap 
size, an exception occurs.


was (Author: kiszk):
I realized that an issue in this test case. I think that this test case expects 
to try memory allocation in {{grow()} method four times. However, since the 
allocated memory is reused, this test performed memory allocation only once 
with {{roundToWord(ARRAY_MAX / 2)}}. No memory allocations occur with other 
sizes.

If I updated this test to perform four memory allocations with {{4g}} heap 
size,  an exception occurs.

> BufferHolderSparkSubmitSuite is flaky
> -
>
> Key: SPARK-23415
> URL: https://issues.apache.org/jira/browse/SPARK-23415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> The test suite fails due to 60-second timeout sometimes.
> {code}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> {code}
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-02-15 Thread Simon Dirmeier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365929#comment-16365929
 ] 

Simon Dirmeier commented on SPARK-23437:


Great suggestion. If there is a way to contribute I'd love to.

> [ML] Distributed Gaussian Process Regression for MLlib
> --
>
> Key: SPARK-23437
> URL: https://issues.apache.org/jira/browse/SPARK-23437
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.1
>Reporter: Valeriy Avanesov
>Priority: Major
>
> Gaussian Process Regression (GP) is a well-known black-box non-linear 
> regression approach [1]. For years the approach remained inapplicable to 
> large samples due to its cubic computational complexity; however, more recent 
> techniques (sparse GP) have brought this down to linear complexity. The field 
> continues to attract the interest of researchers – several papers devoted to 
> GP were presented at NIPS 2017. 
> Unfortunately, the non-parametric regression techniques shipped with MLlib are 
> restricted to tree-based approaches.
> I propose to create and include an implementation (which I am going to work 
> on) of the so-called robust Bayesian Committee Machine proposed and 
> investigated in [2].
> [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian 
> Processes for Machine Learning (Adaptive Computation and Machine Learning)_. 
> The MIT Press.
> [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian 
> processes. In _Proceedings of the 32nd International Conference on 
> International Conference on Machine Learning - Volume 37_ (ICML'15), Francis 
> Bach and David Blei (Eds.), Vol. 37. JMLR.org 1481-1490.
>  
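
For readers unfamiliar with the robust Bayesian Committee Machine, a brief 
summary of the combination rule as I recall it from [2] (please check the paper 
for the authoritative form): each of the M experts fits a GP on its own data 
shard, and their predictions at a test point x_* are merged by precision 
weighting, roughly:

{code}
\sigma_{rbcm}^{-2}(x_*) = \sum_{k=1}^{M} \beta_k \, \sigma_k^{-2}(x_*)
                          + \Big(1 - \sum_{k=1}^{M} \beta_k\Big) \, \sigma_{prior}^{-2}(x_*)

\mu_{rbcm}(x_*) = \sigma_{rbcm}^{2}(x_*) \sum_{k=1}^{M} \beta_k \, \sigma_k^{-2}(x_*) \, \mu_k(x_*)

\beta_k = \tfrac{1}{2}\big(\log \sigma_{prior}^{2}(x_*) - \log \sigma_k^{2}(x_*)\big)
{code}

Since the experts train on disjoint subsets and the aggregation is cheap, the 
method maps naturally onto a partitioned DataFrame, which is what makes a 
distributed MLlib implementation attractive.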



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23438) DStreams could lose blocks with WAL enabled when driver crashes

2018-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23438:


Assignee: (was: Apache Spark)

> DStreams could lose blocks with WAL enabled when driver crashes
> ---
>
> Key: SPARK-23438
> URL: https://issues.apache.org/jira/browse/SPARK-23438
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.0
>Reporter: Gabor Somogyi
>Priority: Critical
>
> There is a race condition introduced in SPARK-11141 which could cause data 
> loss.
> This affects all versions since 1.6.0.
> Problematic situation:
>  # Start streaming job with 2 receivers with WAL enabled.
>  # Receiver 1 receives a block and does the following
>  ** Writes a BlockAdditionEvent into WAL
>  ** Puts the block into its received block queue with ID 1
>  # Receiver 2 receives a block and does the following
>  ** Writes a BlockAdditionEvent into WAL
>  # Spark allocates all blocks from its received block queue and writes 
> AllocatedBlocks(IDs=(1)) into WAL
>  # Driver crashes
>  # New Driver recovers from WAL
>  # Realizes the block with ID 2 was never processed
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23438) DStreams could lose blocks with WAL enabled when driver crashes

2018-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365928#comment-16365928
 ] 

Apache Spark commented on SPARK-23438:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/20620

> DStreams could lose blocks with WAL enabled when driver crashes
> ---
>
> Key: SPARK-23438
> URL: https://issues.apache.org/jira/browse/SPARK-23438
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.0
>Reporter: Gabor Somogyi
>Priority: Critical
>
> There is a race condition introduced in SPARK-11141 which could cause data 
> loss.
> This affects all versions since 1.6.0.
> Problematic situation:
>  # Start streaming job with 2 receivers with WAL enabled.
>  # Receiver 1 receives a block and does the following
>  ** Writes a BlockAdditionEvent into WAL
>  ** Puts the block into its received block queue with ID 1
>  # Receiver 2 receives a block and does the following
>  ** Writes a BlockAdditionEvent into WAL
>  # Spark allocates all blocks from its received block queue and writes 
> AllocatedBlocks(IDs=(1)) into WAL
>  # Driver crashes
>  # New Driver recovers from WAL
>  # Realizes the block with ID 2 was never processed
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23438) DStreams could lose blocks with WAL enabled when driver crashes

2018-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23438:


Assignee: Apache Spark

> DStreams could lose blocks with WAL enabled when driver crashes
> ---
>
> Key: SPARK-23438
> URL: https://issues.apache.org/jira/browse/SPARK-23438
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Critical
>
> There is a race condition introduced in SPARK-11141 which could cause data 
> loss.
> This affects all versions since 1.6.0.
> Problematic situation:
>  # Start streaming job with 2 receivers with WAL enabled.
>  # Receiver 1 receives a block and does the following
>  ** Writes a BlockAdditionEvent into WAL
>  ** Puts the block into its received block queue with ID 1
>  # Receiver 2 receives a block and does the following
>  ** Writes a BlockAdditionEvent into WAL
>  # Spark allocates all blocks from its received block queue and writes 
> AllocatedBlocks(IDs=(1)) into WAL
>  # Driver crashes
>  # New Driver recovers from WAL
>  # Realizes the block with ID 2 was never processed
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7

2018-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23390:


Assignee: Wenchen Fan  (was: Apache Spark)

> Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
> --
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/
> From a very quick look, these failures seem to be correlated with 
> https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
>  
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might just be a false correlation, the frequency of these 
> test failures has increased considerably in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/
>  after https://github.com/apache/spark/pull/20562 (cc 
> [~feng...@databricks.com]) was merged.
> The following is Parquet leakage.
> {code}
> Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7

2018-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365927#comment-16365927
 ] 

Apache Spark commented on SPARK-23390:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20619

> Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
> --
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/
> From a very quick look, these failures seem to be correlated with 
> https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
>  
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might just be a false correlation, the frequency of these 
> test failures has increased considerably in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/
>  after https://github.com/apache/spark/pull/20562 (cc 
> [~feng...@databricks.com]) was merged.
> The following is Parquet leakage.
> {code}
> Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7

2018-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23390:


Assignee: Apache Spark  (was: Wenchen Fan)

> Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
> --
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.3.0
>
>
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/
> From a very quick look, these failures seem to be correlated with 
> https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
>  
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might just be a false correlation, the frequency of these 
> test failures has increased considerably in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/
>  after https://github.com/apache/spark/pull/20562 (cc 
> [~feng...@databricks.com]) was merged.
> The following is Parquet leakage.
> {code}
> Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23340) Upgrade Apache ORC to 1.4.3

2018-02-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23340:

Priority: Major  (was: Blocker)

> Upgrade Apache ORC to 1.4.3
> ---
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.0
>
>
> This issue updates the Apache ORC dependencies to 1.4.3, released on February 9th.
> The Apache ORC 1.4.2 release removed unnecessary dependencies, and 1.4.3 has 5 
> more patches including bug fixes (https://s.apache.org/Fll8).
> In particular, the following ORC-285 issue is fixed in 1.4.3.
> {code}
> scala> val df = Seq(Array.empty[Float]).toDF()
> scala> df.write.format("orc").save("/tmp/floatarray")
> scala> spark.read.orc("/tmp/floatarray")
> res1: org.apache.spark.sql.DataFrame = [value: array<float>]
> scala> spark.read.orc("/tmp/floatarray").show()
> 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.IOException: Error reading file: 
> file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
>   at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
> ...
> Caused by: java.io.EOFException: Read past EOF for compressed stream Stream 
> for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23340) Upgrade Apache ORC to 1.4.3

2018-02-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23340.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.3.0

> Upgrade Apache ORC to 1.4.3
> ---
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Blocker
> Fix For: 2.3.0
>
>
> This issue updates the Apache ORC dependencies to 1.4.3, released on February 9th.
> The Apache ORC 1.4.2 release removed unnecessary dependencies, and 1.4.3 has 5 
> more patches including bug fixes (https://s.apache.org/Fll8).
> In particular, the following ORC-285 issue is fixed in 1.4.3.
> {code}
> scala> val df = Seq(Array.empty[Float]).toDF()
> scala> df.write.format("orc").save("/tmp/floatarray")
> scala> spark.read.orc("/tmp/floatarray")
> res1: org.apache.spark.sql.DataFrame = [value: array<float>]
> scala> spark.read.orc("/tmp/floatarray").show()
> 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.IOException: Error reading file: 
> file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
>   at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
> ...
> Caused by: java.io.EOFException: Read past EOF for compressed stream Stream 
> for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23426) Use `hive` ORC impl and disable PPD for Spark 2.3.0

2018-02-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23426.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.3.0

> Use `hive` ORC impl and disable PPD for Spark 2.3.0
> ---
>
> Key: SPARK-23426
> URL: https://issues.apache.org/jira/browse/SPARK-23426
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2018-02-15 Thread Lucas Partridge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365904#comment-16365904
 ] 

Lucas Partridge commented on SPARK-17025:
-

What's the up-to-date status for this please? The previous comment implies 
testing is still needed whereas the article at 
[https://databricks.com/blog/2017/08/30/developing-custom-machine-learning-algorithms-in-pyspark.html]
 seems to imply everything's ready. Does anyone know which version of Spark 
this will make it into?
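For context, on the Scala/JVM side a custom {{Transformer}} picks up persistence simply by 
mixing in {{DefaultParamsWritable}}/{{DefaultParamsReadable}}; the open question in this 
ticket is exposing an equivalent to pure-Python transformers. A minimal Scala sketch (the 
{{NoOpFeaturizer}} class is hypothetical, purely for illustration):

{code:java}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// A do-nothing custom transformer that can be saved and loaded as part of a Pipeline.
class NoOpFeaturizer(override val uid: String) extends Transformer with DefaultParamsWritable {
  def this() = this(Identifiable.randomUID("noOpFeaturizer"))
  override def transform(dataset: Dataset[_]): DataFrame = dataset.toDF()
  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): NoOpFeaturizer = defaultCopy(extra)
}

// The companion object provides the load() side.
object NoOpFeaturizer extends DefaultParamsReadable[NoOpFeaturizer]
{code}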

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in <module>
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7

2018-02-15 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23390:
--
Description: 
We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
{{spark-branch-2.3-test-sbt-hadoop-2.7}}:

{code}
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 15 times over 10.01215805999 
seconds. Last failure message: There are 1 possibly leaked file streams..
{code}

Here's the full history: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/

From a very quick look, these failures seem to be correlated with 
https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from 
the following stack trace (full logs 
[here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
 

{code}
[info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
connection created at:
java.lang.Throwable
at 
org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
at 
org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
at 
org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
at 
org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
{code}

Also, while this might just be a false correlation, the frequency of these 
test failures has increased considerably in 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/
 after https://github.com/apache/spark/pull/20562 (cc 
[~feng...@databricks.com]) was merged.

The following is Parquet leakage.
{code}
Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
at 
org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
at 
org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538)
at 
org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
{code}


  was:
We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
{{spark-branch-2.3-test-sbt-hadoop-2.7}}:

{code}
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 15 times over 10.01215805999 
seconds. Last failure message: There are 1 possibly leaked file streams..
{code}

Here's the full history: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/

From a very quick look, these failures seem to be correlated with 
https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from 
the following stack trace (full logs 
[here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
 

{code}
[info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
15:55:58

[jira] [Reopened] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7

2018-02-15 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-23390:
---

I'm reopening this issue because Parquet leakage has been detected.

- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/205/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/

{code}
Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
at 
org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
at 
org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538)
at 
org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
{code}
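
For readers unfamiliar with these traces: the leak detection works by recording a 
{{Throwable}} at every stream open and asserting that nothing stays open after a test. 
A minimal, hypothetical sketch of the idea (not the actual 
{{org.apache.spark.DebugFilesystem}} source):

{code:java}
import java.io.InputStream
import scala.collection.mutable

// Every open() records a Throwable capturing the call site, every close() removes it,
// and the test harness asserts the map is empty after each test. The recorded Throwable
// is what produces the "Leaked filesystem connection created at:" stack traces above.
object OpenStreamTracker {
  private val openStreams = mutable.Map[InputStream, Throwable]()

  def onOpen(in: InputStream): Unit = synchronized {
    openStreams(in) = new Throwable()   // remembers who opened the stream
  }

  def onClose(in: InputStream): Unit = synchronized {
    openStreams.remove(in)
  }

  def assertNoOpenStreams(): Unit = synchronized {
    if (openStreams.nonEmpty) {
      throw new IllegalStateException(
        s"There are ${openStreams.size} possibly leaked file streams.",
        openStreams.values.head)
    }
  }
}
{code}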


> Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
> --
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/
> From a very quick look, these failures seem to be correlated with 
> https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
>  
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might just be a false correlation, the frequency of these 
> test failures has increased considerably in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/
>  after https://github.com/apache/spark/pull/20562 (cc 
> [~feng...@databricks.com]) was merged.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23439) Ambiguous reference when selecting column inside StructType with same name as outer column

2018-02-15 Thread Alejandro Trujillo Caballero (JIRA)
Alejandro Trujillo Caballero created SPARK-23439:


 Summary: Ambiguous reference when selecting column inside 
StructType with same name as outer column
 Key: SPARK-23439
 URL: https://issues.apache.org/jira/browse/SPARK-23439
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
 Environment: Scala 2.11.8, Spark 2.2.0
Reporter: Alejandro Trujillo Caballero


Hi.

I've seen that when working with nested struct fields in a DataFrame and doing 
a select operation, the nesting is lost, and this can result in collisions 
between column names.

For example:

 
{code:java}
case class Foo(a: Int, b: Bar)
case class Bar(a: Int)

val items = List(
  Foo(1, Bar(1)),
  Foo(2, Bar(2))
)

val df = spark.createDataFrame(items)

val df_a_a = df.select($"a", $"b.a").show
//+---+---+
//|  a|  a|
//+---+---+
//|  1|  1|
//|  2|  2|
//+---+---+

df.select($"a", $"b.a").printSchema
//root
//|-- a: integer (nullable = false)
//|-- a: integer (nullable = true)

df.select($"a", $"b.a").select($"a")
//org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: 
a#9, a#{code}
 

 

Shouldn't the second column be named "b.a"?

 

Thanks.
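
In the meantime, a workaround is to alias the nested field explicitly so the flattened 
column keeps a distinct name (a minimal sketch using the standard {{Column.as}} API, 
continuing the example above in the spark-shell):

{code:java}
// Alias the nested field so the select result has no duplicate column names.
val deduped = df.select($"a", $"b.a".as("b_a"))

deduped.printSchema()
// root
//  |-- a: integer (nullable = false)
//  |-- b_a: integer (nullable = true)

deduped.select($"a").show()   // no longer ambiguous
{code}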



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23438) DStreams could lose blocks with WAL enabled when driver crashes

2018-02-15 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365861#comment-16365861
 ] 

Gabor Somogyi commented on SPARK-23438:
---

I'm working on that.

> DStreams could lose blocks with WAL enabled when driver crashes
> ---
>
> Key: SPARK-23438
> URL: https://issues.apache.org/jira/browse/SPARK-23438
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.0
>Reporter: Gabor Somogyi
>Priority: Critical
>
> There is a race condition introduced in SPARK-11141 which could cause data 
> loss.
> This affects all versions since 1.6.0.
> Problematic situation:
>  # Start streaming job with 2 receivers with WAL enabled.
>  # Receiver 1 receives a block and does the following
>  ** Writes a BlockAdditionEvent into WAL
>  ** Puts the block into its received block queue with ID 1
>  # Receiver 2 receives a block and does the following
>  ** Writes a BlockAdditionEvent into WAL
>  # Spark allocates all blocks from its received block queue and writes 
> AllocatedBlocks(IDs=(1)) into WAL
>  # Driver crashes
>  # New Driver recovers from WAL
>  # Realizes the block with ID 2 was never processed
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23438) DStreams could lose blocks with WAL enabled when driver crashes

2018-02-15 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-23438:
-

 Summary: DStreams could lose blocks with WAL enabled when driver 
crashes
 Key: SPARK-23438
 URL: https://issues.apache.org/jira/browse/SPARK-23438
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 1.6.0
Reporter: Gabor Somogyi


There is a race condition introduced in SPARK-11141 which could cause data loss.

This affects all versions since 1.6.0.

Problematic situation:
 # Start streaming job with 2 receivers with WAL enabled.
 # Receiver 1 receives a block and does the following

 ** Writes a BlockAdditionEvent into WAL
 ** Puts the block into its received block queue with ID 1
 # Receiver 2 receives a block and does the following

 ** Writes a BlockAdditionEvent into WAL

 # Spark allocates all blocks from its received block queue and writes 
AllocatedBlocks(IDs=(1)) into WAL
 # Driver crashes
 # New Driver recovers from WAL
 # Realizes the block with ID 2 was never processed (see the sketch below)
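
A minimal, self-contained sketch of the interleaving (a hypothetical model, not the actual 
ReceivedBlockTracker code):

{code:java}
import scala.collection.mutable

// Models the interleaving described above: receiver 2's WAL write happens before the
// driver allocates the batch, but its in-memory enqueue has not happened yet.
object WalRaceSketch extends App {
  val wal   = mutable.Buffer[String]()   // what survives a driver crash
  val queue = mutable.Queue[Int]()       // in-memory received-block queue

  wal += "BlockAdditionEvent(1)"; queue.enqueue(1)   // receiver 1: WAL write + enqueue
  wal += "BlockAdditionEvent(2)"                     // receiver 2: WAL write only

  val allocated = queue.toList                       // driver allocates what it sees: only block 1
  queue.clear()
  wal += s"AllocatedBlocks(${allocated.mkString(",")})"

  // Driver crashes here. Recovery replays the WAL, sees the batch allocated as
  // AllocatedBlocks(1), and block 2 is never assigned to any batch.
  wal.foreach(println)
}
{code}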

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-02-15 Thread Valeriy Avanesov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valeriy Avanesov updated SPARK-23437:
-
Summary: [ML] Distributed Gaussian Process Regression for MLlib  (was: 
Distributed Gaussian Process Regression for MLlib)

> [ML] Distributed Gaussian Process Regression for MLlib
> --
>
> Key: SPARK-23437
> URL: https://issues.apache.org/jira/browse/SPARK-23437
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.1
>Reporter: Valeriy Avanesov
>Priority: Major
>
> Gaussian Process Regression (GP) is a well-known black-box non-linear 
> regression approach [1]. For years the approach remained inapplicable to 
> large samples due to its cubic computational complexity; however, more recent 
> techniques (Sparse GP) bring this down to linear complexity. The field 
> continues to attract the interest of researchers; several papers devoted to 
> GP were presented at NIPS 2017. 
> Unfortunately, the non-parametric regression techniques shipping with MLlib are 
> restricted to tree-based approaches.
> I propose to create and include an implementation (which I am going to work 
> on) of the so-called robust Bayesian Committee Machine proposed and investigated 
> in [2].
> [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian 
> Processes for Machine Learning (Adaptive Computation and Machine Learning)_. 
> The MIT Press.
> [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian 
> processes. In _Proceedings of the 32nd International Conference on 
> International Conference on Machine Learning - Volume 37_ (ICML'15), Francis 
> Bach and David Blei (Eds.), Vol. 37. JMLR.org 1481-1490.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23437) Distributed Gaussian Process Regression for MLlib

2018-02-15 Thread Valeriy Avanesov (JIRA)
Valeriy Avanesov created SPARK-23437:


 Summary: Distributed Gaussian Process Regression for MLlib
 Key: SPARK-23437
 URL: https://issues.apache.org/jira/browse/SPARK-23437
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 2.2.1
Reporter: Valeriy Avanesov


Gaussian Process Regression (GP) is a well-known black-box non-linear 
regression approach [1]. For years the approach remained inapplicable to large 
samples due to its cubic computational complexity; however, more recent 
techniques (Sparse GP) bring this down to linear complexity. The field continues 
to attract the interest of researchers; several papers devoted to GP were 
presented at NIPS 2017. 

Unfortunately, the non-parametric regression techniques shipping with MLlib are 
restricted to tree-based approaches.

I propose to create and include an implementation (which I am going to work on) 
of the so-called robust Bayesian Committee Machine proposed and investigated in [2].

[1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian 
Processes for Machine Learning (Adaptive Computation and Machine Learning)_. 
The MIT Press.

[2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian processes. 
In _Proceedings of the 32nd International Conference on International 
Conference on Machine Learning - Volume 37_ (ICML'15), Francis Bach and David 
Blei (Eds.), Vol. 37. JMLR.org 1481-1490.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23436) Incorrect Date column Inference in partition discovery

2018-02-15 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365780#comment-16365780
 ] 

Marco Gaido commented on SPARK-23436:
-

Thanks for reporting this. This also affects the current branch. I am working on it.

> Incorrect Date column Inference in partition discovery
> --
>
> Key: SPARK-23436
> URL: https://issues.apache.org/jira/browse/SPARK-23436
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Apoorva Sareen
>Priority: Major
>
> If a partition column value is a partial date/timestamp,
>     example : 2018-01-01-23 
> i.e. only truncated up to the hour, then the data type of the 
> partitioning column is automatically inferred as date; however, the values 
> are loaded as null. 
> Here is an example code to reproduce this behaviour
>  
>  
> {code:java}
> val data = Seq(("1", "2018-01", "2018-01-01-04", "test")).toDF("id", 
> "date_month", "data_hour", "data")  
> data.write.partitionBy("id","date_month","data_hour").parquet("output/test")
> val input = spark.read.parquet("output/test")  
> input.printSchema()
> input.show()
> ## Result ###
> root
> |-- data: string (nullable = true)
> |-- id: integer (nullable = true)
> |-- date_month: string (nullable = true)
> |-- data_hour: date (nullable = true)
> ++---+--+-+
> |data| id|date_month|data_hour|
> ++---+--+-+
> |test|  1|   2018-01|     null|
> ++---+--+-+{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-15 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365468#comment-16365468
 ] 

Stavros Kontopoulos edited comment on SPARK-23423 at 2/15/18 3:37 PM:
--

The delivery of mesos task updates is a responsibility of the mesos master AFAIK. 
Btw you should see the status of tasks in the log of the agent where the mesos 
executor was running (assuming you are not losing agents, right?). [~susanxhuynh] 
is it possible that under heavy load the task updates are not delivered? 
[~igor.berman] could you attach any log available?


was (Author: skonto):
The delivery of mesos task updates is a responsibility of the mesos master AFAIK. 
Btw you should see the status of tasks in the log of the agent where the mesos 
executor was running (assuming you are not losing agents, right?). [~susanxhuynh] 
is it possible that under heavy load the task updates are not delivered?

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption (with some idle times in between); due to dyn.allocation it 
> receives offers and then releases them after the current chunk of work is processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, the number of taskIDs should be updated via statusUpdate, 
> but suppose this update is lost (actually I don't see logs of 'is now 
> TASK_KILLED'), so this number of executors might be wrong.
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advise
> Igor
> marking you, friends, since you were the last to touch this piece of code and 
> can probably advise something ([~vanzin], [~skonto], [~susanxhuynh])
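
To make the accounting issue described above concrete, here is a tiny hypothetical sketch 
(the {{AgentState}} class and field names are made up, not the actual 
MesosCoarseGrainedSchedulerBackend internals):

{code:java}
// One agent whose executor was killed, but the TASK_KILLED update was lost,
// so its taskIDs were never cleared.
case class AgentState(taskIDs: Set[String], active: Boolean)

val slaves = Map("s1" -> AgentState(Set("task-0"), active = false))
val executorLimit = 1   // spark.dynamicAllocation.maxExecutors

// What the backend effectively computes (per the description above):
val numExecutors = slaves.values.map(_.taskIDs.size).sum   // = 1
println(numExecutors < executorLimit)                      // false -> the new offer is declined

// Counting only agents that are still active would accept the new offer:
val numActive = slaves.values.filter(_.active).map(_.taskIDs.size).sum   // = 0
println(numActive < executorLimit)                         // true
{code}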



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23436) Incorrect Date column Inference in partition discovery

2018-02-15 Thread Apoorva Sareen (JIRA)
Apoorva Sareen created SPARK-23436:
--

 Summary: Incorrect Date column Inference in partition discovery
 Key: SPARK-23436
 URL: https://issues.apache.org/jira/browse/SPARK-23436
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: Apoorva Sareen


If a partition column value is a partial date/timestamp,

    example : 2018-01-01-23 

i.e. only truncated up to the hour, then the data type of the partitioning 
column is automatically inferred as date; however, the values are loaded as 
null. 

Here is an example code to reproduce this behaviour

 

 
{code:java}
val data = Seq(("1", "2018-01", "2018-01-01-04", "test")).toDF("id", 
"date_month", "data_hour", "data")  
data.write.partitionBy("id","date_month","data_hour").parquet("output/test")
val input = spark.read.parquet("output/test")  
input.printSchema()
input.show()

## Result ###
root
|-- data: string (nullable = true)
|-- id: integer (nullable = true)
|-- date_month: string (nullable = true)
|-- data_hour: date (nullable = true)

++---+--+-+
|data| id|date_month|data_hour|
++---+--+-+
|test|  1|   2018-01|     null|
++---+--+-+{code}
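
Until the inference is fixed, a possible workaround (using the existing 
{{spark.sql.sources.partitionColumnTypeInference.enabled}} flag) is to disable partition 
column type inference so such values are kept as strings instead of null dates:

{code:java}
// Workaround sketch: turn off partition column type inference so the
// "2018-01-01-04" value is kept as a string instead of a (null) date.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

val input = spark.read.parquet("output/test")
input.printSchema()   // data_hour is now a string column
input.show()
{code}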
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21302) history server WebUI show HTTP ERROR 500

2018-02-15 Thread bharath kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365691#comment-16365691
 ] 

bharath kumar commented on SPARK-21302:
---

From what I have noticed, jobs are progressing even though we have the NPE. 
Once the jobs are completed, we can view the stats for the Application Master.

> history server WebUI show HTTP ERROR 500
> 
>
> Key: SPARK-21302
> URL: https://issues.apache.org/jira/browse/SPARK-21302
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Jason Pan
>Priority: Major
> Attachments: npe.PNG, nullpointer.PNG
>
>
> When navigate to history server WebUI, and check incomplete applications, 
> show http 500
> Error logs:
> 17/07/05 20:17:44 INFO ApplicationCacheCheckFilter: Application Attempt 
> app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/None updated; 
> refreshing
> 17/07/05 20:17:44 WARN ServletHandler: 
> /history/app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/executors/
> java.lang.NullPointerException
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.spark_project.jetty.server.Server.handle(Server.java:499)
> at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> at java.lang.Thread.run(Thread.java:785)
> 17/07/05 20:18:00 WARN ServletHandler: /
> java.lang.NullPointerException
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
> at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479)
> at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
> at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.spark_project.jetty.server.Server.handle(Server.java:499)
> at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
> at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
> at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
> at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
> at java.lang.Thread.run(Thread.java:785)
> 17/07/05 20:18:17 WARN ServletHandler: /
> java.lang.NullPointerException
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-02-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365666#comment-16365666
 ] 

Hyukjin Kwon commented on SPARK-23410:
--

It's reverted in https://github.com/apache/spark/pull/20614 anyway. Shall we 
unset the blocker?

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Blocker
> Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read json files in UTF-8. Such 
> behavior breaks backward compatibility with Spark 2.2.1 and previous versions, 
> which can read json files in UTF-16, UTF-32 and other encodings thanks to the 
> auto-detection mechanism of the jackson library. We need to give users back the 
> possibility to read json files in a specified charset and/or to detect the 
> charset automatically, as it was before.
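
As a stop-gap for non-UTF-8 files, the bytes can be decoded manually and the decoded lines 
handed to {{spark.read.json}} (a rough sketch; the charset and path are assumptions for 
illustration, matching the attached utf16WithBOM.json):

{code:java}
import spark.implicits._

// Decode a UTF-16 (BOM-prefixed) JSON file ourselves, then let Spark parse it.
val charset = "UTF-16"   // adjust to the file's actual encoding
val decodedLines = spark.sparkContext
  .binaryFiles("/path/to/utf16WithBOM.json")
  .flatMap { case (_, stream) => new String(stream.toArray, charset).split("\n") }

val df = spark.read.json(decodedLines.toDS())
df.show()
{code}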



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23405) The task will hang up when a small table left semi join a big table

2018-02-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23405.
---
Resolution: Invalid

> The task will hang up when a small table left semi join a big table
> ---
>
> Key: SPARK-23405
> URL: https://issues.apache.org/jira/browse/SPARK-23405
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Priority: Major
> Attachments: SQL.png, taskhang up.png
>
>
> I run a SQL query: `select ls.cs_order_number from ls left semi join catalog_sales 
> cs on ls.cs_order_number = cs.cs_order_number`. The `ls` table is a small 
> table with one row. The `catalog_sales` table is a big table with 
> 10 billion rows. The task hangs:
> !taskhang up.png!
>  And the SQL page is:
> !SQL.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23402) Dataset write method not working as expected for postgresql database

2018-02-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365549#comment-16365549
 ] 

Sean Owen commented on SPARK-23402:
---

This is all just the master branch of github.com/apache/spark and yes you need 
to build it yourself.

> Dataset write method not working as expected for postgresql database
> 
>
> Key: SPARK-23402
> URL: https://issues.apache.org/jira/browse/SPARK-23402
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: PostgreSQL: 9.5.8 (10 + Also same issue)
> OS: Cent OS 7 & Windows 7,8
> JDBC: 9.4-1201-jdbc41
>  
> Spark:  I executed in both 2.1.0 and 2.2.1
> Mode: Standalone
> OS: Windows 7
>Reporter: Pallapothu Jyothi Swaroop
>Priority: Major
> Attachments: Emsku[1].jpg
>
>
> I am using Spark dataset write to insert data into an existing PostgreSQL table. 
> For this I am using the write method with append mode. While doing so I am 
> getting an exception that the table already exists, even though I specified 
> append mode.
> It's strange. When I change the options to SQL Server/Oracle, append mode 
> works as expected.
>  
> *Database Properties:*
> {{destinationProps.put("driver", "org.postgresql.Driver"); 
> destinationProps.put("url", "jdbc:postgresql://127.0.0.1:30001/dbmig"); 
> destinationProps.put("user", "dbmig");}}
> {{destinationProps.put("password", "dbmig");}}
>  
> *Dataset Write Code:*
> {{valueAnalysisDataset.write().mode(SaveMode.Append).jdbc(destinationDbMap.get("url"),
>  "dqvalue", destinationdbProperties);}} 
>  
>  
> {{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: 
> relation "dqvalue" already exists at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) 
> at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at 
> org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at 
> org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at 
> org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) 
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609) 
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at 
> org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:460) at 
> com.ads.dqam.action.impl.PostgresValueAnalysis.persistValueAnalysis(PostgresValueAnalysis.java:25)
>  at 
> com.ads.dqam.action.AbstractValueAnalysis.persistAnalysis(AbstractValueAnalysis.java:81)
>  at com.ads.dqam.Analysis.doAnalysis(Analysis.java:32) at 
> com.ads.dqam.Client.main(Client.java:71)}}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-15 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365495#comment-16365495
 ] 

Steve Loughran commented on SPARK-23308:


I'm going to recommend this be closed as a WONTFIX. It's not Spark's place to 
determine what is recoverable; it'd never get it right.

> ignoreCorruptFiles should not ignore retryable IOException
> --
>
> Key: SPARK-23308
> URL: https://issues.apache.org/jira/browse/SPARK-23308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Márcio Furlani Carmona
>Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set, it totally ignores any kind 
> of RuntimeException or IOException, but some IOExceptions may happen 
> even if the file is not corrupted.
> One example is the SocketTimeoutException, which can be retried to possibly 
> fetch the data; it does not mean the data is corrupted.
>  
> See: 
> https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163
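
For illustration, the kind of distinction the reporter is asking for might look like the 
following (a hypothetical helper, not the actual FileScanRDD logic; {{openAndReadFile}} in 
the usage comment is a made-up name):

{code:java}
import java.net.SocketTimeoutException

// Retry transient IOExceptions such as socket timeouts a few times before giving up,
// instead of treating them the same way as a genuinely corrupt file.
def readWithRetry[T](maxAttempts: Int)(read: => T): T =
  try read
  catch {
    case _: SocketTimeoutException if maxAttempts > 1 =>
      // retryable: the data is probably fine, try again
      readWithRetry(maxAttempts - 1)(read)
  }

// e.g. wrap the per-file read:
// val rows = readWithRetry(maxAttempts = 3) { openAndReadFile(file) }
{code}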



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-15 Thread Igor Berman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Berman updated SPARK-23423:

Labels: Mesos dynamic_allocation  (was: )

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption (with some idle times in between); due to dyn.allocation it 
> receives offers and then releases them after the current chunk of work is processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, the number of taskIDs should be updated via statusUpdate, 
> but suppose this update is lost (actually I don't see logs of 'is now 
> TASK_KILLED'), so this number of executors might be wrong.
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advise
> Igor
> marking you, friends, since you were the last to touch this piece of code and 
> can probably advise something ([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-15 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365468#comment-16365468
 ] 

Stavros Kontopoulos edited comment on SPARK-23423 at 2/15/18 12:31 PM:
---

The delivery of mesos task updates is a responsibility of the mesos master AFAIK. 
Btw you should see the status of tasks in the log of the agent where the mesos 
executor was running (assuming you are not losing agents, right?). [~susanxhuynh] 
is it possible that under heavy load the task updates are not delivered?


was (Author: skonto):
The delivery of task updates is a responsibility of the mesos master AFAIK. Btw you 
should see the status of tasks in the log of the agent where the mesos executor 
was running (assuming you are not losing agents, right?). [~susanxhuynh] is it 
possible that under heavy load the task updates are not delivered?

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption (with some idle times in between); due to dyn.allocation it 
> receives offers and then releases them after the current chunk of work is processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, the number of taskIDs should be updated via statusUpdate, 
> but suppose this update is lost (actually I don't see logs of 'is now 
> TASK_KILLED'), so this number of executors might be wrong.
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advise
> Igor
> marking you, friends, since you were the last to touch this piece of code and 
> can probably advise something ([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-15 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365468#comment-16365468
 ] 

Stavros Kontopoulos commented on SPARK-23423:
-

The task updates delivery is the responsibility of the Mesos master, AFAIK. Btw, you 
should see the status of the tasks in the log of the agent where the Mesos executor was 
running (assuming you are not losing agents, right?). [~susanxhuynh] is it possible that 
under heavy load the task updates are not delivered?

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-15 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365438#comment-16365438
 ] 

Igor Berman commented on SPARK-23423:
-

[~skonto] do you think it could be connected to 
https://issues.apache.org/jira/browse/SPARK-21460? That JIRA is about Yarn, but it seems 
like it might have the same root cause.

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-15 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365436#comment-16365436
 ] 

Igor Berman commented on SPARK-23423:
-

[~skonto], yes, this is correct. Besides the (many) TASK_RUNNING reports, only 2 
TASK_FAILURE reports are present (there are no FINISHED or LOST tasks).
We have a cluster with approximately 100 machines, and for this specific framework many 
executors were running (40 or so); many of them were killed during scale-down (>> 2). I 
can see in the Mesos UI that there are completed tasks in the KILLED state for this 
application.

The question is: where are all the other reports? Can you think of any reason they would 
not be reported, or the driver would not register them?

 

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23422) YarnShuffleIntegrationSuite failure when SPARK_PREPEND_CLASSES set to 1

2018-02-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23422:
--

Assignee: Gabor Somogyi

> YarnShuffleIntegrationSuite failure when SPARK_PREPEND_CLASSES set to 1
> ---
>
> Key: SPARK-23422
> URL: https://issues.apache.org/jira/browse/SPARK-23422
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Minor
> Fix For: 2.3.1, 2.4.0
>
>
> YarnShuffleIntegrationSuite fails when SPARK_PREPEND_CLASSES set to 1.
> Normally mllib built before yarn module. When SPARK_PREPEND_CLASSES used 
> mllib classes are on yarn test classpath.
> Before 2.3 that did not cause issues. But 2.3 has SPARK-22450, which 
> registered some mllib classes with the kryo serializer. Now it dies with the 
> following error:
> {code:java}
> 18/02/13 07:33:29 INFO SparkContext: Starting job: collect at 
> YarnShuffleIntegrationSuite.scala:143
> Exception in thread "dag-scheduler-event-loop" 
> java.lang.NoClassDefFoundError: breeze/linalg/DenseMatrix
> {code}
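As an aside, one generic way to make such Kryo registrations tolerant of classes that are absent from the test classpath is to guard them with a reflective lookup. This is only an illustrative sketch and is not claimed to be the approach taken in the actual fix for this ticket; the helper name is an assumption.

{code:java}
import com.esotericsoftware.kryo.Kryo
import scala.util.Try

// Register a class with Kryo only if it can actually be loaded; otherwise skip it
// instead of failing later with NoClassDefFoundError.
def registerIfPresent(kryo: Kryo, className: String): Unit =
  Try(Class.forName(className)).toOption.foreach(cls => kryo.register(cls))

// Hypothetical call site:
// registerIfPresent(kryo, "breeze.linalg.DenseMatrix")
{code}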



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23422) YarnShuffleIntegrationSuite failure when SPARK_PREPEND_CLASSES set to 1

2018-02-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23422.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 20608
[https://github.com/apache/spark/pull/20608]

> YarnShuffleIntegrationSuite failure when SPARK_PREPEND_CLASSES set to 1
> ---
>
> Key: SPARK-23422
> URL: https://issues.apache.org/jira/browse/SPARK-23422
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Minor
> Fix For: 2.4.0, 2.3.1
>
>
> YarnShuffleIntegrationSuite fails when SPARK_PREPEND_CLASSES set to 1.
> Normally mllib built before yarn module. When SPARK_PREPEND_CLASSES used 
> mllib classes are on yarn test classpath.
> Before 2.3 that did not cause issues. But 2.3 has SPARK-22450, which 
> registered some mllib classes with the kryo serializer. Now it dies with the 
> following error:
> {code:java}
> 18/02/13 07:33:29 INFO SparkContext: Starting job: collect at 
> YarnShuffleIntegrationSuite.scala:143
> Exception in thread "dag-scheduler-event-loop" 
> java.lang.NoClassDefFoundError: breeze/linalg/DenseMatrix
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-15 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365295#comment-16365295
 ] 

Stavros Kontopoulos commented on SPARK-23423:
-

[~igor.berman] From the log I see that the scheduler received two task updates with 
state TASK_FAILURE; is that correct?

The executor removal is triggered if the status of the received task belongs to any of 
the following states:

FINISHED, FAILED, KILLED, LOST.

[https://github.com/apache/spark/blob/ed8647609883fcef16be5d24c2cb4ebda25bd6f0/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L640]

[https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/TaskState.scala#L22]
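For quick reference, a minimal sketch of that terminal-state check, assuming spark-core on the classpath; the helper and value names here are illustrative, only the four states themselves come from the links above.

{code:java}
import org.apache.spark.TaskState

// Terminal states after which the Mesos backend rolls back its executor bookkeeping.
val terminalStates: Set[TaskState.TaskState] =
  Set(TaskState.FINISHED, TaskState.FAILED, TaskState.KILLED, TaskState.LOST)

def isTerminal(state: TaskState.TaskState): Boolean = terminalStates.contains(state)

// If a TASK_KILLED status update never reaches the driver, this check is never
// evaluated for that task, so the killed executor keeps counting towards
// spark.dynamicAllocation.maxExecutors.
{code}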

 

 

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23435) R tests should support latest testthat

2018-02-15 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-23435:


 Summary: R tests should support latest testthat
 Key: SPARK-23435
 URL: https://issues.apache.org/jira/browse/SPARK-23435
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.1, 2.4.0
Reporter: Felix Cheung


To follow up on SPARK-22817: the latest version of testthat, 2.0.0, was released in Dec 
2017, and its methods have changed.

In order for our tests to keep working, we need to detect the installed version and call 
the appropriate method.

Jenkins is running 1.0.1, though; we need to check whether that will keep working.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23359) Adds an alias 'names' of 'fieldNames' in Scala's StructType

2018-02-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23359.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20545
[https://github.com/apache/spark/pull/20545]

> Adds an alias 'names' of 'fieldNames' in Scala's StructType
> ---
>
> Key: SPARK-23359
> URL: https://issues.apache.org/jira/browse/SPARK-23359
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.4.0
>
>
> See the discussion in SPARK-20090. It happened to deprecate {{names}} in 
> Python side but seems we better add an alias {{names}} in Scala side too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22817) Use fixed testthat version for SparkR tests in AppVeyor

2018-02-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365263#comment-16365263
 ] 

Felix Cheung edited comment on SPARK-22817 at 2/15/18 9:13 AM:
---

I should have caught this - -we need to fix the test because it will fail on CRAN; 
another option is to fix the dependency version in the DESCRIPTION file-

Scratch that: on CRAN we are calling test_package, which works fine.


was (Author: felixcheung):
I should have caught this - we need to fix the test because it will fail on CRAN; 
another option is to fix the dependency version in the DESCRIPTION file.

> Use fixed testthat version for SparkR tests in AppVeyor
> ---
>
> Key: SPARK-22817
> URL: https://issues.apache.org/jira/browse/SPARK-22817
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.2.2, 2.3.0
>
>
> We happened to access to the internal {{run_tests}} - 
> https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75. 
> https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L58
> This seems removed out in 2.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23359) Adds an alias 'names' of 'fieldNames' in Scala's StructType

2018-02-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23359:
---

Assignee: Hyukjin Kwon

> Adds an alias 'names' of 'fieldNames' in Scala's StructType
> ---
>
> Key: SPARK-23359
> URL: https://issues.apache.org/jira/browse/SPARK-23359
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.4.0
>
>
> See the discussion in SPARK-20090. It happened to deprecate {{names}} in 
> Python side but seems we better add an alias {{names}} in Scala side too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23366) Improve hot reading path in ReadAheadInputStream

2018-02-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23366.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20555
[https://github.com/apache/spark/pull/20555]

> Improve hot reading path in ReadAheadInputStream
> 
>
> Key: SPARK-23366
> URL: https://issues.apache.org/jira/browse/SPARK-23366
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 2.4.0
>
>
> ReadAheadInputStream was introduced in 
> [apache/spark#18317|https://github.com/apache/spark/pull/18317] to optimize 
> reading spill files from disk.
> However, investigating flamegraphs of profiles from investigating some 
> regressed workloads after switch to Spark 2.3, it seems that the hot path of 
> reading small amounts of data (like readInt) is inefficient - it involves 
> taking locks, and multiple checks.
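A minimal sketch of the kind of lock-free fast path being discussed follows; the class, field, and method names are assumptions for illustration and do not reflect ReadAheadInputStream's actual internals.

{code:java}
import java.io.InputStream

// Serve byte-at-a-time reads from an already-filled buffer without locking, and only
// fall back to a synchronized refill when the buffer is exhausted.
class FastPathInputStream(underlying: InputStream, bufferSize: Int = 8192) extends InputStream {
  private var buf = new Array[Byte](0)
  private var pos = 0

  override def read(): Int =
    if (pos < buf.length) { val b = buf(pos) & 0xff; pos += 1; b }  // hot path: no lock
    else refillAndRead()                                            // slow path: locked

  private def refillAndRead(): Int = synchronized {
    val tmp = new Array[Byte](bufferSize)
    val n = underlying.read(tmp)
    if (n <= 0) -1
    else { buf = java.util.Arrays.copyOf(tmp, n); pos = 1; buf(0) & 0xff }
  }
}
{code}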



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23366) Improve hot reading path in ReadAheadInputStream

2018-02-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23366:
---

Assignee: Juliusz Sompolski

> Improve hot reading path in ReadAheadInputStream
> 
>
> Key: SPARK-23366
> URL: https://issues.apache.org/jira/browse/SPARK-23366
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
> Fix For: 2.4.0
>
>
> ReadAheadInputStream was introduced in 
> [apache/spark#18317|https://github.com/apache/spark/pull/18317] to optimize 
> reading spill files from disk.
> However, investigating flamegraphs of profiles from investigating some 
> regressed workloads after switch to Spark 2.3, it seems that the hot path of 
> reading small amounts of data (like readInt) is inefficient - it involves 
> taking locks, and multiple checks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22817) Use fixed testthat version for SparkR tests in AppVeyor

2018-02-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365263#comment-16365263
 ] 

Felix Cheung commented on SPARK-22817:
--

I should have caught this - we need to fix the test because it will fail on CRAN; 
another option is to fix the dependency version in the DESCRIPTION file.

> Use fixed testthat version for SparkR tests in AppVeyor
> ---
>
> Key: SPARK-22817
> URL: https://issues.apache.org/jira/browse/SPARK-22817
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.2.2, 2.3.0
>
>
> We happened to access to the internal {{run_tests}} - 
> https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75. 
> https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L58
> This seems removed out in 2.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions

2018-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23329:


Assignee: Apache Spark

> Update the function descriptions with the arguments and returned values of 
> the trigonometric functions
> --
>
> Key: SPARK-23329
> URL: https://issues.apache.org/jira/browse/SPARK-23329
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> We need an update on the function descriptions for all the trigonometric 
> functions. For example, {{cos}}, {{sin}}, and {{cot}}. Internally, the 
> implementation is based on the java.lang.Math. We need a clear description 
> about the units of the input arguments and the returned values. 
> For example, the following descriptions are lacking such info. 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555
> https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978
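To illustrate the point about units with a small standalone example (this is not the proposed doc text itself): the java.lang.Math trigonometric functions these expressions build on operate on radians, which is exactly what the updated descriptions need to spell out for both input arguments and return values.

{code:java}
// cos/sin/tan take angles in radians, not degrees.
val degrees = 60.0
val radians = math.toRadians(degrees)   // ~1.0472
val cosine  = math.cos(radians)         // ~0.5 (cos of 60 degrees)

// Inverse functions return radians as well:
val angleBack = math.toDegrees(math.acos(cosine))  // ~60.0
{code}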



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions

2018-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23329:


Assignee: (was: Apache Spark)

> Update the function descriptions with the arguments and returned values of 
> the trigonometric functions
> --
>
> Key: SPARK-23329
> URL: https://issues.apache.org/jira/browse/SPARK-23329
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Minor
>  Labels: starter
>
> We need an update on the function descriptions for all the trigonometric 
> functions. For example, {{cos}}, {{sin}}, and {{cot}}. Internally, the 
> implementation is based on the java.lang.Math. We need a clear description 
> about the units of the input arguments and the returned values. 
> For example, the following descriptions are lacking such info. 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555
> https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions

2018-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365261#comment-16365261
 ] 

Apache Spark commented on SPARK-23329:
--

User 'misutoth' has created a pull request for this issue:
https://github.com/apache/spark/pull/20618

> Update the function descriptions with the arguments and returned values of 
> the trigonometric functions
> --
>
> Key: SPARK-23329
> URL: https://issues.apache.org/jira/browse/SPARK-23329
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Minor
>  Labels: starter
>
> We need an update on the function descriptions for all the trigonometric 
> functions. For example, {{cos}}, {{sin}}, and {{cot}}. Internally, the 
> implementation is based on the java.lang.Math. We need a clear description 
> about the units of the input arguments and the returned values. 
> For example, the following descriptions are lacking such info. 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555
> https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-02-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23416.
-
   Resolution: Fixed
Fix Version/s: 2.3.1

Issue resolved by pull request 20605
[https://github.com/apache/spark/pull/20605]

> flaky test: 
> org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress
>  test for failOnDataLoss=false
> --
>
> Key: SPARK-23416
> URL: https://issues.apache.org/jira/browse/SPARK-23416
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
> Fix For: 2.3.1
>
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or 
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted. at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
>  Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
> job aborted. at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> 

[jira] [Resolved] (SPARK-23419) data source v2 write path should re-throw interruption exceptions directly

2018-02-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23419.
-
   Resolution: Fixed
Fix Version/s: 2.3.1

Issue resolved by pull request 20605
[https://github.com/apache/spark/pull/20605]

> data source v2 write path should re-throw interruption exceptions directly
> --
>
> Key: SPARK-23419
> URL: https://issues.apache.org/jira/browse/SPARK-23419
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.1
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work

2018-02-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23413:
---
Priority: Blocker  (was: Major)

> Sorting tasks by Host / Executor ID on the Stage page does not work
> ---
>
> Key: SPARK-23413
> URL: https://issues.apache.org/jira/browse/SPARK-23413
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Attila Zsolt Piros
>Priority: Blocker
>
> Sorting tasks by Host / Executor ID throws exceptions: 
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Executor ID at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> {code}
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Host at 
> org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at 
> org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at 
> org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at 
> org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at 
> org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at 
> org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at 
> org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
> {code}
>   !image-2018-02-13-16-50-32-600.png!
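Purely for illustration, the stack traces above are consistent with a display-name to index-name lookup that simply lacks entries for these columns. A toy version of such a lookup might look like the following; the map contents and names are assumptions, not Spark's actual ApiHelper code.

{code:java}
// Hypothetical lookup from the column header shown in the UI to the name used for
// sorting; any header missing from the map reproduces the exception above.
val sortColumnToIndex: Map[String, String] = Map(
  "Index"       -> "index",
  "Task ID"     -> "taskId",
  "Host"        -> "host",
  "Executor ID" -> "executorId"
)

def indexName(sortColumn: String): String =
  sortColumnToIndex.getOrElse(
    sortColumn,
    throw new IllegalArgumentException(s"Invalid sort column: $sortColumn"))
{code}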



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org