[jira] [Updated] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level

2017-03-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19816:
-
Affects Version/s: (was: 2.2.0)

> DataFrameCallbackSuite doesn't recover the log level
> 
>
> Key: SPARK-19816
> URL: https://issues.apache.org/jira/browse/SPARK-19816
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> "DataFrameCallbackSuite.execute callback functions when a DataFrame action 
> failed" sets the log level to "fatal" but doesn't recover it. Hence, tests 
> running after it won't output any logs except fatal logs.
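For illustration, a rough PySpark-flavoured sketch of the save-and-restore pattern the
title implies (the actual suite is Scala; `sc._jvm` is an internal gateway used here
only to read the current root log4j level):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("loglevel-demo").getOrCreate()
sc = spark.sparkContext

# Remember the current root log level before lowering it (internal API, may change).
old_level = sc._jvm.org.apache.log4j.LogManager.getRootLogger().getLevel().toString()
try:
    sc.setLogLevel("FATAL")    # silence everything below FATAL for this test
    # ... assertions that expect a failed DataFrame action would run here ...
    pass
finally:
    sc.setLogLevel(old_level)  # restore, so later tests keep their normal logging
spark.stop()
{code}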



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level

2017-03-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19816:
-
Affects Version/s: 2.1.0

> DataFrameCallbackSuite doesn't recover the log level
> 
>
> Key: SPARK-19816
> URL: https://issues.apache.org/jira/browse/SPARK-19816
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> "DataFrameCallbackSuite.execute callback functions when a DataFrame action 
> failed" sets the log level to "fatal" but doesn't recover it. Hence, tests 
> running after it won't output any logs except fatal logs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19815) Not orderable should be applied to right key instead of left key

2017-03-03 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895522#comment-15895522
 ] 

Zhan Zhang commented on SPARK-19815:


I thought through the logic again. On the surface the existing logic may be 
correct, since in a join the left and right keys should have the same type. 
Please close this JIRA.

> Not orderable should be applied to right key instead of left key
> 
>
> Key: SPARK-19815
> URL: https://issues.apache.org/jira/browse/SPARK-19815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Priority: Minor
>
> When generating ShuffledHashJoinExec, the orderable condition should be 
> applied to right key instead of left key.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read

2017-03-03 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895502#comment-15895502
 ] 

Imran Rashid commented on SPARK-19659:
--

I think Reynold has a good point.  I really don't like the idea of always having 
the MapStatus track 2k sizes -- I already have to regularly recommend that users 
bump their partition count above 2k to avoid an OOM from too many 
CompressedMapStatus instances.  Going over 2k partitions generally gives big 
memory savings from using HighlyCompressedMapStatus.

Your point about deciding how many outliers to track is valid, but I think there 
are a lot of other options you might consider as well, e.g., track all the sizes 
that are more than 2x the average, or track a few different size buckets and 
keep a bit set for each bucket, etc.  These should allow the MapStatus to stay 
very compact, but with bounded error on the sizes.

For implementation, I'd also break your proposal down into smaller pieces.  In 
fact, the three ideas are all useful independently (though they are more robust 
together).

But there are two larger pieces I see missing: 1) how will we test the changes 
out -- not for correctness, but for performance / stability benefits?  2) are 
there metrics we should be collecting, which we currently are not, so we can 
better answer these questions?  E.g., the distribution of sizes in MapStatus is 
not stored anywhere for later analysis (though it's not easy to come up with a 
good way to store them, since there are n^2 sizes in one shuffle); how much 
memory is used by the network layer; how much error is there in the sizes from 
the MapStatus; etc.  I think some parts can be implemented anyway, behind 
feature flags (perhaps undocumented), but it's something to keep in mind.
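As a rough illustration of the "more than 2x the average" idea, a minimal Python
sketch (the helper is made up for this discussion, not Spark's MapStatus code):

{code}
def summarize_block_sizes(sizes, outlier_factor=2.0):
    """Keep exact sizes only for blocks larger than outlier_factor * average."""
    if not sizes:
        return 0.0, {}
    avg = sum(sizes) / len(sizes)
    outliers = {i: s for i, s in enumerate(sizes) if s > outlier_factor * avg}
    return avg, outliers

# Example: one huge skewed block among small ones.
avg, outliers = summarize_block_sizes([10, 12, 9, 11, 400, 8])
# avg == 75.0, outliers == {4: 400}; block 4 is reported exactly, every other
# block is estimated by the average, keeping the summary compact while bounding
# the error for the block that actually matters.
{code}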

> Fetch big blocks to disk when shuffle-read
> --
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
> Attachments: SPARK-19659-design-v1.pdf
>
>
> Currently the whole block is fetched into memory (off-heap by default) during 
> shuffle-read. A block is defined by (shuffleId, mapId, reduceId), so it can be 
> large in skewed situations. If an OOM happens during shuffle read, the job is 
> killed and users are notified to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating 
> more memory can resolve the OOM, but that approach is not well suited to 
> production environments, especially data warehouses.
> Using Spark SQL as the data engine in a warehouse, users hope for a unified 
> parameter (e.g. memory) with less resource wasted (resource that is allocated 
> but not used).
> It's not always easy to predict skew; when it happens, it makes sense to fetch 
> remote blocks to disk for the shuffle-read rather than kill the job with an 
> OOM. This approach was mentioned during the discussion in SPARK-3019, by 
> [~sandyr] and [~mridulm80].
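For reference, a minimal sketch of the stop-gap the error message points at today
(values are illustrative, not recommendations; on a YARN cluster the master is
normally supplied by spark-submit):

{code}
from pyspark.sql import SparkSession

# Raise the YARN off-heap headroom so skewed shuffle blocks fit in memory.
# spark.yarn.executor.memoryOverhead is the Spark 2.x name; the value is in MiB.
spark = (SparkSession.builder
         .appName("skewed-shuffle-workaround")
         .config("spark.yarn.executor.memoryOverhead", "4096")
         .config("spark.executor.memory", "8g")
         .getOrCreate())
{code}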



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19701) the `in` operator in pyspark is broken

2017-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19701:


Assignee: (was: Apache Spark)

> the `in` operator in pyspark is broken
> --
>
> Key: SPARK-19701
> URL: https://issues.apache.org/jira/browse/SPARK-19701
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>
> {code}
> >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md")
> >>> linesWithSpark = textFile.filter("Spark" in textFile.value)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, 
> in __nonzero__
> raise ValueError("Cannot convert column into bool: please use '&' for 
> 'and', '|' for 'or', "
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' 
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
> {code}
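For reference, the usual workaround is to build a Column expression instead of
using Python's `in`, which goes through `__contains__`/`__nonzero__` and cannot
return a Column. A short sketch, with an illustrative path:

{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[1]").getOrCreate()
textFile = spark.read.text("README.md")  # illustrative path

# Column.contains builds a SQL expression instead of asking Python for a bool.
linesWithSpark = textFile.filter(col("value").contains("Spark"))
print(linesWithSpark.count())
{code}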



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19701) the `in` operator in pyspark is broken

2017-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895494#comment-15895494
 ] 

Apache Spark commented on SPARK-19701:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17160

> the `in` operator in pyspark is broken
> --
>
> Key: SPARK-19701
> URL: https://issues.apache.org/jira/browse/SPARK-19701
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>
> {code}
> >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md")
> >>> linesWithSpark = textFile.filter("Spark" in textFile.value)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, 
> in __nonzero__
> raise ValueError("Cannot convert column into bool: please use '&' for 
> 'and', '|' for 'or', "
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' 
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19701) the `in` operator in pyspark is broken

2017-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19701:


Assignee: Apache Spark

> the `in` operator in pyspark is broken
> --
>
> Key: SPARK-19701
> URL: https://issues.apache.org/jira/browse/SPARK-19701
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
> {code}
> >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md")
> >>> linesWithSpark = textFile.filter("Spark" in textFile.value)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, 
> in __nonzero__
> raise ValueError("Cannot convert column into bool: please use '&' for 
> 'and', '|' for 'or', "
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' 
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level

2017-03-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-19816:
-
Fix Version/s: 2.1.1

> DataFrameCallbackSuite doesn't recover the log level
> 
>
> Key: SPARK-19816
> URL: https://issues.apache.org/jira/browse/SPARK-19816
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> "DataFrameCallbackSuite.execute callback functions when a DataFrame action 
> failed" sets the log level to "fatal" but doesn't recover it. Hence, tests 
> running after it won't output any logs except fatal logs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level

2017-03-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19816:

Fix Version/s: 2.2.0

> DataFrameCallbackSuite doesn't recover the log level
> 
>
> Key: SPARK-19816
> URL: https://issues.apache.org/jira/browse/SPARK-19816
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>
> "DataFrameCallbackSuite.execute callback functions when a DataFrame action 
> failed" sets the log level to "fatal" but doesn't recover it. Hence, tests 
> running after it won't output any logs except fatal logs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level

2017-03-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19816.
-
Resolution: Fixed

> DataFrameCallbackSuite doesn't recover the log level
> 
>
> Key: SPARK-19816
> URL: https://issues.apache.org/jira/browse/SPARK-19816
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> "DataFrameCallbackSuite.execute callback functions when a DataFrame action 
> failed" sets the log level to "fatal" but doesn't recover it. Hence, tests 
> running after it won't output any logs except fatal logs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19818) SparkR union should check for name consistency of input data frames

2017-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19818:


Assignee: (was: Apache Spark)

> SparkR union should check for name consistency of input data frames 
> 
>
> Key: SPARK-19818
> URL: https://issues.apache.org/jira/browse/SPARK-19818
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Priority: Minor
>
> The current implementation accepts data frames with different schemas. See 
> issues below:
> {code}
> df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"), age = 
> c(1, 30, 19)))
> union(df, df[, c(2, 1)])
>      name     age
> 1 Michael     1.0
> 2    Andy    30.0
> 3  Justin    19.0
> 4     1.0 Michael
> {code}
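A PySpark analogue of the same pitfall, sketched here only to show the user-side
guard (realign columns by name before the union); the data mirrors the SparkR
snippet above and is not the SparkR fix itself:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame(
    [("Michael", 1.0), ("Andy", 30.0), ("Justin", 19.0)], ["name", "age"])
reordered = df.select("age", "name")

# union() resolves columns by position, so realign by name before unioning.
aligned = df.union(reordered.select(*df.columns))
aligned.show()
{code}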



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19818) SparkR union should check for name consistency of input data frames

2017-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19818:


Assignee: Apache Spark

> SparkR union should check for name consistency of input data frames 
> 
>
> Key: SPARK-19818
> URL: https://issues.apache.org/jira/browse/SPARK-19818
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> The current implementation accepts data frames with different schemas. See 
> issues below:
> {code}
> df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"), age = 
> c(1, 30, 19)))
> union(df, df[, c(2, 1)])
>      name     age
> 1 Michael     1.0
> 2    Andy    30.0
> 3  Justin    19.0
> 4     1.0 Michael
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19818) SparkR union should check for name consistency of input data frames

2017-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895468#comment-15895468
 ] 

Apache Spark commented on SPARK-19818:
--

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/17159

> SparkR union should check for name consistency of input data frames 
> 
>
> Key: SPARK-19818
> URL: https://issues.apache.org/jira/browse/SPARK-19818
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Priority: Minor
>
> The current implementation accepts data frames with different schemas. See 
> issues below:
> {code}
> df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"), age = 
> c(1, 30, 19)))
> union(df, df[, c(2, 1)])
>      name     age
> 1 Michael     1.0
> 2    Andy    30.0
> 3  Justin    19.0
> 4     1.0 Michael
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19818) SparkR union should check for name consistency of input data frames

2017-03-03 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-19818:
---

 Summary: SparkR union should check for name consistency of input 
data frames 
 Key: SPARK-19818
 URL: https://issues.apache.org/jira/browse/SPARK-19818
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Wayne Zhang
Priority: Minor


The current implementation accepts data frames with different schemas. See 
issues below:
{code}
df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"), age = 
c(1, 30, 19)))
union(df, df[, c(2, 1)])
     name     age
1 Michael     1.0
2    Andy    30.0
3  Justin    19.0
4     1.0 Michael
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19804) HiveClientImpl does not work with Hive 2.2.0 metastore

2017-03-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19804.
-
   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.2.0

> HiveClientImpl does not work with Hive 2.2.0 metastore
> --
>
> Key: SPARK-19804
> URL: https://issues.apache.org/jira/browse/SPARK-19804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.2.0
>
>
> I know that Spark currently does not officially support Hive 2.2 (perhaps 
> because it hasn't been released yet); but we have some 2.2 patches in CDH and 
> the current code in the isolated client fails. The most likely culprit is the 
> set of changes added in HIVE-13149.
> The fix is simple, and here's the patch we applied in CDH:
> https://github.com/cloudera/spark/commit/954f060afe6ed469e85d656abd02790a79ec07a0
> Fixing that doesn't affect any existing Hive version support, but will make 
> it easier to support 2.2 when it's out.
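For context, the metastore client version is selected through configuration; a
hedged sketch (the exact set of accepted versions depends on the Spark release,
and a "2.2" value would only be accepted once support like this lands):

{code}
from pyspark.sql import SparkSession

# Point the isolated client at a specific metastore version; values illustrative.
spark = (SparkSession.builder
         .config("spark.sql.hive.metastore.version", "1.2.1")
         .config("spark.sql.hive.metastore.jars", "maven")
         .enableHiveSupport()
         .getOrCreate())
{code}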



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19804) HiveClientImpl does not work with Hive 2.2.0 metastore

2017-03-03 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895459#comment-15895459
 ] 

Xiao Li edited comment on SPARK-19804 at 3/4/17 2:47 AM:
-

Resolved by https://github.com/apache/spark/pull/17154


was (Author: smilegator):
https://github.com/apache/spark/pull/17154

> HiveClientImpl does not work with Hive 2.2.0 metastore
> --
>
> Key: SPARK-19804
> URL: https://issues.apache.org/jira/browse/SPARK-19804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.2.0
>
>
> I know that Spark currently does not officially support Hive 2.2 (perhaps 
> because it hasn't been released yet); but we have some 2.2 patches in CDH and 
> the current code in the isolated client fails. The most likely culprit is the 
> set of changes added in HIVE-13149.
> The fix is simple, and here's the patch we applied in CDH:
> https://github.com/cloudera/spark/commit/954f060afe6ed469e85d656abd02790a79ec07a0
> Fixing that doesn't affect any existing Hive version support, but will make 
> it easier to support 2.2 when it's out.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19804) HiveClientImpl does not work with Hive 2.2.0 metastore

2017-03-03 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895459#comment-15895459
 ] 

Xiao Li commented on SPARK-19804:
-

https://github.com/apache/spark/pull/17154

> HiveClientImpl does not work with Hive 2.2.0 metastore
> --
>
> Key: SPARK-19804
> URL: https://issues.apache.org/jira/browse/SPARK-19804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> I know that Spark currently does not officially support Hive 2.2 (perhaps 
> because it hasn't been released yet); but we have some 2.2 patches in CDH and 
> the current code in the isolated client fails. The most likely culprit is the 
> set of changes added in HIVE-13149.
> The fix is simple, and here's the patch we applied in CDH:
> https://github.com/cloudera/spark/commit/954f060afe6ed469e85d656abd02790a79ec07a0
> Fixing that doesn't affect any existing Hive version support, but will make 
> it easier to support 2.2 when it's out.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2017-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895452#comment-15895452
 ] 

Apache Spark commented on SPARK-16845:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17157

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: hejie
>Assignee: Liwei Lin
> Fix For: 2.1.1, 2.2.0
>
> Attachments: error.txt.zip
>
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs.
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2017-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895451#comment-15895451
 ] 

Apache Spark commented on SPARK-16845:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17158

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: hejie
>Assignee: Liwei Lin
> Fix For: 2.1.1, 2.2.0
>
> Attachments: error.txt.zip
>
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs.
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19659) Fetch big blocks to disk when shuffle-read

2017-03-03 Thread jin xing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895411#comment-15895411
 ] 

jin xing edited comment on SPARK-19659 at 3/4/17 2:11 AM:
--

[~rxin]
Thanks a lot for the comment.
Tracking the average size and also the outliers is a good idea.
But there can be multiple huge blocks creating too much pressure (e.g. 10% of 
the blocks are much bigger than the other 90%), and it is a bit hard to decide 
how many outliers we should track.
If we track too many outliers, *MapStatus* can cost too much memory.
I think the benefit of tracking the max for each N/2000 consecutive blocks is 
that we can keep *MapStatus* from costing too much memory (at most around 
2000 bytes after compression) while keeping all outliers under control. Do you 
think it's worth trying?


was (Author: jinxing6...@126.com):
[~rxin]
Thanks a lot for comment.
Tracking average size and also the outliers is a good idea.
But there can be multiple huge blocks creating too much pressure(e.g. there are 
10% blocks much bigger than they other 90%) and it is a little bit hard to 
decide how many outliers we should track. 
If we track too many outliers, *MapStatus* can cost too much memory.
I think the benefit of tracking the max for each N/2000 consecutive blocks is 
that we can avoid having *MapStatus* cost too much memory(at most around 
2000Bytes).

> Fetch big blocks to disk when shuffle-read
> --
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
> Attachments: SPARK-19659-design-v1.pdf
>
>
> Currently the whole block is fetched into memory (off-heap by default) during 
> shuffle-read. A block is defined by (shuffleId, mapId, reduceId), so it can be 
> large in skewed situations. If an OOM happens during shuffle read, the job is 
> killed and users are notified to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating 
> more memory can resolve the OOM, but that approach is not well suited to 
> production environments, especially data warehouses.
> Using Spark SQL as the data engine in a warehouse, users hope for a unified 
> parameter (e.g. memory) with less resource wasted (resource that is allocated 
> but not used).
> It's not always easy to predict skew; when it happens, it makes sense to fetch 
> remote blocks to disk for the shuffle-read rather than kill the job with an 
> OOM. This approach was mentioned during the discussion in SPARK-3019, by 
> [~sandyr] and [~mridulm80].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read

2017-03-03 Thread jin xing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895411#comment-15895411
 ] 

jin xing commented on SPARK-19659:
--

[~rxin]
Thanks a lot for the comment.
Tracking the average size and also the outliers is a good idea.
But there can be multiple huge blocks creating too much pressure (e.g. 10% of 
the blocks are much bigger than the other 90%), and it is a bit hard to decide 
how many outliers we should track.
If we track too many outliers, *MapStatus* can cost too much memory.
I think the benefit of tracking the max for each N/2000 consecutive blocks is 
that we can keep *MapStatus* from costing too much memory (at most around 
2000 bytes).

> Fetch big blocks to disk when shuffle-read
> --
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
> Attachments: SPARK-19659-design-v1.pdf
>
>
> Currently the whole block is fetched into memory (off-heap by default) during 
> shuffle-read. A block is defined by (shuffleId, mapId, reduceId), so it can be 
> large in skewed situations. If an OOM happens during shuffle read, the job is 
> killed and users are notified to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating 
> more memory can resolve the OOM, but that approach is not well suited to 
> production environments, especially data warehouses.
> Using Spark SQL as the data engine in a warehouse, users hope for a unified 
> parameter (e.g. memory) with less resource wasted (resource that is allocated 
> but not used).
> It's not always easy to predict skew; when it happens, it makes sense to fetch 
> remote blocks to disk for the shuffle-read rather than kill the job with an 
> OOM. This approach was mentioned during the discussion in SPARK-3019, by 
> [~sandyr] and [~mridulm80].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19817) make it clear that `timeZone` option is a general option in DataFrameReader/Writer

2017-03-03 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19817:

Description: As timezone setting can also affect partition values, it works 
for all formats, we should make it clear.  (was: As timezone setting can also 
affect partition values, it doesn't make sense that we only support timezone 
options for JSON and CSV in `DataFrameReader/Writer`, we should support all 
formats.)

> make it clear that `timeZone` option is a general option in 
> DataFrameReader/Writer
> --
>
> Key: SPARK-19817
> URL: https://issues.apache.org/jira/browse/SPARK-19817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Takuya Ueshin
> Fix For: 2.2.0
>
>
> As the timezone setting can also affect partition values, it applies to all 
> formats; we should make that clear.
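A short PySpark sketch of what "general option" means in practice (paths and
time zone names are illustrative):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# The same `timeZone` option can be passed to any DataFrameReader/Writer format,
# not just JSON and CSV; it also affects how partition values are interpreted.
df = (spark.read
      .option("timeZone", "America/Los_Angeles")
      .csv("/tmp/events"))
(df.write
   .mode("overwrite")
   .option("timeZone", "America/Los_Angeles")
   .parquet("/tmp/events_out"))
{code}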



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19817) make it clear that `timeZone` option is a general option in DataFrameReader/Writer

2017-03-03 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19817:

Summary: make it clear that `timeZone` option is a general option in 
DataFrameReader/Writer  (was: support timeZone option for all formats in 
`DataFrameReader/Writer`)

> make it clear that `timeZone` option is a general option in 
> DataFrameReader/Writer
> --
>
> Key: SPARK-19817
> URL: https://issues.apache.org/jira/browse/SPARK-19817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Takuya Ueshin
> Fix For: 2.2.0
>
>
> As the timezone setting can also affect partition values, it doesn't make 
> sense that we only support the timezone option for JSON and CSV in 
> `DataFrameReader/Writer`; we should support it for all formats.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19817) support timeZone option for all formats in `DataFrameReader/Writer`

2017-03-03 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-19817:
---

 Summary: support timeZone option for all formats in 
`DataFrameReader/Writer`
 Key: SPARK-19817
 URL: https://issues.apache.org/jira/browse/SPARK-19817
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Takuya Ueshin


As the timezone setting can also affect partition values, it doesn't make sense 
that we only support the timezone option for JSON and CSV in 
`DataFrameReader/Writer`; we should support it for all formats.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-18350) Support session local timezone

2017-03-03 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-18350:
-

> Support session local timezone
> --
>
> Key: SPARK-18350
> URL: https://issues.apache.org/jira/browse/SPARK-18350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Takuya Ueshin
>  Labels: releasenotes
> Fix For: 2.2.0
>
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for 
> execution.
> An explicit non-goal is locale handling.
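For reference, the session-local timezone is exposed as a SQL conf; a short
sketch (the zone name is illustrative):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

# Timestamps are now rendered in the session timezone instead of the machine's.
spark.sql("SELECT from_unixtime(0) AS ts").show()
{code}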



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19718) Fix flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite: stress test for failOnDataLoss=false

2017-03-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19718.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.2.0

> Fix flaky test: 
> org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite: 
> stress test for failOnDataLoss=false
> ---
>
> Key: SPARK-19718
> URL: https://issues.apache.org/jira/browse/SPARK-19718
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>
> SPARK-19617 changed HDFSMetadataLog to enable interrupts when using the local 
> file system. However, now we hit HADOOP-12074: `Shell.runCommand` converts 
> `InterruptedException` to `new IOException(ie.toString())` before Hadoop 2.8.
> Test failure: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/2504/consoleFull
> {code}
> [info] - stress test for failOnDataLoss=false *** FAILED *** (1 minute, 1 
> second)
> [info]   org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 27d45f4f-14dc-4c74-8b52-4bbd4f2b9bec, runId = 
> 23b8c1ea-4da9-4096-967a-692933e4b319] terminated with exception: 
> java.lang.InterruptedException
> [info]   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:304)
> [info]   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:190)
> [info]   Cause: java.io.IOException: java.lang.InterruptedException
> [info]   at org.apache.hadoop.util.Shell.runCommand(Shell.java:578)
> [info]   at org.apache.hadoop.util.Shell.run(Shell.java:478)
> [info]   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:766)
> [info]   at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
> [info]   at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
> [info]   at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
> [info]   at 
> org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:300)
> [info]   at 
> org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1014)
> [info]   at 
> org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
> [info]   at 
> org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.(ChecksumFs.java:354)
> [info]   at 
> org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:394)
> [info]   at 
> org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
> [info]   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:680)
> [info]   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:676)
> [info]   at 
> org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
> [info]   at org.apache.hadoop.fs.FileContext.create(FileContext.java:676)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level

2017-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19816:


Assignee: Apache Spark  (was: Shixiong Zhu)

> DataFrameCallbackSuite doesn't recover the log level
> 
>
> Key: SPARK-19816
> URL: https://issues.apache.org/jira/browse/SPARK-19816
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> "DataFrameCallbackSuite.execute callback functions when a DataFrame action 
> failed" sets the log level to "fatal" but doesn't recover it. Hence, tests 
> running after it won't output any logs except fatal logs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level

2017-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19816:


Assignee: Shixiong Zhu  (was: Apache Spark)

> DataFrameCallbackSuite doesn't recover the log level
> 
>
> Key: SPARK-19816
> URL: https://issues.apache.org/jira/browse/SPARK-19816
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> "DataFrameCallbackSuite.execute callback functions when a DataFrame action 
> failed" sets the log level to "fatal" but doesn't recover it. Hence, tests 
> running after it won't output any logs except fatal logs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level

2017-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895348#comment-15895348
 ] 

Apache Spark commented on SPARK-19816:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17156

> DataFrameCallbackSuite doesn't recover the log level
> 
>
> Key: SPARK-19816
> URL: https://issues.apache.org/jira/browse/SPARK-19816
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> "DataFrameCallbackSuite.execute callback functions when a DataFrame action 
> failed" sets the log level to "fatal" but doesn't recover it. Hence, tests 
> running after it won't output any logs except fatal logs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19811) sparksql 2.1 can not prune hive partition

2017-03-03 Thread sydt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895346#comment-15895346
 ] 

sydt edited comment on SPARK-19811 at 3/4/17 1:04 AM:
--

This is not a problem, because it can be resolved by changing the partition 
predicate "DAY_ID='20170212' AND PROV_ID ='842'" to lower case.


was (Author: wangchao2017):
this is not a problem because it can be resolved by change partition 
information "DAY_ID='20170212' AND PROV_ID ='842'"  to lower spell.

> sparksql 2.1 can not prune hive partition 
> --
>
> Key: SPARK-19811
> URL: https://issues.apache.org/jira/browse/SPARK-19811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: sydt
>
> When Spark SQL 2.1 executes a query, it fails with:
> java.lang.RuntimeException: Expected only partition pruning predicates: 
> (isnotnull(DAY_ID#216) && (DAY_ID#216 = 20170212))
> The SQL statement is: select PROD_INST_ID from CRM_DB.ITG_PROD_INST WHERE 
> DAY_ID='20170212' AND PROV_ID ='842' limit 10; where DAY_ID and PROV_ID are 
> partition columns in Hive.
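For reference, the workaround described in the comment above, written out (table
and column names are the ones from the report):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Referencing the partition columns in lower case lets the predicate be
# recognised as a partition-pruning predicate.
spark.sql("""
    SELECT PROD_INST_ID
    FROM CRM_DB.ITG_PROD_INST
    WHERE day_id = '20170212' AND prov_id = '842'
    LIMIT 10
""").show()
{code}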



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19811) sparksql 2.1 can not prune hive partition

2017-03-03 Thread sydt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895346#comment-15895346
 ] 

sydt commented on SPARK-19811:
--

This is not a problem, because it can be resolved by changing the partition 
predicate "DAY_ID='20170212' AND PROV_ID ='842'" to lower case.

> sparksql 2.1 can not prune hive partition 
> --
>
> Key: SPARK-19811
> URL: https://issues.apache.org/jira/browse/SPARK-19811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: sydt
>
> When Spark SQL 2.1 executes a query, it fails with:
> java.lang.RuntimeException: Expected only partition pruning predicates: 
> (isnotnull(DAY_ID#216) && (DAY_ID#216 = 20170212))
> The SQL statement is: select PROD_INST_ID from CRM_DB.ITG_PROD_INST WHERE 
> DAY_ID='20170212' AND PROV_ID ='842' limit 10; where DAY_ID and PROV_ID are 
> partition columns in Hive.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19816) DataFrameCallbackSuite doesn't recover the log level

2017-03-03 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19816:


 Summary: DataFrameCallbackSuite doesn't recover the log level
 Key: SPARK-19816
 URL: https://issues.apache.org/jira/browse/SPARK-19816
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 2.2.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


"DataFrameCallbackSuite.execute callback functions when a DataFrame action 
failed" sets the log level to "fatal" but doesn't recover it. Hence, tests 
running after it won't output any logs except fatal logs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2017-03-03 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-13446:
---

Assignee: Xiao Li

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly, but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been 
> released, it would be better to upgrade to support Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2017-03-03 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-13446.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17061
[https://github.com/apache/spark/pull/17061]

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
> Fix For: 2.2.0
>
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly, but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been 
> released, it would be better to upgrade to support Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use

2017-03-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19348:
--
Fix Version/s: 2.2.0

> pyspark.ml.Pipeline gets corrupted under multi threaded use
> ---
>
> Key: SPARK-19348
> URL: https://issues.apache.org/jira/browse/SPARK-19348
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Vinayak Joshi
>Assignee: Bryan Cutler
> Fix For: 2.2.0
>
> Attachments: pyspark_pipeline_threads.py
>
>
> When pyspark.ml.Pipeline objects are constructed concurrently in separate 
> python threads, it is observed that the stages used to construct a pipeline 
> object get corrupted i.e the stages supplied to a Pipeline object in one 
> thread appear inside a different Pipeline object constructed in a different 
> thread. 
> Things work fine if construction of pyspark.ml.Pipeline objects is 
> serialized, so this looks like a thread safety problem with 
> pyspark.ml.Pipeline object construction. 
> Confirmed that the problem exists with Spark 1.6.x as well as 2.x.
> While the corruption of the Pipeline stages is easily caught, we need to know 
> whether other pipeline operations, such as pyspark.ml.pipeline.fit(), are also 
> affected by the underlying cause of this problem. That is, whether operations 
> like pyspark.ml.pipeline.fit() may be performed in separate threads (on 
> distinct pipeline objects) concurrently without any cross contamination 
> between them.
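Until the underlying cause is fixed, a user-side mitigation consistent with the
observation above is to serialize Pipeline construction; a hedged sketch (stage
choices and column names are illustrative):

{code}
import threading

from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
_pipeline_lock = threading.Lock()

def build_pipeline(input_col):
    # Construct one Pipeline at a time so stages cannot leak across threads.
    with _pipeline_lock:
        tok = Tokenizer(inputCol=input_col, outputCol=input_col + "_tokens")
        tf = HashingTF(inputCol=tok.getOutputCol(), outputCol=input_col + "_tf")
        return Pipeline(stages=[tok, tf])

pipelines = []
threads = [threading.Thread(target=lambda c=c: pipelines.append(build_pipeline(c)))
           for c in ("title", "body")]
for t in threads:
    t.start()
for t in threads:
    t.join()
{code}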



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18350) Support session local timezone

2017-03-03 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18350.
-
   Resolution: Fixed
 Assignee: Takuya Ueshin
Fix Version/s: 2.2.0

> Support session local timezone
> --
>
> Key: SPARK-18350
> URL: https://issues.apache.org/jira/browse/SPARK-18350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Takuya Ueshin
>  Labels: releasenotes
> Fix For: 2.2.0
>
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for 
> execution.
> An explicit non-goal is locale handling.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18939) Timezone support in partition values.

2017-03-03 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-18939:
---

Assignee: Takuya Ueshin

> Timezone support in partition values.
> -
>
> Key: SPARK-18939
> URL: https://issues.apache.org/jira/browse/SPARK-18939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.2.0
>
>
> We should also use session local timezone to interpret partition values.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18939) Timezone support in partition values.

2017-03-03 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18939.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17053
[https://github.com/apache/spark/pull/17053]

> Timezone support in partition values.
> -
>
> Key: SPARK-18939
> URL: https://issues.apache.org/jira/browse/SPARK-18939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Takuya Ueshin
> Fix For: 2.2.0
>
>
> We should also use session local timezone to interpret partition values.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19815) Not orderable should be applied to right key instead of left key

2017-03-03 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-19815:
---
Summary: Not orderable should be applied to right key instead of left key  
(was: Not order able should be applied to right key instead of left key)

> Not orderable should be applied to right key instead of left key
> 
>
> Key: SPARK-19815
> URL: https://issues.apache.org/jira/browse/SPARK-19815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Priority: Minor
>
> When generating ShuffledHashJoinExec, the orderable condition should be 
> applied to right key instead of left key.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19815) Not order able should be applied to right key instead of left key

2017-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19815:


Assignee: (was: Apache Spark)

> Not order able should be applied to right key instead of left key
> -
>
> Key: SPARK-19815
> URL: https://issues.apache.org/jira/browse/SPARK-19815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Priority: Minor
>
> When generating ShuffledHashJoinExec, the orderable condition should be 
> applied to right key instead of left key.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19815) Not order able should be applied to right key instead of left key

2017-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895250#comment-15895250
 ] 

Apache Spark commented on SPARK-19815:
--

User 'zhzhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17155

> Not order able should be applied to right key instead of left key
> -
>
> Key: SPARK-19815
> URL: https://issues.apache.org/jira/browse/SPARK-19815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Priority: Minor
>
> When generating ShuffledHashJoinExec, the orderable condition should be 
> applied to right key instead of left key.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19815) Not order able should be applied to right key instead of left key

2017-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19815:


Assignee: Apache Spark

> Not order able should be applied to right key instead of left key
> -
>
> Key: SPARK-19815
> URL: https://issues.apache.org/jira/browse/SPARK-19815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> When generating ShuffledHashJoinExec, the orderable condition should be 
> applied to the right key instead of the left key.






[jira] [Created] (SPARK-19815) Not order able should be applied to right key instead of left key

2017-03-03 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-19815:
--

 Summary: Not order able should be applied to right key instead of 
left key
 Key: SPARK-19815
 URL: https://issues.apache.org/jira/browse/SPARK-19815
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Zhan Zhang
Priority: Minor


When generating ShuffledHashJoinExec, the orderable condition should be applied 
to the right key instead of the left key.






[jira] [Commented] (SPARK-19084) conditional function: field

2017-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895154#comment-15895154
 ] 

Apache Spark commented on SPARK-19084:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17154

> conditional function: field
> ---
>
> Key: SPARK-19084
> URL: https://issues.apache.org/jira/browse/SPARK-19084
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Chenzhao Guo
>
> field(str, str1, str2, ... ) is a variable-length (>= 2 arguments) function which 
> returns the index of str in the list (str1, str2, ... ), or 0 if not found.
> Every parameter is required to be a subtype of AtomicType.
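As a quick illustration of the intended semantics only (plain Python, not the
Spark implementation; the function name is just borrowed from the description),
FIELD can be sketched as:

{code}
# Sketch of the intended FIELD semantics: return the 1-based position of
# `value` among `candidates`, or 0 if it does not appear.
def field(value, *candidates):
    for position, candidate in enumerate(candidates, start=1):
        if value == candidate:
            return position
    return 0

assert field("b", "a", "b", "c") == 2   # found at position 2
assert field("z", "a", "b", "c") == 0   # not found
{code}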






[jira] [Assigned] (SPARK-19813) maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource

2017-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19813:


Assignee: Burak Yavuz  (was: Apache Spark)

> maxFilesPerTrigger combo latestFirst may miss old files in combination with 
> maxFileAge in FileStreamSource
> --
>
> Key: SPARK-19813
> URL: https://issues.apache.org/jira/browse/SPARK-19813
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> There is a file stream source option called maxFileAge which limits how old 
> the files can be, relative to the latest file that has been seen. This is used 
> to limit the files that need to be remembered as "processed". Files older 
> than the latest processed files are ignored. This value defaults to 7 days.
> This causes a problem when both 
>  - latestFirst = true
>  - maxFilesPerTrigger > total files to be processed.
> Here is what happens in all combinations:
>  1) latestFirst = false - Since files are processed in order, there won't be 
> any unprocessed file older than the latest processed file. All files will be 
> processed.
>  2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge 
> thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is 
> not set, then all old files get processed in the first batch, and so no file is 
> left behind.
>  3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch 
> processes the latest X files. That sets the threshold to (latest file - 
> maxFileAge), so files older than this threshold will never be considered for 
> processing. 
> The bug is with case 3.






[jira] [Commented] (SPARK-19813) maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource

2017-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895110#comment-15895110
 ] 

Apache Spark commented on SPARK-19813:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/17153

> maxFilesPerTrigger combo latestFirst may miss old files in combination with 
> maxFileAge in FileStreamSource
> --
>
> Key: SPARK-19813
> URL: https://issues.apache.org/jira/browse/SPARK-19813
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> There is a file stream source option called maxFileAge which limits how old 
> the files can be, relative to the latest file that has been seen. This is used 
> to limit the files that need to be remembered as "processed". Files older 
> than the latest processed files are ignored. This value defaults to 7 days.
> This causes a problem when both 
>  - latestFirst = true
>  - maxFilesPerTrigger > total files to be processed.
> Here is what happens in all combinations:
>  1) latestFirst = false - Since files are processed in order, there won't be 
> any unprocessed file older than the latest processed file. All files will be 
> processed.
>  2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge 
> thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is 
> not set, then all old files get processed in the first batch, and so no file is 
> left behind.
>  3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch 
> processes the latest X files. That sets the threshold to (latest file - 
> maxFileAge), so files older than this threshold will never be considered for 
> processing. 
> The bug is with case 3.






[jira] [Assigned] (SPARK-19813) maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource

2017-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19813:


Assignee: Apache Spark  (was: Burak Yavuz)

> maxFilesPerTrigger combo latestFirst may miss old files in combination with 
> maxFileAge in FileStreamSource
> --
>
> Key: SPARK-19813
> URL: https://issues.apache.org/jira/browse/SPARK-19813
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> There is a file stream source option called maxFileAge which limits how old 
> the files can be, relative to the latest file that has been seen. This is used 
> to limit the files that need to be remembered as "processed". Files older 
> than the latest processed files are ignored. This value defaults to 7 days.
> This causes a problem when both 
>  - latestFirst = true
>  - maxFilesPerTrigger > total files to be processed.
> Here is what happens in all combinations:
>  1) latestFirst = false - Since files are processed in order, there won't be 
> any unprocessed file older than the latest processed file. All files will be 
> processed.
>  2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge 
> thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is 
> not set, then all old files get processed in the first batch, and so no file is 
> left behind.
>  3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch 
> processes the latest X files. That sets the threshold to (latest file - 
> maxFileAge), so files older than this threshold will never be considered for 
> processing. 
> The bug is with case 3.






[jira] [Commented] (SPARK-19814) Spark History Server Out Of Memory / Extreme GC

2017-03-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894952#comment-15894952
 ] 

Sean Owen commented on SPARK-19814:
---

Yes, that already describes further optimizations. I would close this as a 
duplicate, at least if you're not showing a memory leak. 

> Spark History Server Out Of Memory / Extreme GC
> ---
>
> Key: SPARK-19814
> URL: https://issues.apache.org/jira/browse/SPARK-19814
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.1, 2.0.0, 2.1.0
> Environment: Spark History Server (we've run it on several different 
> Hadoop distributions)
>Reporter: Simon King
> Attachments: SparkHistoryCPUandRAM.png
>
>
> Spark History Server runs out of memory, gets into GC thrashing and eventually 
> becomes unresponsive. This seems to happen more quickly with heavy use of the 
> REST API. We've seen this with several versions of Spark. 
> Running with the following settings (spark 2.1):
> spark.history.fs.cleaner.enabled    true
> spark.history.fs.cleaner.interval   1d
> spark.history.fs.cleaner.maxAge     7d
> spark.history.retainedApplications  500
> We will eventually get errors like:
> 17/02/25 05:02:19 WARN ServletHandler:
> javax.servlet.ServletException: scala.MatchError: java.lang.OutOfMemoryError: 
> GC overhead limit exceeded (of class java.lang.OutOfMemoryError)
>   at 
> org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489)
>   at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:529)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.spark_project.jetty.server.Server.handle(Server.java:499)
>   at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>   at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit 
> exceeded (of class java.lang.OutOfMemoryError)
>   at 
> org.apache.spark.deploy.history.ApplicationCache.getSparkUI(ApplicationCache.scala:148)
>   at 
> org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:110)
>   at 
> org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:244)
>   at 
> org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:49)
>   at 
> org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66)
>   at sun.reflect.GeneratedMethodAccessor102.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter$1.run(SubResourceLocatorRouter.java:158)
>   at 
> org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.getResource(SubResourceLocatorRouter.java:178)
>   at 
> org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.apply(SubResourceLocatorRouter.java:109)
>   at 
> org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:109)
>   at 
> org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112)
>   at 
> org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112)
>   at 
> 

[jira] [Commented] (SPARK-19814) Spark History Server Out Of Memory / Extreme GC

2017-03-03 Thread Simon King (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894948#comment-15894948
 ] 

Simon King commented on SPARK-19814:


Sean, I think that giving more memory only delays the problem, but we will 
experiment more with larger heap settings. We're just starting to look into this 
issue and hoping for early help with diagnosing it or configuring around it. Hope 
there's a simpler fix than the major overhaul proposed here: 
https://issues.apache.org/jira/browse/SPARK-18085

> Spark History Server Out Of Memory / Extreme GC
> ---
>
> Key: SPARK-19814
> URL: https://issues.apache.org/jira/browse/SPARK-19814
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.1, 2.0.0, 2.1.0
> Environment: Spark History Server (we've run it on several different 
> Hadoop distributions)
>Reporter: Simon King
> Attachments: SparkHistoryCPUandRAM.png
>
>
> Spark History Server runs out of memory, gets into GC thrashing and eventually 
> becomes unresponsive. This seems to happen more quickly with heavy use of the 
> REST API. We've seen this with several versions of Spark. 
> Running with the following settings (spark 2.1):
> spark.history.fs.cleaner.enabled    true
> spark.history.fs.cleaner.interval   1d
> spark.history.fs.cleaner.maxAge     7d
> spark.history.retainedApplications  500
> We will eventually get errors like:
> 17/02/25 05:02:19 WARN ServletHandler:
> javax.servlet.ServletException: scala.MatchError: java.lang.OutOfMemoryError: 
> GC overhead limit exceeded (of class java.lang.OutOfMemoryError)
>   at 
> org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489)
>   at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:529)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.spark_project.jetty.server.Server.handle(Server.java:499)
>   at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>   at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit 
> exceeded (of class java.lang.OutOfMemoryError)
>   at 
> org.apache.spark.deploy.history.ApplicationCache.getSparkUI(ApplicationCache.scala:148)
>   at 
> org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:110)
>   at 
> org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:244)
>   at 
> org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:49)
>   at 
> org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66)
>   at sun.reflect.GeneratedMethodAccessor102.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter$1.run(SubResourceLocatorRouter.java:158)
>   at 
> org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.getResource(SubResourceLocatorRouter.java:178)
>   at 
> org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.apply(SubResourceLocatorRouter.java:109)
>   at 
> org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:109)
>   at 
> 

[jira] [Commented] (SPARK-19814) Spark History Server Out Of Memory / Extreme GC

2017-03-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894942#comment-15894942
 ] 

Sean Owen commented on SPARK-19814:
---

I'm not sure if this is a bug. It depends on how much memory you give it and 
how much data the history server stores. 4G may not be enough; increase that?
Unless it's a memory leak or an obviously oversized data structure, I don't 
think it's a bug, but if you have a concrete optimization, you can open a pull 
request.

> Spark History Server Out Of Memory / Extreme GC
> ---
>
> Key: SPARK-19814
> URL: https://issues.apache.org/jira/browse/SPARK-19814
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.1, 2.0.0, 2.1.0
> Environment: Spark History Server (we've run it on several different 
> Hadoop distributions)
>Reporter: Simon King
> Attachments: SparkHistoryCPUandRAM.png
>
>
> Spark History Server runs out of memory, gets into GC thrashing and eventually 
> becomes unresponsive. This seems to happen more quickly with heavy use of the 
> REST API. We've seen this with several versions of Spark. 
> Running with the following settings (spark 2.1):
> spark.history.fs.cleaner.enabled    true
> spark.history.fs.cleaner.interval   1d
> spark.history.fs.cleaner.maxAge     7d
> spark.history.retainedApplications  500
> We will eventually get errors like:
> 17/02/25 05:02:19 WARN ServletHandler:
> javax.servlet.ServletException: scala.MatchError: java.lang.OutOfMemoryError: 
> GC overhead limit exceeded (of class java.lang.OutOfMemoryError)
>   at 
> org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489)
>   at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:529)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.spark_project.jetty.server.Server.handle(Server.java:499)
>   at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>   at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit 
> exceeded (of class java.lang.OutOfMemoryError)
>   at 
> org.apache.spark.deploy.history.ApplicationCache.getSparkUI(ApplicationCache.scala:148)
>   at 
> org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:110)
>   at 
> org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:244)
>   at 
> org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:49)
>   at 
> org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66)
>   at sun.reflect.GeneratedMethodAccessor102.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter$1.run(SubResourceLocatorRouter.java:158)
>   at 
> org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.getResource(SubResourceLocatorRouter.java:178)
>   at 
> org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.apply(SubResourceLocatorRouter.java:109)
>   at 
> org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:109)
>   at 
> org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112)
>   

[jira] [Updated] (SPARK-19814) Spark History Server Out Of Memory / Extreme GC

2017-03-03 Thread Simon King (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon King updated SPARK-19814:
---
Attachment: SparkHistoryCPUandRAM.png

Graph showing CPU usage (top) and RSS RAM (bottom). Note that the one run of the 
SHS in the middle, with a lower max heap setting, eventually spent much more CPU 
time on garbage collection.

> Spark History Server Out Of Memory / Extreme GC
> ---
>
> Key: SPARK-19814
> URL: https://issues.apache.org/jira/browse/SPARK-19814
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.1, 2.0.0, 2.1.0
> Environment: Spark History Server (we've run it on several different 
> Hadoop distributions)
>Reporter: Simon King
> Attachments: SparkHistoryCPUandRAM.png
>
>
> Spark History Server runs out of memory, gets into GC thrashing and eventually 
> becomes unresponsive. This seems to happen more quickly with heavy use of the 
> REST API. We've seen this with several versions of Spark. 
> Running with the following settings (spark 2.1):
> spark.history.fs.cleaner.enabled    true
> spark.history.fs.cleaner.interval   1d
> spark.history.fs.cleaner.maxAge     7d
> spark.history.retainedApplications  500
> We will eventually get errors like:
> 17/02/25 05:02:19 WARN ServletHandler:
> javax.servlet.ServletException: scala.MatchError: java.lang.OutOfMemoryError: 
> GC overhead limit exceeded (of class java.lang.OutOfMemoryError)
>   at 
> org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489)
>   at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:529)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>   at org.spark_project.jetty.server.Server.handle(Server.java:499)
>   at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>   at 
> org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit 
> exceeded (of class java.lang.OutOfMemoryError)
>   at 
> org.apache.spark.deploy.history.ApplicationCache.getSparkUI(ApplicationCache.scala:148)
>   at 
> org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:110)
>   at 
> org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:244)
>   at 
> org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:49)
>   at 
> org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66)
>   at sun.reflect.GeneratedMethodAccessor102.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter$1.run(SubResourceLocatorRouter.java:158)
>   at 
> org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.getResource(SubResourceLocatorRouter.java:178)
>   at 
> org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.apply(SubResourceLocatorRouter.java:109)
>   at 
> org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:109)
>   at 
> org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112)
>   at 
> org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112)
>   at 
> 

[jira] [Updated] (SPARK-19813) maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource

2017-03-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19813:
-
Target Version/s: 2.2.0

> maxFilesPerTrigger combo latestFirst may miss old files in combination with 
> maxFileAge in FileStreamSource
> --
>
> Key: SPARK-19813
> URL: https://issues.apache.org/jira/browse/SPARK-19813
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> There is a file stream source option called maxFileAge which limits how old 
> the files can be, relative to the latest file that has been seen. This is used 
> to limit the files that need to be remembered as "processed". Files older 
> than the latest processed files are ignored. This value defaults to 7 days.
> This causes a problem when both 
>  - latestFirst = true
>  - maxFilesPerTrigger > total files to be processed.
> Here is what happens in all combinations:
>  1) latestFirst = false - Since files are processed in order, there won't be 
> any unprocessed file older than the latest processed file. All files will be 
> processed.
>  2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge 
> thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is 
> not set, then all old files get processed in the first batch, and so no file is 
> left behind.
>  3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch 
> processes the latest X files. That sets the threshold to (latest file - 
> maxFileAge), so files older than this threshold will never be considered for 
> processing. 
> The bug is with case 3.






[jira] [Updated] (SPARK-19814) Spark History Server Out Of Memory / Extreme GC

2017-03-03 Thread Simon King (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon King updated SPARK-19814:
---
Description: 
Spark History Server runs out of memory, gets into GC thrashing and eventually 
becomes unresponsive. This seems to happen more quickly with heavy use of the 
REST API. We've seen this with several versions of Spark. 

Running with the following settings (spark 2.1):
spark.history.fs.cleaner.enabled    true
spark.history.fs.cleaner.interval   1d
spark.history.fs.cleaner.maxAge     7d
spark.history.retainedApplications  500

We will eventually get errors like:
17/02/25 05:02:19 WARN ServletHandler:
javax.servlet.ServletException: scala.MatchError: java.lang.OutOfMemoryError: 
GC overhead limit exceeded (of class java.lang.OutOfMemoryError)
  at 
org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489)
  at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
  at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
  at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
  at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
  at 
org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
  at 
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
  at 
org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
  at 
org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
  at 
org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
  at 
org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
  at 
org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:529)
  at 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
  at 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
  at org.spark_project.jetty.server.Server.handle(Server.java:499)
  at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
  at 
org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
  at 
org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
  at 
org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
  at 
org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
  at java.lang.Thread.run(Thread.java:745)

Caused by: scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit 
exceeded (of class java.lang.OutOfMemoryError)
  at 
org.apache.spark.deploy.history.ApplicationCache.getSparkUI(ApplicationCache.scala:148)
  at 
org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:110)
  at 
org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:244)
  at 
org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:49)
  at 
org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66)
  at sun.reflect.GeneratedMethodAccessor102.invoke(Unknown Source)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at 
org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter$1.run(SubResourceLocatorRouter.java:158)
  at 
org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.getResource(SubResourceLocatorRouter.java:178)
  at 
org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.apply(SubResourceLocatorRouter.java:109)
  at 
org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:109)
  at 
org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112)
  at 
org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112)
  at 
org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112)
  at 
org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112)
  at 
org.glassfish.jersey.server.internal.routing.RoutingStage.apply(RoutingStage.java:92)
  at 
org.glassfish.jersey.server.internal.routing.RoutingStage.apply(RoutingStage.java:61)
  at org.glassfish.jersey.process.internal.Stages.process(Stages.java:197)
  at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:318)
  at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
  at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
  at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
  at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
  at org.glassfish.jersey.internal.Errors.process(Errors.java:267)

  at 

[jira] [Created] (SPARK-19814) Spark History Server Out Of Memory / Extreme GC

2017-03-03 Thread Simon King (JIRA)
Simon King created SPARK-19814:
--

 Summary: Spark History Server Out Of Memory / Extreme GC
 Key: SPARK-19814
 URL: https://issues.apache.org/jira/browse/SPARK-19814
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.1.0, 2.0.0, 1.6.1
 Environment: Spark History Server (we've run it on several different 
Hadoop distributions)
Reporter: Simon King


Spark History Server runs out of memory, gets into GC thrashing and eventually 
becomes unresponsive. This seems to happen more quickly with heavy use of the 
REST API. We've seen this with several versions of Spark. 

Running with the following settings (spark 2.1):
{{spark.history.fs.cleaner.enabled    true
spark.history.fs.cleaner.interval   1d
spark.history.fs.cleaner.maxAge     7d
spark.history.retainedApplications  500}}

We will eventually get errors like:
{{17/02/25 05:02:19 WARN ServletHandler:
javax.servlet.ServletException: scala.MatchError: java.lang.OutOfMemoryError: 
GC overhead limit exceeded (of class java.lang.OutOfMemoryError)
  at 
org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489)
  at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
  at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
  at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
  at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
  at 
org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812)
  at 
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
  at 
org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
  at 
org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
  at 
org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
  at 
org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
  at 
org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:529)
  at 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
  at 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
  at org.spark_project.jetty.server.Server.handle(Server.java:499)
  at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311)
  at 
org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
  at 
org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
  at 
org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
  at 
org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
  at java.lang.Thread.run(Thread.java:745)

Caused by: scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit 
exceeded (of class java.lang.OutOfMemoryError)
  at 
org.apache.spark.deploy.history.ApplicationCache.getSparkUI(ApplicationCache.scala:148)
  at 
org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:110)
  at 
org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:244)
  at 
org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:49)
  at 
org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66)
  at sun.reflect.GeneratedMethodAccessor102.invoke(Unknown Source)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at 
org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter$1.run(SubResourceLocatorRouter.java:158)
  at 
org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.getResource(SubResourceLocatorRouter.java:178)
  at 
org.glassfish.jersey.server.internal.routing.SubResourceLocatorRouter.apply(SubResourceLocatorRouter.java:109)
  at 
org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:109)
  at 
org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112)
  at 
org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112)
  at 
org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112)
  at 
org.glassfish.jersey.server.internal.routing.RoutingStage._apply(RoutingStage.java:112)
  at 
org.glassfish.jersey.server.internal.routing.RoutingStage.apply(RoutingStage.java:92)
  at 
org.glassfish.jersey.server.internal.routing.RoutingStage.apply(RoutingStage.java:61)
  at org.glassfish.jersey.process.internal.Stages.process(Stages.java:197)
  at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:318)
  at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
  at 

[jira] [Updated] (SPARK-19690) Join a streaming DataFrame with a batch DataFrame may not work

2017-03-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19690:
-
Priority: Critical  (was: Major)

> Join a streaming DataFrame with a batch DataFrame may not work
> --
>
> Key: SPARK-19690
> URL: https://issues.apache.org/jira/browse/SPARK-19690
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.3, 2.1.0, 2.1.1
>Reporter: Shixiong Zhu
>Priority: Critical
>
> When joining a streaming DataFrame with a batch DataFrame, if the batch 
> DataFrame has an aggregation, that aggregation is converted to a streaming 
> physical aggregation, and the query will then crash.
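For illustration, a hedged PySpark sketch of the reported shape (the input path
and column names are made up, not taken from the ticket):

{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, length

spark = SparkSession.builder.appName("stream-batch-join-sketch").getOrCreate()

# Batch DataFrame that contains an aggregation.
static_counts = (
    spark.range(0, 100)
         .withColumn("key", col("id") % 10)
         .groupBy("key")
         .count()
)

# Streaming DataFrame joined against the aggregated batch DataFrame.
stream = (
    spark.readStream.text("/tmp/stream-input")          # hypothetical input path
         .withColumn("key", length(col("value")) % 10)
)

# Per the report, planning this join can turn the batch aggregate into a
# streaming physical aggregation and crash the query once it is started.
joined = stream.join(static_counts, "key")
{code}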






[jira] [Updated] (SPARK-18258) Sinks need access to offset representation

2017-03-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18258:
-
Target Version/s:   (was: 2.2.0)

> Sinks need access to offset representation
> --
>
> Key: SPARK-18258
> URL: https://issues.apache.org/jira/browse/SPARK-18258
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> Transactional "exactly-once" semantics for output require storing an offset 
> identifier in the same transaction as results.
> The Sink.addBatch method currently only has access to batchId and data, not 
> the actual offset representation.
> I want to store the actual offsets, so that they are recoverable as long as 
> the results are and I'm not locked in to a particular streaming engine.
> I could see this being accomplished by adding parameters to Sink.addBatch for 
> the starting and ending offsets (either the offsets themselves, or the 
> SPARK-17829 string/json representation).  That would be an API change, but if 
> there's another way to map batch ids to offset representations without 
> changing the Sink api that would work as well.  
> I'm assuming we don't need the same level of access to offsets throughout a 
> job as e.g. the Kafka dstream gives, because Sinks are the main place that 
> should need them.
> After SPARK-17829 is complete and offsets have a .json method, an api for 
> this ticket might look like
> {code}
> trait Sink {
>   def addBatch(batchId: Long, data: DataFrame, start: OffsetSeq, end: OffsetSeq): Unit
> }
> {code}
> where start and end were provided by StreamExecution.runBatch using 
> committedOffsets and availableOffsets.  
> I'm not 100% certain that the offsets in the seq could always be mapped back 
> to the correct source when restarting complicated multi-source jobs, but I 
> think it'd be sufficient.  Passing the string/json representation of the seq 
> instead of the seq itself would probably be sufficient as well, but the 
> convention of rendering a None as "-" in the json is maybe a little 
> idiosyncratic to parse, and the constant defining that is private.






[jira] [Created] (SPARK-19813) maxFilesPerTrigger combo latestFirst may miss old files in combination with maxFileAge in FileStreamSource

2017-03-03 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-19813:
---

 Summary: maxFilesPerTrigger combo latestFirst may miss old files 
in combination with maxFileAge in FileStreamSource
 Key: SPARK-19813
 URL: https://issues.apache.org/jira/browse/SPARK-19813
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Burak Yavuz
Assignee: Burak Yavuz


There is a file stream source option called maxFileAge which limits how old the 
files can be, relative to the latest file that has been seen. This is used to 
limit the files that need to be remembered as "processed". Files older than the 
latest processed files are ignored. This value defaults to 7 days.
This causes a problem when both 
 - latestFirst = true
 - maxFilesPerTrigger > total files to be processed.

Here is what happens in all combinations:
 1) latestFirst = false - Since files are processed in order, there won't be any 
unprocessed file older than the latest processed file. All files will be 
processed.
 2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge 
thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is 
not set, then all old files get processed in the first batch, and so no file is 
left behind.
 3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch 
processes the latest X files. That sets the threshold to (latest file - 
maxFileAge), so files older than this threshold will never be considered for 
processing. 

The bug is with case 3.
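For reference, a minimal PySpark sketch of the option combination in case 3
(the input directory is hypothetical; maxFileAge keeps its 7-day default):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("latest-first-sketch").getOrCreate()

stream = (
    spark.readStream
         .option("latestFirst", "true")       # process newest files first
         .option("maxFilesPerTrigger", "10")  # first batch only sees the latest 10 files
         .text("/data/incoming")              # hypothetical input directory
)
# After that first batch, files older than (latest seen file - maxFileAge)
# are no longer considered, so older unprocessed files can be skipped.
{code}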






[jira] [Resolved] (SPARK-19774) StreamExecution should call stop() on sources when a stream fails

2017-03-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19774.
--
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> StreamExecution should call stop() on sources when a stream fails
> -
>
> Key: SPARK-19774
> URL: https://issues.apache.org/jira/browse/SPARK-19774
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.1, 2.2.0
>
>
> We call stop() on a Structured Streaming Source only when the stream is 
> shutdown when a user calls streamingQuery.stop(). We should actually stop all 
> sources when the stream fails as well, otherwise we may leak resources, e.g. 
> connections to Kafka.






[jira] [Commented] (SPARK-19701) the `in` operator in pyspark is broken

2017-03-03 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894796#comment-15894796
 ] 

Wenchen Fan commented on SPARK-19701:
-

let's remove it then

> the `in` operator in pyspark is broken
> --
>
> Key: SPARK-19701
> URL: https://issues.apache.org/jira/browse/SPARK-19701
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>
> {code}
> >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md")
> >>> linesWithSpark = textFile.filter("Spark" in textFile.value)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, 
> in __nonzero__
> raise ValueError("Cannot convert column into bool: please use '&' for 
> 'and', '|' for 'or', "
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' 
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
> {code}
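For context, the usual workaround (a sketch assuming the same session and file
as in the snippet above) is Column.contains, since Python's `in` forces the
expression into a boolean:

{code}
textFile = spark.read.text("/Users/cloud/dev/spark/README.md")
# Column.contains builds a Column expression instead of forcing a bool.
linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
{code}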






[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications

2017-03-03 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894768#comment-15894768
 ] 

Marcelo Vanzin commented on SPARK-18085:


bq. will this local db delete the data as specified by the configuration?

The existing log cleaner functionality will be maintained, so the application 
logs will be cleaned the same way they are today. For the new local DBs, I 
kinda touch on that in the document. My current plan is to first have a 
configuration for the maximum amount of data the SHS can use locally (and use an 
LRU-style approach to delete local DBs), and eventually cache these DBs in 
remote storage (e.g. HDFS) so that they don't need to be re-created (which can 
be expensive).

> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.






[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2017-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894701#comment-15894701
 ] 

Apache Spark commented on SPARK-18278:
--

User 'erikerlandson' has created a pull request for this issue:
https://github.com/apache/spark/pull/16061

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a Kubernetes cluster. The submitted application runs 
> in a driver executing on a Kubernetes pod, and executor lifecycles are also 
> managed as pods.






[jira] [Resolved] (SPARK-19710) Test Failures in SQLQueryTests on big endian platforms

2017-03-03 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-19710.
---
   Resolution: Fixed
 Assignee: Pete Robbins
Fix Version/s: 2.2.0

> Test Failures in SQLQueryTests on big endian platforms
> --
>
> Key: SPARK-19710
> URL: https://issues.apache.org/jira/browse/SPARK-19710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Pete Robbins
>Assignee: Pete Robbins
>Priority: Minor
> Fix For: 2.2.0
>
>
> Some of the new test queries introduced by 
> https://issues.apache.org/jira/browse/SPARK-18871 fail when run on zLinux 
> (big endian)
> The order of the returned rows differs from the results file, hence the 
> failures, but the results are valid for the queries because they do not 
> specify enough ordering to produce deterministic results.
> The failing tests are in o.a.s.SQLQueryTestSuite:
> in-joins.sql
> not-in-joins.sql
> in-set-operations.sql
> These can be fixed by extending the ORDER BY clauses to fully determine the 
> resulting row order.
> PR on its way






[jira] [Commented] (SPARK-19812) YARN shuffle service fails to relocate recovery DB directories

2017-03-03 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894584#comment-15894584
 ] 

Thomas Graves commented on SPARK-19812:
---

Note that it will go ahead and start using the recovery DB; it just doesn't 
copy over the old one, so anything running gets lost.

> YARN shuffle service fails to relocate recovery DB directories
> --
>
> Key: SPARK-19812
> URL: https://issues.apache.org/jira/browse/SPARK-19812
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> The YARN shuffle service tries to switch from the YARN local directories to 
> the real recovery directory but can fail to move the existing recovery DBs. 
> It fails because Files.move does not handle directories that have contents.
> 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move 
> recovery file sparkShuffleRecovery.ldb to the path 
> /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle
> java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb
> at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498)
> at 
> sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
> at java.nio.file.Files.move(Files.java:1395)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)
> This used to use f.renameTo and we switched it in the PR due to review 
> comments, and it looks like we didn't do a final real test. The tests use 
> files rather than directories, so they didn't catch this. We need to fix the 
> tests as well.
> history: 
> https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440






[jira] [Updated] (SPARK-19812) YARN shuffle service fails to relocate recovery DB directories

2017-03-03 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-19812:
--
Summary: YARN shuffle service fails to relocate recovery DB directories  
(was: YARN shuffle service fix moving recovery DB directories)

> YARN shuffle service fails to relocate recovery DB directories
> --
>
> Key: SPARK-19812
> URL: https://issues.apache.org/jira/browse/SPARK-19812
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> The YARN shuffle service tries to switch from the YARN local directories to 
> the real recovery directory but can fail to move the existing recovery DBs. 
> It fails because Files.move does not handle directories that have contents.
> 2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move 
> recovery file sparkShuffleRecovery.ldb to the path 
> /mapred/yarn-nodemanager/nm-aux-services/spark_shuffle
> java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb
> at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498)
> at 
> sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
> at java.nio.file.Files.move(Files.java:1395)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)
> This code used to use f.renameTo; we switched it to Files.move in the PR due to 
> review comments, and it looks like we didn't do a final real test. The tests use 
> files rather than directories, so they didn't catch the problem. We need to fix 
> the tests as well.
> history: 
> https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19812) YARN shuffle service fix moving recovery DB directories

2017-03-03 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-19812:
-

 Summary: YARN shuffle service fix moving recovery DB directories
 Key: SPARK-19812
 URL: https://issues.apache.org/jira/browse/SPARK-19812
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.0.1
Reporter: Thomas Graves
Assignee: Thomas Graves


The YARN shuffle service tries to switch from the YARN local directories to the 
real recovery directory but can fail to move the existing recovery DBs. It 
fails because Files.move cannot move directories that have contents.

2017-03-03 14:57:19,558 [main] ERROR yarn.YarnShuffleService: Failed to move 
recovery file sparkShuffleRecovery.ldb to the path 
/mapred/yarn-nodemanager/nm-aux-services/spark_shuffle
java.nio.file.DirectoryNotEmptyException:/yarn-local/sparkShuffleRecovery.ldb
at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:498)
at 
sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
at java.nio.file.Files.move(Files.java:1395)
at 
org.apache.spark.network.yarn.YarnShuffleService.initRecoveryDb(YarnShuffleService.java:369)
at 
org.apache.spark.network.yarn.YarnShuffleService.createSecretManager(YarnShuffleService.java:200)
at 
org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:174)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:262)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)

This code used to use f.renameTo; we switched it to Files.move in the PR due to 
review comments, and it looks like we didn't do a final real test. The tests use 
files rather than directories, so they didn't catch the problem. We need to fix 
the tests as well.

history: 
https://github.com/apache/spark/pull/14999/commits/65de8531ccb91287f5a8a749c7819e99533b9440
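
For context, java.nio.file.Files.move can only move a non-empty directory when 
the operation amounts to a rename within the same file store; when a copy would 
be required (as here, moving from /yarn-local to the nm-aux-services path), it 
throws DirectoryNotEmptyException. Below is a minimal Scala sketch, not the 
actual patch, of a fallback that copies the tree and then deletes the original; 
moveDirectory and the paths are illustrative only.

{code}
import java.io.IOException
import java.nio.file.{Files, Path, StandardCopyOption}
import scala.collection.JavaConverters._

// Try the cheap rename-style move first; if the move would require a copy
// (e.g. across file stores, where Files.move throws DirectoryNotEmptyException
// for non-empty directories), replicate the tree and delete the original.
def moveDirectory(source: Path, target: Path): Unit = {
  try {
    Files.move(source, target, StandardCopyOption.REPLACE_EXISTING)
  } catch {
    case _: IOException =>
      val paths = Files.walk(source).iterator().asScala.toSeq
      paths.foreach { p =>
        val dest = target.resolve(source.relativize(p))
        if (Files.isDirectory(p)) Files.createDirectories(dest)
        else Files.copy(p, dest, StandardCopyOption.REPLACE_EXISTING)
      }
      // Depth-first walk order reversed: delete children before parents.
      paths.reverse.foreach(Files.deleteIfExists)
  }
}
{code}

The same effect could be had by keeping f.renameTo and only falling back to a 
recursive copy when the rename fails; the point of the sketch is just that a 
plain Files.move call is not enough for a populated LevelDB directory.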



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18389) Disallow cyclic view reference

2017-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18389:


Assignee: (was: Apache Spark)

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}
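
To make the required check concrete, here is a small, purely illustrative Scala 
sketch of cycle detection over a view-dependency map; createsCycle and viewDeps 
are hypothetical names for this example, not Spark APIs.

{code}
// Hypothetical illustration: would redefining `view` in terms of `newDeps`
// introduce a cycle, given a view -> referenced-views map?
def createsCycle(view: String,
                 newDeps: Seq[String],
                 viewDeps: Map[String, Seq[String]]): Boolean = {
  // A cycle exists if `view` is reachable from any of its new dependencies.
  def reachable(from: String, seen: Set[String]): Boolean = {
    if (from == view) true
    else if (seen.contains(from)) false
    else viewDeps.getOrElse(from, Nil).exists(reachable(_, seen + from))
  }
  newDeps.exists(reachable(_, Set.empty))
}

// With the example above: testView2 depends on testView, so altering
// testView to read from testView2 creates a cycle.
val deps = Map("testView2" -> Seq("testView"))
assert(createsCycle("testView", Seq("testView2"), deps))
{code}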



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18389) Disallow cyclic view reference

2017-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18389:


Assignee: Apache Spark

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18389) Disallow cyclic view reference

2017-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894571#comment-15894571
 ] 

Apache Spark commented on SPARK-18389:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/17152

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19758) Casting string to timestamp in inline table definition fails with AnalysisException

2017-03-03 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-19758.
---
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.2.0

> Casting string to timestamp in inline table definition fails with 
> AnalysisException
> ---
>
> Key: SPARK-19758
> URL: https://issues.apache.org/jira/browse/SPARK-19758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>Assignee: Liang-Chi Hsieh
>Priority: Blocker
> Fix For: 2.2.0
>
>
> The following query runs successfully on Spark 2.1.x but fails in the current 
> master:
> {code}
> sql("""CREATE TEMPORARY VIEW table_4(timestamp_col_3) AS VALUES 
> TIMESTAMP('1991-12-06 00:00:00.0')""")
> {code}
> Here's the error:
> {code}
> scala> sql("""CREATE TEMPORARY VIEW table_4(timestamp_col_3) AS VALUES 
> TIMESTAMP('1991-12-06 00:00:00.0')""")
> org.apache.spark.sql.AnalysisException: failed to evaluate expression 
> CAST('1991-12-06 00:00:00.0' AS TIMESTAMP): None.get; line 1 pos 50
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4$$anonfun$apply$4.apply(ResolveInlineTables.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4$$anonfun$apply$4.apply(ResolveInlineTables.scala:95)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4.apply(ResolveInlineTables.scala:95)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$4.apply(ResolveInlineTables.scala:94)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.convert(ResolveInlineTables.scala:94)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:36)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$$anonfun$apply$1.applyOrElse(ResolveInlineTables.scala:32)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.apply(ResolveInlineTables.scala:32)
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables$.apply(ResolveInlineTables.scala:31)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:65)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:128)
>   at 
> 
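
As a stop-gap while the regression is open, a hedged workaround sketch (assuming 
a SparkSession named spark): build the view from a plain SELECT with an explicit 
CAST, which avoids the inline-table VALUES path where the failure occurs.

{code}
// Illustrative workaround only, not the eventual fix.
spark.sql(
  """CREATE TEMPORARY VIEW table_4 AS
    |SELECT CAST('1991-12-06 00:00:00.0' AS TIMESTAMP) AS timestamp_col_3
  """.stripMargin)
{code}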

[jira] [Commented] (SPARK-15797) To expose groupingSets for DataFrame

2017-03-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-15797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894509#comment-15894509
 ] 

Pau Tallada Crespí commented on SPARK-15797:


Hi, any progress on this? :P

> To expose groupingSets for DataFrame
> 
>
> Key: SPARK-15797
> URL: https://issues.apache.org/jira/browse/SPARK-15797
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Priyanka Garg
>
> Currently, the cube and rollup functions are exposed on DataFrame, but 
> grouping sets are not.
> For example,
> df.rollup($"department", $"group", $"designation").avg() results in:
> a. All combinations of department, group and designation
> b. All combinations of department and group, taking designation as null
> c. All departments, taking group and designation as null
> d. Department and group both taken as null (i.e. aggregating over the 
> complete data)
> Along the same lines, there should be a groupingSets function in which custom 
> groupings can be specified.
> For example,
> df.groupingSets(($"department", $"group", $"designation"), ($"group"), 
> ($"designation"), ()).avg()
> should result in:
> 1. All combinations of department, group and designation
> 2. All values of group, taking department and designation as null
> 3. All values of designation, taking department and group as null
> 4. Aggregation over the complete data, i.e. taking designation, group and 
> department as null.
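
Until such a DataFrame method exists, the same groupings can be expressed 
through SQL; a minimal sketch, assuming a SparkSession named spark, an existing 
DataFrame df with these columns plus an assumed salary measure, and with group 
renamed to grp purely for readability:

{code}
// GROUPING SETS is already expressible in Spark SQL; a DataFrame-level
// groupingSets method would expose the same plan. Illustrative only.
df.createOrReplaceTempView("emp")
spark.sql("""
  SELECT department, grp, designation, AVG(salary) AS avg_salary
  FROM emp
  GROUP BY department, grp, designation
  GROUPING SETS ((department, grp, designation), (grp), (designation), ())
""").show()
{code}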



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19503) Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()

2017-03-03 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894496#comment-15894496
 ] 

Herman van Hovell commented on SPARK-19503:
---

We do not prune local sorts yet; however a user can explicitly request those. 
The query should return the requested physical layout, but other than that we 
should just prune unneeded shuffles and sorts.

> Execution Plan Optimizer: avoid sort or shuffle when it does not change end 
> result such as df.sort(...).count()
> ---
>
> Key: SPARK-19503
> URL: https://issues.apache.org/jira/browse/SPARK-19503
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
> Environment: Perhaps only a pyspark or databricks AWS issue
>Reporter: R
>Priority: Minor
>  Labels: execution, optimizer, plan, query
>
> df.sort(...).count()
> performs a shuffle and a sort and then the count. This is wasteful, since the 
> sort is not required for the result, and it makes me wonder how smart the 
> algebraic optimiser really is. The data may already carry a known count (such 
> as Parquet files), and we should not shuffle just to perform a count.
> This may look trivial, but if the optimiser fails to recognise this, I wonder 
> what else it is missing, especially in more complex operations.
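
One way to see what actually runs is to compare physical plans; a minimal 
sketch, assuming a SparkSession named spark (the column name is illustrative):

{code}
// If the optimiser pruned the sort, the plan for the sorted variant should not
// contain an extra Sort/Exchange node compared with the baseline.
val df = spark.range(0, 1000000).toDF("id")

val n = df.sort("id").count()              // the action in question

df.groupBy().count().explain()             // baseline aggregate plan
df.sort("id").groupBy().count().explain()  // same aggregate with the sort
{code}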



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19764) Executors hang with supposedly running task that are really finished.

2017-03-03 Thread Ari Gesher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ari Gesher updated SPARK-19764:
---

There's no output in the driver. It just appears to be hung.


> Executors hang with supposedly running task that are really finished.
> -
>
> Key: SPARK-19764
> URL: https://issues.apache.org/jira/browse/SPARK-19764
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04 LTS
> OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
> Spark 2.0.2 - Spark Cluster Manager
>Reporter: Ari Gesher
> Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, 
> SPARK-19764.tgz
>
>
> We've come across a job that won't finish. Running on a six-node cluster, 
> each of the executors ends up with 5-7 tasks that are never marked as 
> completed.
> Here's an excerpt from the web UI:
> ||Index  ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch 
> Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result 
> Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read 
> Size / Records||Errors||
> |105  | 1131  | 0 | SUCCESS   |PROCESS_LOCAL  |4 / 172.31.24.171 |
> 2017/02/27 22:51:36 |   1.9 min |   9 ms |  4 ms |  0.7 s | 2 ms|   6 ms| 
>   384.1 MB|   90.3 MB / 572   | |
> |106| 1168|   0|  RUNNING |ANY|   2 / 172.31.16.112|  2017/02/27 
> 22:53:25|6.5 h   |0 ms|  0 ms|   1 s |0 ms|  0 ms|   |384.1 MB   
> |98.7 MB / 624 | |  
> However, the Executor reports the task as finished: 
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> As does the driver log:
> {noformat}
> 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168)
> 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). 
> 2633558 bytes result sent via BlockManager)
> {noformat}
> Full log from this executor and the {{stderr}} from 
> {{app-20170227223614-0001/2/stderr}} attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)

2017-03-03 Thread Jakub Dubovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894402#comment-15894402
 ] 

Jakub Dubovsky commented on SPARK-16599:


[~srowen] I tried to create a custom Spark build with the change you suggested 
above, but I am unable to install it locally (see below). I asked on the Spark 
dev mailing list but nobody really helped, so I am posting it here.

[This is the 
change|https://gist.github.com/james64/cc158bdb81bc1828937c757fde94ce82] I made 
to Spark on the v2.1.0 tag, and [this is the build 
output|https://gist.github.com/james64/85b3bf4613e7105bebd687502258a518] I got 
when I tried to run this:

./build/mvn -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn 
-Dhadoop.version=2.6.0-cdh5.7.1 clean install

I believe the profile selection and versions are right because this was 
successful:

./dev/make-distribution.sh --name spark-custom-lock --tgz -Phadoop-2.6 -Phive 
-Phive-thriftserver -Pyarn -Dhadoop.version=2.6.0-cdh5.7.1

> java.util.NoSuchElementException: None.get  at at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
> --
>
> Key: SPARK-16599
> URL: https://issues.apache.org/jira/browse/SPARK-16599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: centos 6.7   spark 2.0
>Reporter: binde
>
> Running a Spark job with Spark 2.0 gives this error message:
> Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): 
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19810) Remove support for Scala 2.10

2017-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19810:


Assignee: Apache Spark  (was: Sean Owen)

> Remove support for Scala 2.10
> -
>
> Key: SPARK-19810
> URL: https://issues.apache.org/jira/browse/SPARK-19810
> Project: Spark
>  Issue Type: Task
>  Components: ML, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Critical
>
> This tracks the removal of Scala 2.10 support, as discussed in 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html
>  and other lists.
> The primary motivations are to simplify the code and build, and to enable 
> Scala 2.12 support later. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19810) Remove support for Scala 2.10

2017-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19810:


Assignee: Sean Owen  (was: Apache Spark)

> Remove support for Scala 2.10
> -
>
> Key: SPARK-19810
> URL: https://issues.apache.org/jira/browse/SPARK-19810
> Project: Spark
>  Issue Type: Task
>  Components: ML, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Critical
>
> This tracks the removal of Scala 2.10 support, as discussed in 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html
>  and other lists.
> The primary motivations are to simplify the code and build, and to enable 
> Scala 2.12 support later. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19810) Remove support for Scala 2.10

2017-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894267#comment-15894267
 ] 

Apache Spark commented on SPARK-19810:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17150

> Remove support for Scala 2.10
> -
>
> Key: SPARK-19810
> URL: https://issues.apache.org/jira/browse/SPARK-19810
> Project: Spark
>  Issue Type: Task
>  Components: ML, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Critical
>
> This tracks the removal of Scala 2.10 support, as discussed in 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html
>  and other lists.
> The primary motivations are to simplify the code and build, and to enable 
> Scala 2.12 support later. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19811) sparksql 2.1 can not prune hive partition

2017-03-03 Thread sydt (JIRA)
sydt created SPARK-19811:


 Summary: sparksql 2.1 can not prune hive partition 
 Key: SPARK-19811
 URL: https://issues.apache.org/jira/browse/SPARK-19811
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: sydt


When Spark SQL 2.1 executes a query, it fails with the error:
java.lang.RuntimeException: Expected only partition pruning predicates: 
(isnotnull(DAY_ID#216) && (DAY_ID#216 = 20170212))
The SQL statement is:
select PROD_INST_ID from CRM_DB.ITG_PROD_INST WHERE DAY_ID='20170212' AND 
PROV_ID='842' limit 10;
where DAY_ID and PROV_ID are partition columns in Hive.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19810) Remove support for Scala 2.10

2017-03-03 Thread Sean Owen (JIRA)
Sean Owen created SPARK-19810:
-

 Summary: Remove support for Scala 2.10
 Key: SPARK-19810
 URL: https://issues.apache.org/jira/browse/SPARK-19810
 Project: Spark
  Issue Type: Task
  Components: ML, Spark Core, SQL
Affects Versions: 2.1.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Critical


This tracks the removal of Scala 2.10 support, as discussed in 
http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html
 and other lists.

The primary motivations are to simplify the code and build, and to enable Scala 
2.12 support later. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16773) Post Spark 2.0 deprecation & warnings cleanup

2017-03-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16773.
---
Resolution: Done

> Post Spark 2.0 deprecation & warnings cleanup
> -
>
> Key: SPARK-16773
> URL: https://issues.apache.org/jira/browse/SPARK-16773
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark, Spark Core, SQL
>Reporter: holdenk
>
> As part of the 2.0 release we deprecated a number of different internal 
> components (one of the largest ones being the old accumulator API), and also 
> upgraded our default build to Scala 2.11.
> This has added a large number of deprecation warnings (internal and external) 
> - some of which can be worked around - and some of which can't (mostly in the 
> Scala 2.10 -> 2.11 reflection API and various tests).
> We should attempt to limit the number of warnings in our build so that we can 
> notice new ones and thoughtfully consider if they are warranted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16775) Reduce internal warnings from deprecated accumulator API

2017-03-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894209#comment-15894209
 ] 

Sean Owen commented on SPARK-16775:
---

Are there still areas where uses of deprecated accumulators can be changed? I'm 
aware they're still referenced from tests, but they kind of have to be in at 
least most of those cases.

> Reduce internal warnings from deprecated accumulator API
> 
>
> Key: SPARK-16775
> URL: https://issues.apache.org/jira/browse/SPARK-16775
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: holdenk
>
> Deprecating the old accumulator API added a large number of warnings - many 
> of these could be fixed with a bit of refactoring to offer a non-deprecated 
> internal class while still preserving the external deprecation warnings.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16775) Reduce internal warnings from deprecated accumulator API

2017-03-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16775:
--
Affects Version/s: 2.1.0
   Issue Type: Improvement  (was: Sub-task)
   Parent: (was: SPARK-16773)

> Reduce internal warnings from deprecated accumulator API
> 
>
> Key: SPARK-16775
> URL: https://issues.apache.org/jira/browse/SPARK-16775
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: holdenk
>
> Deprecating the old accumulator API added a large number of warnings - many 
> of these could be fixed with a bit of refactoring to offer a non-deprecated 
> internal class while still preserving the external deprecation warnings.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19807) Add reason for cancellation when a stage is killed using web UI

2017-03-03 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894185#comment-15894185
 ] 

Genmao Yu edited comment on SPARK-19807 at 3/3/17 11:35 AM:


!https://cloud.githubusercontent.com/assets/7402327/23549702/6a0c93f6-0048-11e7-8a3f-bf58befb887b.png!

Do you mean the "Job 0 cancelled" message in the picture?


was (Author: unclegen):
!https://cloud.githubusercontent.com/assets/7402327/23549478/70888646-0047-11e7-8e2c-e64a3db43711.png!

Do you mean the "Job 0 cancelled" in picture?

> Add reason for cancellation when a stage is killed using web UI
> ---
>
> Key: SPARK-19807
> URL: https://issues.apache.org/jira/browse/SPARK-19807
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> When a user kills a stage using web UI (in Stages page), 
> {{StagesTab.handleKillRequest}} requests {{SparkContext}} to cancel the stage 
> without giving a reason. {{SparkContext}} has {{cancelStage(stageId: Int, 
> reason: String)}} that Spark could use to pass the information for 
> monitoring/debugging purposes.
> {code}
> scala> sc.range(0, 5, 1, 1).mapPartitions { nums => { Thread.sleep(60 * 
> 1000); nums } }.count
> {code}
> Use http://localhost:4040/stages/ and click Kill.
> {code}
> org.apache.spark.SparkException: Job 0 cancelled because Stage 0 was cancelled
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1486)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:1426)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleStageCancellation$1.apply$mcVI$sp(DAGScheduler.scala:1415)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleStageCancellation$1.apply(DAGScheduler.scala:1408)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleStageCancellation$1.apply(DAGScheduler.scala:1408)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:234)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleStageCancellation(DAGScheduler.scala:1408)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1670)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1656)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1645)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2019)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2040)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2059)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2084)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
>   ... 48 elided
> {code}
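
A hedged sketch of the proposed change, using only the two-argument cancelStage 
overload mentioned in the description (the surrounding handler code is 
simplified away; sc and stageId are assumed to be in scope):

{code}
// Inside the web UI kill handler: pass a human-readable reason so it shows up
// in the job failure message instead of a bare cancellation.
sc.cancelStage(stageId, s"Stage $stageId was killed from the Spark web UI")
{code}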



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19807) Add reason for cancellation when a stage is killed using web UI

2017-03-03 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894185#comment-15894185
 ] 

Genmao Yu commented on SPARK-19807:
---

!https://cloud.githubusercontent.com/assets/7402327/23549478/70888646-0047-11e7-8e2c-e64a3db43711.png!

Do you mean the "Job 0 cancelled" message in the picture?

> Add reason for cancellation when a stage is killed using web UI
> ---
>
> Key: SPARK-19807
> URL: https://issues.apache.org/jira/browse/SPARK-19807
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> When a user kills a stage using web UI (in Stages page), 
> {{StagesTab.handleKillRequest}} requests {{SparkContext}} to cancel the stage 
> without giving a reason. {{SparkContext}} has {{cancelStage(stageId: Int, 
> reason: String)}} that Spark could use to pass the information for 
> monitoring/debugging purposes.
> {code}
> scala> sc.range(0, 5, 1, 1).mapPartitions { nums => { Thread.sleep(60 * 
> 1000); nums } }.count
> {code}
> Use http://localhost:4040/stages/ and click Kill.
> {code}
> org.apache.spark.SparkException: Job 0 cancelled because Stage 0 was cancelled
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1486)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:1426)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleStageCancellation$1.apply$mcVI$sp(DAGScheduler.scala:1415)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleStageCancellation$1.apply(DAGScheduler.scala:1408)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleStageCancellation$1.apply(DAGScheduler.scala:1408)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:234)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleStageCancellation(DAGScheduler.scala:1408)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1670)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1656)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1645)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2019)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2040)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2059)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2084)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
>   ... 48 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19782) Spark query available cores from application

2017-03-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19782.
---
Resolution: Not A Problem

> Spark query available cores from application
> 
>
> Key: SPARK-19782
> URL: https://issues.apache.org/jira/browse/SPARK-19782
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Tom Lewis
>
> It might be helpful for Spark jobs to self-regulate resources if they could 
> query how many cores exist on an executing system, not just how many are 
> being used at a given time.
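
For what is already queryable today without a new API, a hedged sketch 
(assuming a SparkContext named sc): the JVM can report the cores of the machine 
a task runs on, and the driver can see the current default parallelism.

{code}
// On the driver: the number of cores Spark will currently use by default.
val usableCores = sc.defaultParallelism

// Inside tasks: the total cores that exist on each executing machine,
// regardless of how many were handed to Spark.
val coresPerMachine = sc.parallelize(1 to sc.defaultParallelism)
  .map(_ => Runtime.getRuntime.availableProcessors)
  .distinct()
  .collect()
{code}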



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19701) the `in` operator in pyspark is broken

2017-03-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894159#comment-15894159
 ] 

Hyukjin Kwon commented on SPARK-19701:
--

I was thinking of a way to work around this (e.g., hijacking the call), but it 
seems we can't.
BTW, the code below throws a {{TypeError}} if {{__nonzero__}} or {{__bool__}} 
returns another type.

{code}
class Column(object):
def __contains__(self, item):
print "I am contains"
return Column()
def __nonzero__(self):
return "a"

>>> 1 in Column()
I am contains
Traceback (most recent call last):
  File "", line 1, in 
TypeError: __nonzero__ should return bool or int, returned str
{code}


> the `in` operator in pyspark is broken
> --
>
> Key: SPARK-19701
> URL: https://issues.apache.org/jira/browse/SPARK-19701
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>
> {code}
> >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md")
> >>> linesWithSpark = textFile.filter("Spark" in textFile.value)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, 
> in __nonzero__
> raise ValueError("Cannot convert column into bool: please use '&' for 
> 'and', '|' for 'or', "
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' 
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19701) the `in` operator in pyspark is broken

2017-03-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894155#comment-15894155
 ] 

Hyukjin Kwon commented on SPARK-19701:
--

[~cloud_fan], I took a look at this out of curiosity. It seems this is what 
happens now:

{code}
class Column(object):
def __contains__(self, item):
print "I am contains"
return Column()
def __nonzero__(self):
raise Exception("I am nonzero.")

>>> 1 in Column()
I am contains
Traceback (most recent call last):
  File "", line 1, in 
  File "", line 6, in __nonzero__
Exception: I am nonzero.
{code}

It seems {{__contains__}} is called first, and then {{__nonzero__}} or 
{{__bool__}} is called on the returned {{Column()}} to coerce it into a bool.

It seems {{__nonzero__}} (for Python 2), {{__bool__}} (for Python 3) and 
{{__contains__}} force the return value into a bool, unlike other operators.

I also checked the references below to verify my assumption:

http://stackoverflow.com/questions/12244074/python-source-code-for-built-in-in-operator/12244378#12244378
http://stackoverflow.com/questions/38542543/functionality-of-python-in-vs-contains/38542777

I tested the code above on 1.6.3, 2.1.0 and the master branch. It seems it has 
never worked.

Should we maybe remove this?

> the `in` operator in pyspark is broken
> --
>
> Key: SPARK-19701
> URL: https://issues.apache.org/jira/browse/SPARK-19701
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>
> {code}
> >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md")
> >>> linesWithSpark = textFile.filter("Spark" in textFile.value)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, 
> in __nonzero__
> raise ValueError("Cannot convert column into bool: please use '&' for 
> 'and', '|' for 'or', "
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' 
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19801) Remove JDK7 from Travis CI

2017-03-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19801:
-

Assignee: Dongjoon Hyun

> Remove JDK7 from Travis CI
> --
>
> Key: SPARK-19801
> URL: https://issues.apache.org/jira/browse/SPARK-19801
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.2.0
>
>
> Since Spark 2.1.0, Travis CI has been supported via SPARK-15207 for automated 
> PR verification (JDK7/JDK8 Maven compilation and the Java linter), and 
> contributors can see the additional results via their Travis CI dashboard (or PC).
> This issue aims to bring `.travis.yml` up to date by removing JDK7, which was 
> removed via SPARK-19550.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19801) Remove JDK7 from Travis CI

2017-03-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19801.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17143
[https://github.com/apache/spark/pull/17143]

> Remove JDK7 from Travis CI
> --
>
> Key: SPARK-19801
> URL: https://issues.apache.org/jira/browse/SPARK-19801
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.2.0
>
>
> Since Spark 2.1.0, Travis CI has been supported via SPARK-15207 for automated 
> PR verification (JDK7/JDK8 Maven compilation and the Java linter), and 
> contributors can see the additional results via their Travis CI dashboard (or PC).
> This issue aims to bring `.travis.yml` up to date by removing JDK7, which was 
> removed via SPARK-19550.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19792) In the Master Page,the column named “Memory per Node” ,I think it is not all right

2017-03-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19792:
--
Priority: Trivial  (was: Major)

Hm, I'm honestly not sure. Does this refer to the memory allocated to each 
executor by the worker, or, does it refer to the amount of memory the worker 
can assign to executors?

> In the Master Page,the column named “Memory per Node” ,I think  it is not all 
> right
> ---
>
> Key: SPARK-19792
> URL: https://issues.apache.org/jira/browse/SPARK-19792
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: liuxian
>Priority: Trivial
>
> Open the Spark web UI. The Master page has two tables, Running Applications 
> and Completed Applications, each with a column named “Memory per Node”. I 
> think this name is not quite right, because a node may have more than one 
> executor, so the column should be named “Memory per Executor”. Otherwise it 
> is easy for users to misunderstand.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19797) ML pipelines document error

2017-03-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19797:
-

Assignee: Zhe Sun

> ML pipelines document error
> ---
>
> Key: SPARK-19797
> URL: https://issues.apache.org/jira/browse/SPARK-19797
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zhe Sun
>Assignee: Zhe Sun
>Priority: Trivial
>  Labels: documentation
> Fix For: 2.1.1, 2.2.0
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> The description of pipelines in this paragraph of 
> https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works is 
> incorrect and misleads the user:
> bq. If the Pipeline had more *stages*, it would call the 
> LogisticRegressionModel’s transform() method on the DataFrame before passing 
> the DataFrame to the next stage.
> The description is not accurate because a *Transformer* can also be a stage, 
> but only another Estimator triggers the extra transform() call.
> So the description should be corrected to: *If the Pipeline had more 
> _Estimators_*.
> The code that shows this is here: 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19797) ML pipelines document error

2017-03-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19797.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 17137
[https://github.com/apache/spark/pull/17137]

> ML pipelines document error
> ---
>
> Key: SPARK-19797
> URL: https://issues.apache.org/jira/browse/SPARK-19797
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zhe Sun
>Priority: Trivial
>  Labels: documentation
> Fix For: 2.1.1, 2.2.0
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> The description of pipelines in this paragraph of 
> https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works is 
> incorrect and misleads the user:
> bq. If the Pipeline had more *stages*, it would call the 
> LogisticRegressionModel’s transform() method on the DataFrame before passing 
> the DataFrame to the next stage.
> The description is not accurate because a *Transformer* can also be a stage, 
> but only another Estimator triggers the extra transform() call.
> So the description should be corrected to: *If the Pipeline had more 
> _Estimators_*.
> The code that shows this is here: 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19339) StatFunctions.multipleApproxQuantiles can give NoSuchElementException: next on empty iterator

2017-03-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19339.
---
Resolution: Duplicate

> StatFunctions.multipleApproxQuantiles can give NoSuchElementException: next 
> on empty iterator
> -
>
> Key: SPARK-19339
> URL: https://issues.apache.org/jira/browse/SPARK-19339
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Barry Becker
>Priority: Minor
>
> This problem is easy to reproduce by running 
> StatFunctions.multipleApproxQuantiles on an empty dataset, but I think it can 
> occur in other cases, like if the column is all null or all one value.
> I have unit tests that can hit it in several different cases.
> The fix that I have introduced locally is to return
> {code}
>  if (sampled.length == 0) 0 else sampled.last.value
> {code}
> instead of 
> {code}
> sampled.last.value
> {code}
> at the end of QuantileSummaries.query.
> Below is the exception:
> {code}
> next on empty iterator
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at 
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
>   at scala.collection.IterableLike$class.head(IterableLike.scala:107)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
>   at 
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
>   at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
>   at 
> scala.collection.TraversableLike$class.last(TraversableLike.scala:459)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186)
>   at 
> scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132)
>   at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(SparkPercentileCalculator.scala:91)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(SparkPercentileCalculator.scala:91)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(SparkPercentileCalculator.scala:91)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1.apply(SparkPercentileCalculator.scala:91)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1.apply(SparkPercentileCalculator.scala:91)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator.multipleApproxQuantiles(SparkPercentileCalculator.scala:91)
>   at 
> com.mineset.spark.statistics.model.ContinuousMinesetStats.quartiles$lzycompute(ContinuousMinesetStats.scala:274)
>   at 
> com.mineset.spark.statistics.model.ContinuousMinesetStats.quartiles(ContinuousMinesetStats.scala:272)
>   at 
> com.mineset.spark.statistics.model.MinesetStats.com$mineset$spark$statistics$model$MinesetStats$$serializeContinuousFeature$1(MinesetStats.scala:66)
>   at 
> com.mineset.spark.statistics.model.MinesetStats$$anonfun$calculateWithColumns$1.apply(MinesetStats.scala:118)
>   at 
> com.mineset.spark.statistics.model.MinesetStats$$anonfun$calculateWithColumns$1.apply(MinesetStats.scala:114)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> 

[jira] [Resolved] (SPARK-19739) SparkHadoopUtil.appendS3AndSparkHadoopConfigurations to propagate full set of AWS env vars

2017-03-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19739.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17080
[https://github.com/apache/spark/pull/17080]

> SparkHadoopUtil.appendS3AndSparkHadoopConfigurations to propagate full set of 
> AWS env vars
> --
>
> Key: SPARK-19739
> URL: https://issues.apache.org/jira/browse/SPARK-19739
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Steve Loughran
>Priority: Minor
> Fix For: 2.2.0
>
>
> {{SparkHadoopUtil.appendS3AndSparkHadoopConfigurations()}} propagates the AWS 
> user and secret key to the s3n and s3a config options, so secrets set by the 
> user reach the cluster.
> AWS also supports session authentication (env var {{AWS_SESSION_TOKEN}}) and 
> region endpoints ({{AWS_DEFAULT_REGION}}), the latter being critical if you 
> want to address V4-auth-only endpoints like Frankfurt and Seoul.
> These env vars should be picked up and passed down to s3a too. It is 4+ lines 
> of code, though impossible to test unless the existing code is refactored to 
> take the env vars as a Map[String, String], so that a test suite can set the 
> values in its own map.
> Side issue: what if only half the env vars are set and users are trying to 
> understand why auth is failing? It may be good to build up a string 
> identifying which env vars had their values propagated, and log that at debug 
> level, while obviously not logging the values themselves.
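
A hedged sketch of the extra propagation being requested, mirroring the 
existing key/secret handling; fs.s3a.session.token and fs.s3a.endpoint are 
standard Hadoop s3a keys, but the exact mapping (and the region-to-endpoint 
translation) is an assumption here, not the final patch:

{code}
import org.apache.hadoop.conf.Configuration

// Illustrative only: append the extra env vars the same way the key/secret
// pair is already handled. Taking the env as a parameter (instead of reading
// sys.env directly) is what would make this testable from a suite.
def appendAwsSessionConfigs(env: Map[String, String], hadoopConf: Configuration): Unit = {
  env.get("AWS_SESSION_TOKEN").foreach { token =>
    hadoopConf.set("fs.s3a.session.token", token)
  }
  env.get("AWS_DEFAULT_REGION").foreach { region =>
    // Assumed mapping: derive the region-specific endpoint host name.
    hadoopConf.set("fs.s3a.endpoint", s"s3.$region.amazonaws.com")
  }
}
{code}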



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19739) SparkHadoopUtil.appendS3AndSparkHadoopConfigurations to propagate full set of AWS env vars

2017-03-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19739:
-

Assignee: Genmao Yu

> SparkHadoopUtil.appendS3AndSparkHadoopConfigurations to propagate full set of 
> AWS env vars
> --
>
> Key: SPARK-19739
> URL: https://issues.apache.org/jira/browse/SPARK-19739
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Steve Loughran
>Assignee: Genmao Yu
>Priority: Minor
> Fix For: 2.2.0
>
>
> {{SparkHadoopUtil.appendS3AndSparkHadoopConfigurations()}} propagates the AWS 
> user and secret key to the s3n and s3a config options, so secrets set by the 
> user reach the cluster.
> AWS also supports session authentication (env var {{AWS_SESSION_TOKEN}}) and 
> region endpoints ({{AWS_DEFAULT_REGION}}), the latter being critical if you 
> want to address V4-auth-only endpoints like Frankfurt and Seoul.
> These env vars should be picked up and passed down to s3a too. It is 4+ lines 
> of code, though impossible to test unless the existing code is refactored to 
> take the env vars as a Map[String, String], so that a test suite can set the 
> values in its own map.
> Side issue: what if only half the env vars are set and users are trying to 
> understand why auth is failing? It may be good to build up a string 
> identifying which env vars had their values propagated, and log that at debug 
> level, while obviously not logging the values themselves.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19794) Release HDFS Client after read/write checkpoint

2017-03-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19794.
---
Resolution: Not A Problem

See PR

> Release HDFS Client after read/write checkpoint
> ---
>
> Key: SPARK-19794
> URL: https://issues.apache.org/jira/browse/SPARK-19794
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0
>Reporter: darion yaphet
>
> RDD checkpointing writes each partition into HDFS and reads it back from HDFS 
> when the RDD needs recomputation. After working with HDFS, the HDFS client and 
> streams should be closed.
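
For context, the general close-after-use pattern the report is asking about, as a sketch of a plain HDFS write; this is not Spark's actual checkpoint code:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: write some bytes to HDFS and always close the stream.
def writeCheckpointFile(path: Path, bytes: Array[Byte], conf: Configuration): Unit = {
  val fs: FileSystem = path.getFileSystem(conf)
  val out = fs.create(path)
  try {
    out.write(bytes)
  } finally {
    out.close() // release the stream even if the write fails
  }
  // Note: Hadoop caches FileSystem instances, so closing the FileSystem itself
  // per call is usually wrong, which may be why this was resolved as Not A Problem.
}
{code}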



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19808) About the default blocking arg in unpersist

2017-03-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894087#comment-15894087
 ] 

Sean Owen commented on SPARK-19808:
---

(Maybe you can rewrite this as a proposed change rather than a question?)

They should be consistent, but I don't think they're worth changing now because 
it's a behavior change for little gain. Consider also the destroy() and 
unpersist() operations for broadcasts.

However I have never been sure why an application would want to block waiting 
on an unpersist operation. For that reason, I think most calls in Spark are 
blocking=false and I'd personally support making this consistent. That is, 
unless someone highlights why this sometimes isn't a good idea?
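
For reference, a minimal runnable sketch contrasting the two defaults discussed here; the stated defaults reflect my reading of the 2.1 APIs and should be treated as assumptions:

{code}
import org.apache.spark.sql.SparkSession

object UnpersistDefaults {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("unpersist-defaults").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 10).cache()
    val df = spark.range(10).cache()
    rdd.count(); df.count()         // materialize both caches

    rdd.unpersist()                 // RDD default: blocking = true, waits for block removal
    df.unpersist()                  // Dataset default: blocking = false, returns immediately

    rdd.unpersist(blocking = false) // the explicit, non-blocking form under discussion
    spark.stop()
  }
}
{code}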


> About the default blocking arg in unpersist
> ---
>
> Key: SPARK-19808
> URL: https://issues.apache.org/jira/browse/SPARK-19808
> Project: Spark
>  Issue Type: Question
>  Components: ML, Spark Core
>Affects Versions: 2.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> Now, {{unpersist}} is commonly used with its default value in ML.
> Most algorithms like {{KMeans}} use {{RDD.unpersist}}, and the default 
> {{blocking}} is {{true}}.
> Meta algorithms like {{OneVsRest}} and {{CrossValidator}} use 
> {{Dataset.unpersist}}, and the default {{blocking}} is {{false}}.
> Should the default values for {{RDD.unpersist}} and {{Dataset.unpersist}} be 
> consistent?
> And should all the {{blocking}} args in ML be set to {{false}}?
> [~srowen] [~mlnick] [~yanboliang]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications

2017-03-03 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894074#comment-15894074
 ] 

DjvuLee commented on SPARK-18085:
-

[~vanzin] This is a nice design.

There is not much information about deletion. The history logs can become large 
after a few weeks; will this local DB delete data as specified by the 
configuration?

> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894039#comment-15894039
 ] 

Apache Spark commented on SPARK-19257:
--

User 'windpiger' has created a pull request for this issue:
https://github.com/apache/spark/pull/17149

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we treat `CatalogStorageFormat.locationUri` as URI string and 
> always convert it to path by `new Path(new URI(locationUri))`
> It will be safer if we can make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52
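
For illustration, a minimal sketch of the conversion pattern described above versus a URI-typed field; the variable names and the example location are illustrative only:

{code}
import java.net.URI
import org.apache.hadoop.fs.Path

// Current pattern: the location is carried around as a String and re-parsed on use.
val locationUri: String = "hdfs://namenode:8020/warehouse/db.db/tbl"
val pathFromString: Path = new Path(new URI(locationUri))

// With a java.net.URI field, a malformed location fails once, at construction time,
// and call sites only need new Path(uri).
val location: URI = new URI(locationUri)
val pathFromUri: Path = new Path(location)
{code}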



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19809) NullPointerException on empty ORC file

2017-03-03 Thread JIRA
Michał Dawid created SPARK-19809:


 Summary: NullPointerException on empty ORC file
 Key: SPARK-19809
 URL: https://issues.apache.org/jira/browse/SPARK-19809
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.0.2, 1.6.3
Reporter: Michał Dawid


When reading from a Hive ORC table, if there are some 0-byte files we get a 
NullPointerException:
{code}java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
at 
org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at 
org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at 
org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
at 
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
at 
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.zeppelin.spark.ZeppelinContext.showDF(ZeppelinContext.java:209)
at 
org.apache.zeppelin.spark.SparkSqlInterpreter.interpret(SparkSqlInterpreter.java:129)
  

[jira] [Created] (SPARK-19808) About the default blocking arg in unpersist

2017-03-03 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-19808:


 Summary: About the default blocking arg in unpersist
 Key: SPARK-19808
 URL: https://issues.apache.org/jira/browse/SPARK-19808
 Project: Spark
  Issue Type: Question
  Components: ML, Spark Core
Affects Versions: 2.1.0
Reporter: zhengruifeng
Priority: Minor


Now, {{unpersist}} is commonly used with its default value in ML.

Most algorithms like {{KMeans}} use {{RDD.unpersist}}, and the default 
{{blocking}} is {{true}}.

Meta algorithms like {{OneVsRest}} and {{CrossValidator}} use 
{{Dataset.unpersist}}, and the default {{blocking}} is {{false}}.

Should the default values for {{RDD.unpersist}} and {{Dataset.unpersist}} be 
consistent?
And should all the {{blocking}} args in ML be set to {{false}}?

[~srowen] [~mlnick] [~yanboliang]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


