[jira] [Updated] (SPARK-20407) ParquetQuerySuite 'Enabling/disabling ignoreCorruptFiles' flaky test

2017-04-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20407:

Fix Version/s: 2.1.1

> ParquetQuerySuite 'Enabling/disabling ignoreCorruptFiles' flaky test
> 
>
> Key: SPARK-20407
> URL: https://issues.apache.org/jira/browse/SPARK-20407
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
> Fix For: 2.1.1, 2.2.0
>
>
> ParquetQuerySuite test "Enabling/disabling ignoreCorruptFiles" can sometimes 
> fail. This is caused by the fact that when one task fails, the driver call 
> returns and the test code continues, but there might still be tasks running 
> that will be killed at the next kill point.
> This creates two specific issues:
> 1. Files can be closed some time after the test finishes, so 
> DebugFilesystem.assertNoOpenStreams fails. One solution is to change 
> SharedSQLContext so that assertNoOpenStreams is called inside eventually {}.
> 2. The ParquetFileReader constructor from Apache Parquet 1.8.2 can leak a 
> stream at line 538. This happens when the next line throws an exception, so 
> the constructor fails and Spark has no way to close the file.
> This happens in this test because the test deletes the temporary directory at 
> the end (while tasks might still be running), and deleting the directory 
> causes the constructor to fail.
> A solution could be to Thread.sleep at the end of the test, or to somehow 
> wait until all tasks are definitely killed before the test finishes.
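
For reference, a minimal sketch of the eventually-based check mentioned in 
point 1, assuming ScalaTest's Eventually helpers and Spark's test-only 
DebugFilesystem (not necessarily the final fix):

{code}
import org.apache.spark.DebugFilesystem
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Retry the open-stream assertion for a short while so that straggler tasks
// killed after the test body returns still get a chance to close their files.
eventually(timeout(10.seconds), interval(2.seconds)) {
  DebugFilesystem.assertNoOpenStreams()
}
{code}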



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16957) Use weighted midpoints for split values.

2017-04-22 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15980268#comment-15980268
 ] 

Yan Facai (颜发才) commented on SPARK-16957:
-

[~vlad.feinberg] Hi, I found that R's gbm uses the mean value instead of the 
weighted mean. Hence, the first phrase has been removed from the description.

> Use weighted midpoints for split values.
> 
>
> Key: SPARK-16957
> URL: https://issues.apache.org/jira/browse/SPARK-16957
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Vladimir Feinberg
>Priority: Trivial
>
> We should be using weighted split points rather than the actual continuous 
> binned feature values. For instance, in a dataset containing binary features 
> (that are fed in as continuous ones), our splits are selected as {{x <= 0.0}} 
> and {{x > 0.0}}. For any real data with some smoothness qualities, this is 
> asymptotically bad compared to GBM's approach. The split point should be a 
> weighted split point of the two values of the "innermost" feature bins; e.g., 
> if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split should be at 
> {{0.75}}.
> Example:
> {code}
> +--------+--------+-----+-----+
> |feature0|feature1|label|count|
> +--------+--------+-----+-----+
> |     0.0|     0.0|  0.0|   23|
> |     1.0|     0.0|  0.0|    2|
> |     0.0|     0.0|  1.0|    2|
> |     0.0|     1.0|  0.0|    7|
> |     1.0|     0.0|  1.0|   23|
> |     0.0|     1.0|  1.0|   18|
> |     1.0|     1.0|  1.0|    7|
> |     1.0|     1.0|  0.0|   18|
> +--------+--------+-----+-----+
> DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
>   If (feature 0 <= 0.0)
>    If (feature 1 <= 0.0)
>     Predict: -0.56
>    Else (feature 1 > 0.0)
>     Predict: 0.29333
>   Else (feature 0 > 0.0)
>    If (feature 1 <= 0.0)
>     Predict: 0.56
>    Else (feature 1 > 0.0)
>     Predict: -0.29333
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16957) Use weighted midpoints for split values.

2017-04-22 Thread 颜发才

 [ 
https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Facai (颜发才) updated SPARK-16957:

Description: 
We should be using weighted split points rather than the actual continuous 
binned feature values. For instance, in a dataset containing binary features 
(that are fed in as continuous ones), our splits are selected as {{x <= 0.0}} 
and {{x > 0.0}}. For any real data with some smoothness qualities, this is 
asymptotically bad compared to GBM's approach. The split point should be a 
weighted split point of the two values of the "innermost" feature bins; e.g., 
if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split should be at 
{{0.75}}.

Example:
{code}
+--------+--------+-----+-----+
|feature0|feature1|label|count|
+--------+--------+-----+-----+
|     0.0|     0.0|  0.0|   23|
|     1.0|     0.0|  0.0|    2|
|     0.0|     0.0|  1.0|    2|
|     0.0|     1.0|  0.0|    7|
|     1.0|     0.0|  1.0|   23|
|     0.0|     1.0|  1.0|   18|
|     1.0|     1.0|  1.0|    7|
|     1.0|     1.0|  0.0|   18|
+--------+--------+-----+-----+

DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
  If (feature 0 <= 0.0)
   If (feature 1 <= 0.0)
    Predict: -0.56
   Else (feature 1 > 0.0)
    Predict: 0.29333
  Else (feature 0 > 0.0)
   If (feature 1 <= 0.0)
    Predict: 0.56
   Else (feature 1 > 0.0)
    Predict: -0.29333
{code}

  was:
Just like R's gbm, we should be using weighted split points rather than the 
actual continuous binned feature values. For instance, in a dataset containing 
binary features (that are fed in as continuous ones), our splits are selected 
as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some smoothness 
qualities, this is asymptotically bad compared to GBM's approach. The split 
point should be a weighted split point of the two values of the "innermost" 
feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split 
should be at {{0.75}}.

Example:
{code}
+--------+--------+-----+-----+
|feature0|feature1|label|count|
+--------+--------+-----+-----+
|     0.0|     0.0|  0.0|   23|
|     1.0|     0.0|  0.0|    2|
|     0.0|     0.0|  1.0|    2|
|     0.0|     1.0|  0.0|    7|
|     1.0|     0.0|  1.0|   23|
|     0.0|     1.0|  1.0|   18|
|     1.0|     1.0|  1.0|    7|
|     1.0|     1.0|  0.0|   18|
+--------+--------+-----+-----+

DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
  If (feature 0 <= 0.0)
   If (feature 1 <= 0.0)
    Predict: -0.56
   Else (feature 1 > 0.0)
    Predict: 0.29333
  Else (feature 0 > 0.0)
   If (feature 1 <= 0.0)
    Predict: 0.56
   Else (feature 1 > 0.0)
    Predict: -0.29333
{code}


> Use weighted midpoints for split values.
> 
>
> Key: SPARK-16957
> URL: https://issues.apache.org/jira/browse/SPARK-16957
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Vladimir Feinberg
>Priority: Trivial
>
> We should be using weighted split points rather than the actual continuous 
> binned feature values. For instance, in a dataset containing binary features 
> (that are fed in as continuous ones), our splits are selected as {{x <= 0.0}} 
> and {{x > 0.0}}. For any real data with some smoothness qualities, this is 
> asymptotically bad compared to GBM's approach. The split point should be a 
> weighted split point of the two values of the "innermost" feature bins; e.g., 
> if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split should be at 
> {{0.75}}.
> Example:
> {code}
> +--------+--------+-----+-----+
> |feature0|feature1|label|count|
> +--------+--------+-----+-----+
> |     0.0|     0.0|  0.0|   23|
> |     1.0|     0.0|  0.0|    2|
> |     0.0|     0.0|  1.0|    2|
> |     0.0|     1.0|  0.0|    7|
> |     1.0|     0.0|  1.0|   23|
> |     0.0|     1.0|  1.0|   18|
> |     1.0|     1.0|  1.0|    7|
> |     1.0|     1.0|  0.0|   18|
> +--------+--------+-----+-----+
> DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes
>   If (feature 0 <= 0.0)
>    If (feature 1 <= 0.0)
>     Predict: -0.56
>    Else (feature 1 > 0.0)
>     Predict: 0.29333
>   Else (feature 0 > 0.0)
>    If (feature 1 <= 0.0)
>     Predict: 0.56
>    Else (feature 1 > 0.0)
>     Predict: -0.29333
> {code}
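
A small sketch of the count-weighted midpoint implied by the example above (an 
assumed helper, not GBM's or Spark's actual implementation): each value is 
weighted by the count on the other side, so the split lands closer to the less 
populated value.

{code}
// Count-weighted midpoint of two adjacent bin values. With 30 points at
// x = 0.0 and 10 points at x = 1.0 this gives (10 * 0.0 + 30 * 1.0) / 40
// = 0.75, matching the example in the description.
def weightedMidpoint(leftValue: Double, leftCount: Long,
                     rightValue: Double, rightCount: Long): Double =
  (rightCount * leftValue + leftCount * rightValue) / (leftCount + rightCount).toDouble

println(weightedMidpoint(0.0, 30L, 1.0, 10L))  // 0.75
{code}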



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20132) Add documentation for column string functions

2017-04-22 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-20132:
---

Assignee: Michael Patterson

> Add documentation for column string functions
> -
>
> Key: SPARK-20132
> URL: https://issues.apache.org/jira/browse/SPARK-20132
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Assignee: Michael Patterson
>Priority: Minor
>  Labels: documentation, newbie
> Fix For: 2.3.0
>
>
> Four Column string functions do not have documentation for PySpark:
> rlike
> like
> startswith
> endswith
> These functions are called through the _bin_op interface, which allows the 
> passing of a docstring. I have added docstrings with examples to each of the 
> four functions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20132) Add documentation for column string functions

2017-04-22 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-20132.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17469
[https://github.com/apache/spark/pull/17469]

> Add documentation for column string functions
> -
>
> Key: SPARK-20132
> URL: https://issues.apache.org/jira/browse/SPARK-20132
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Assignee: Michael Patterson
>Priority: Minor
>  Labels: documentation, newbie
> Fix For: 2.3.0
>
>
> Four Column string functions do not have documentation for PySpark:
> rlike
> like
> startswith
> endswith
> These functions are called through the _bin_op interface, which allows the 
> passing of a docstring. I have added docstrings with examples to each of the 
> four functions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20440) Allow SparkR session and context to have delayed binding

2017-04-22 Thread Vinayak Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinayak Joshi updated SPARK-20440:
--
Description: 
It would be useful if users could do something like this without first invoking 
{{sparkR.session()}}:

{code}
delayedAssign(".sparkRsession", { sparkR.session(..) }, 
assign.env=SparkR:::.sparkREnv)
{code}

This would help providers of interactive environments that bootstrap Spark for 
their users but where the user code need not always include SparkR. So the 
possibility of lazy semantics for setting up a SparkSession/Context would be 
very useful.

Note that the SparkR API does not have a single entry object (such as the 
Scala/Python SparkSession classes), so it is the only environment where such 
lazy setup is currently difficult to achieve; this enhancement will make it 
easier.

The changes required are minor and do not affect the external API or 
functionality in any way. I will attach a PR with the changes needed for 
consideration shortly. 


  was:
It would be useful if users could do something like this without first invoking 
{{sparkR.session()}}:

{code}
delayedAssign(".sparkRsession", { sparkR.session(..) }, 
assign.env=SparkR:::.sparkREnv)
{code}

This would help providers of interactive environments that bootstrap Spark for 
their users, where the user code need not always include SparkR, and so the 
possibility of lazy semantics for setting up a SparkSession/Context would be 
very useful.

Note that the SparkR API does not have a single entry object (such as the 
Scala/Python SparkSession classes), so it is the only environment where such 
lazy setup is currently difficult to achieve; this enhancement will make it 
easier.

The changes required are minor and do not affect the external API or 
functionality in any way. I will attach a PR with the changes needed for 
consideration shortly. 



> Allow SparkR session and context to have delayed binding
> 
>
> Key: SPARK-20440
> URL: https://issues.apache.org/jira/browse/SPARK-20440
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Vinayak Joshi
>
> It would be useful if users could do something like this without first 
> invoking {{sparkR.session()}}:
> {code}
> delayedAssign(".sparkRsession", { sparkR.session(..) }, 
> assign.env=SparkR:::.sparkREnv)
> {code}
> This would help providers of interactive environments that bootstrap Spark 
> for their users but where the user code need not always include SparkR. So 
> the possibility of lazy semantics for setting up a SparkSession/Context would 
> be very useful. 
> Note that the SparkR API does not have a single entry object (such as the 
> Scala/Python SparkSession classes), so it is the only environment where such 
> lazy setup is currently difficult to achieve; this enhancement will make it 
> easier. 
> The changes required are minor and do not affect the external API or 
> functionality in any way. I will attach a PR with the changes needed for 
> consideration shortly. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20440) Allow SparkR session and context to have delayed binding

2017-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20440:


Assignee: (was: Apache Spark)

> Allow SparkR session and context to have delayed binding
> 
>
> Key: SPARK-20440
> URL: https://issues.apache.org/jira/browse/SPARK-20440
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Vinayak Joshi
>
> It would be useful if users could do something like this without first 
> invoking {{sparkR.session()}}:
> {code}
> delayedAssign(".sparkRsession", { sparkR.session(..) }, 
> assign.env=SparkR:::.sparkREnv)
> {code}
> This would help providers of interactive environments that bootstrap Spark 
> for their users, where the user code need not always include SparkR, and so 
> the possibility of lazy semantics for setting up a SparkSession/Context would 
> be very useful. 
> Note that the SparkR API does not have a single entry object (such as the 
> Scala/Python SparkSession classes), so it is the only environment where such 
> lazy setup is currently difficult to achieve; this enhancement will make it 
> easier. 
> The changes required are minor and do not affect the external API or 
> functionality in any way. I will attach a PR with the changes needed for 
> consideration shortly. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20440) Allow SparkR session and context to have delayed binding

2017-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20440:


Assignee: Apache Spark

> Allow SparkR session and context to have delayed binding
> 
>
> Key: SPARK-20440
> URL: https://issues.apache.org/jira/browse/SPARK-20440
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Vinayak Joshi
>Assignee: Apache Spark
>
> It would be useful if users could do something like this without first 
> invoking {{sparkR.session()}}:
> {code}
> delayedAssign(".sparkRsession", { sparkR.session(..) }, 
> assign.env=SparkR:::.sparkREnv)
> {code}
> This would help providers of interactive environments that bootstrap Spark 
> for their users, where the user code need not always include SparkR, and so 
> the possibility of lazy semantics for setting up a SparkSession/Context would 
> be very useful. 
> Note that the SparkR API does not have a single entry object (such as the 
> Scala/Python SparkSession classes), so it is the only environment where such 
> lazy setup is currently difficult to achieve; this enhancement will make it 
> easier. 
> The changes required are minor and do not affect the external API or 
> functionality in any way. I will attach a PR with the changes needed for 
> consideration shortly. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20440) Allow SparkR session and context to have delayed binding

2017-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15980163#comment-15980163
 ] 

Apache Spark commented on SPARK-20440:
--

User 'vijoshi' has created a pull request for this issue:
https://github.com/apache/spark/pull/17731

> Allow SparkR session and context to have delayed binding
> 
>
> Key: SPARK-20440
> URL: https://issues.apache.org/jira/browse/SPARK-20440
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Vinayak Joshi
>
> It would be useful if users could do something like this without first 
> invoking {{sparkR.session()}}:
> {code}
> delayedAssign(".sparkRsession", { sparkR.session(..) }, 
> assign.env=SparkR:::.sparkREnv)
> {code}
> This would help providers of interactive environments that bootstrap Spark 
> for their users, where the user code need not always include SparkR, and so 
> the possibility of lazy semantics for setting up a SparkSession/Context would 
> be very useful. 
> Note that the SparkR API does not have a single entry object (such as the 
> Scala/Python SparkSession classes), so it is the only environment where such 
> lazy setup is currently difficult to achieve; this enhancement will make it 
> easier. 
> The changes required are minor and do not affect the external API or 
> functionality in any way. I will attach a PR with the changes needed for 
> consideration shortly. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20440) Allow SparkR session and context to have delayed binding

2017-04-22 Thread Vinayak Joshi (JIRA)
Vinayak Joshi created SPARK-20440:
-

 Summary: Allow SparkR session and context to have delayed binding
 Key: SPARK-20440
 URL: https://issues.apache.org/jira/browse/SPARK-20440
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Vinayak Joshi


It would be useful if users could do something like this without first invoking 
{{sparkR.session()}}:

{code}
delayedAssign(".sparkRsession", { sparkR.session(..) }, 
assign.env=SparkR:::.sparkREnv)
{code}

This would help providers of interactive environments that bootstrap Spark for 
their users, where the user code need not always include SparkR, and so the 
possibility of lazy semantics for setting up a SparkSession/Context would be 
very useful. 

Note that the SparkR API does not have a single entry object (such as the 
Scala/Python SparkSession classes), so it is the only environment where such 
lazy setup is currently difficult to achieve; this enhancement will make it 
easier. 

The changes required are minor and do not affect the external API or 
functionality in any way. I will attach a PR with the changes needed for 
consideration shortly. 




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20439) Catalog.listTables() depends on all libraries used to create tables

2017-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20439:


Assignee: Xiao Li  (was: Apache Spark)

> Catalog.listTables() depends on all libraries used to create tables
> ---
>
> Key: SPARK-20439
> URL: https://issues.apache.org/jira/browse/SPARK-20439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> spark.catalog.listTables() and getTable may return an error caused by a 
> table's serde library, e.g.:
> java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> com.amazon.emr.kinesis.hive.KinesisHiveInputFormat
> Also, if the database contains any table (e.g., an index) with a table type 
> that is not accessible by Spark SQL, the whole listTables API fails.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20439) Catalog.listTables() depends on all libraries used to create tables

2017-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20439:


Assignee: Apache Spark  (was: Xiao Li)

> Catalog.listTables() depends on all libraries used to create tables
> ---
>
> Key: SPARK-20439
> URL: https://issues.apache.org/jira/browse/SPARK-20439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> spark.catalog.listTables() and getTable may return an error caused by a 
> table's serde library, e.g.:
> java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> com.amazon.emr.kinesis.hive.KinesisHiveInputFormat
> Also, if the database contains any table (e.g., an index) with a table type 
> that is not accessible by Spark SQL, the whole listTables API fails.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20439) Catalog.listTables() depends on all libraries used to create tables

2017-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15980155#comment-15980155
 ] 

Apache Spark commented on SPARK-20439:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17730

> Catalog.listTables() depends on all libraries used to create tables
> ---
>
> Key: SPARK-20439
> URL: https://issues.apache.org/jira/browse/SPARK-20439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> spark.catalog.listTables() and getTable may return an error caused by a 
> table's serde library, e.g.:
> java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> com.amazon.emr.kinesis.hive.KinesisHiveInputFormat
> Also, if the database contains any table (e.g., an index) with a table type 
> that is not accessible by Spark SQL, the whole listTables API fails.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20438) R wrappers for split and repeat

2017-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20438:


Assignee: (was: Apache Spark)

> R wrappers for split and repeat
> ---
>
> Key: SPARK-20438
> URL: https://issues.apache.org/jira/browse/SPARK-20438
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> SparkR wrappers for {{o.a.s.sql.functions.split}} and 
> {{o.a.s.sql.functions.repeat}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20438) R wrappers for split and repeat

2017-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20438:


Assignee: Apache Spark

> R wrappers for split and repeat
> ---
>
> Key: SPARK-20438
> URL: https://issues.apache.org/jira/browse/SPARK-20438
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>
> SparkR wrappers for {{o.a.s.sql.functions.split}} and 
> {{o.a.s.sql.functions.repeat}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20438) R wrappers for split and repeat

2017-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15980153#comment-15980153
 ] 

Apache Spark commented on SPARK-20438:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/17729

> R wrappers for split and repeat
> ---
>
> Key: SPARK-20438
> URL: https://issues.apache.org/jira/browse/SPARK-20438
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> SparkR wrappers for {{o.a.s.sql.functions.split}} and 
> {{o.a.s.sql.functions.repeat}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20439) Catalog.listTables() depends on all libraries used to create tables

2017-04-22 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20439:
---

 Summary: Catalog.listTables() depends on all libraries used to 
create tables
 Key: SPARK-20439
 URL: https://issues.apache.org/jira/browse/SPARK-20439
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li
Assignee: Xiao Li


spark.catalog.listTables() and getTable may return an error caused by a 
table's serde library, e.g.:

java.lang.RuntimeException: java.lang.ClassNotFoundException: 
com.amazon.emr.kinesis.hive.KinesisHiveInputFormat

Also, if the database contains any table (e.g., an index) with a table type 
that is not accessible by Spark SQL, the whole listTables API fails.
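
For illustration, a minimal sketch of how the failure can surface when listing 
tables whose serde classes are missing from the classpath (assumed session 
setup; this is not the proposed fix):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("listTables-sketch")
  .enableHiveSupport().getOrCreate()

// Listing the catalog touches every table's metadata, so a single table whose
// serde/input format class is missing (or whose table type Spark SQL cannot
// read) can fail the whole call instead of just being skipped.
try {
  spark.catalog.listTables("default").show()
} catch {
  case e: RuntimeException =>
    println(s"listTables failed because of one table's dependencies: ${e.getMessage}")
}
{code}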




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20438) R wrappers for split and repeat

2017-04-22 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-20438:
--

 Summary: R wrappers for split and repeat
 Key: SPARK-20438
 URL: https://issues.apache.org/jira/browse/SPARK-20438
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Maciej Szymkiewicz


SparkR wrappers for {{o.a.s.sql.functions.split}} and 
{{o.a.s.sql.functions.repeat}}.
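
For reference, a quick sketch of the underlying Scala functions being wrapped 
(assuming an existing SparkSession named spark):

{code}
import org.apache.spark.sql.functions.{repeat, split}

val df = spark.createDataFrame(Seq(Tuple1("a,b,c"))).toDF("csv")

// split(column, pattern) splits a string column around a regex pattern;
// repeat(column, n) repeats the string column n times.
df.select(split(df("csv"), ",").as("parts"), repeat(df("csv"), 2).as("twice")).show()
{code}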



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19552) Upgrade Netty version to 4.1.8 final

2017-04-22 Thread Pawel Szulc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15980133#comment-15980133
 ] 

Pawel Szulc commented on SPARK-19552:
-

Wherever I go these days (while working on Spark-based projects) I have to 
deal with this issue. Elastic4s is on 4.1.x; mongo clients are on 4.1.x. I 
understand this is a breaking change, but could it be treated with a higher 
priority? I can only imagine I'm not the only person with this issue...

> Upgrade Netty version to 4.1.8 final
> 
>
> Key: SPARK-19552
> URL: https://issues.apache.org/jira/browse/SPARK-19552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> Netty 4.1.8 was recently released but isn't API compatible with previous 
> major versions (like Netty 4.0.x), see 
> http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details.
> This version does include a fix for a security concern but not one we'd be 
> exposed to with Spark "out of the box". Let's upgrade the version we use to 
> be on the safe side as the security fix I'm especially interested in is not 
> available in the 4.0.x release line. 
> We should move up anyway to take on a bunch of other bug fixes cited in the 
> release notes (and if anyone were to use Spark with netty and tcnative, they 
> shouldn't be exposed to the security problem); we should be good citizens 
> and make this change.
> As this 4.1 version involves API changes we'll need to implement a few 
> methods and possibly adjust the Sasl tests. This JIRA and associated pull 
> request starts the process which I'll work on - and any help would be much 
> appreciated! Currently I know:
> {code}
> @Override
> public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise 
> promise)
>   throws Exception {
>   if (!foundEncryptionHandler) {
> foundEncryptionHandler =
>   ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this 
> returns false and causes test failures
>   }
>   ctx.write(msg, promise);
> }
> {code}
> Here's what changes will be required (at least):
> {code}
> common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code}
>  requires touch, retain and transferred methods
> {code}
> common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code}
>  requires the above methods too
> {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code}
> With "dummy" implementations so we can at least compile and test, we'll see 
> five new test failures to address.
> These are
> {code}
> org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption
> org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption
> org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example

2017-04-22 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15980081#comment-15980081
 ] 

Maciej Szymkiewicz commented on SPARK-20208:


[~felixcheung] I believe this can be marked as resolved.

> Document R fpGrowth support in vignettes, programming guide and code example
> 
>
> Key: SPARK-20208
> URL: https://issues.apache.org/jira/browse/SPARK-20208
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Maciej Szymkiewicz
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests

2017-04-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15980056#comment-15980056
 ] 

Marcelo Vanzin commented on SPARK-20435:


bq. Providing passwords that way is supported by Spark

That's not an argument. You can type secrets in any command line and that 
doesn't make it OK.

bq. and ps ax works for users with appropriate privileges

"ps ax" works for all users. Try it for yourself.

{noformat}
vanzin@vanzin-t460p:/work/apache/spark-prs$ sudo sleep 10
{noformat}

{noformat}
vanzin@vanzin-t460p:/tmp$ ps axu | grep sleep
root 32583  0.0  0.0  73264  4524 pts/5S+   10:55   0:00 sudo sleep 
10
root 32584  0.0  0.0   7296   760 pts/5S+   10:55   0:00 sleep 10
vanzin   32586  0.0  0.0  14232  1072 pts/2S+   10:56   0:00 grep 
--color=auto sleep
{noformat}

I'm not saying redacting from logs is useless, but I'm saying that a user that 
is providing secrets in the command line is giving up any security, and 
redaction won't save him.

> More thorough redaction of sensitive information from logs/UI, more unit tests
> --
>
> Key: SPARK-20435
> URL: https://issues.apache.org/jira/browse/SPARK-20435
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>
> SPARK-18535 and SPARK-19720 added redaction of sensitive information (e.g. 
> the hadoop credential provider password, AWS access/secret keys) from event 
> logs + YARN logs + UI and from the console output, respectively.
> Some unit tests were added along with these changes - they asserted that when 
> a sensitive key was found, redaction took place for that key. They didn't 
> assert globally that when running a full-fledged Spark app (whether on YARN 
> or locally), sensitive information was not present in any of the logs or UI. 
> Such a test would also prevent regressions from happening in the future if 
> someone unknowingly adds extra logging that publishes sensitive information 
> to disk or the UI.
> Consequently, it was found that in some Java configurations, sensitive 
> information was still being leaked in the event logs under the 
> {{SparkListenerEnvironmentUpdate}} event, like so:
> {code}
> "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf 
> spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ...
> {code}
> "secret_password" should have been redacted.
> Moreover, the previous redaction logic only checked whether the key matched 
> the secret regex pattern and, if so, redacted its value. That worked for most 
> cases. However, in the above case, the key (sun.java.command) doesn't tell 
> much, so the value needs to be searched. The check therefore needs to be 
> expanded to match against values as well.
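
A standalone sketch of the value-matching check described above (the regex, 
helper name, and replacement text are assumptions for illustration, not 
Spark's actual redaction code):

{code}
// Redact a config entry if either its key looks sensitive or its value embeds
// a sensitive "key=value" fragment, as with sun.java.command above.
val keyPattern   = "(?i)password|secret|token|access[._]?key".r
val valuePattern = "(?i)(\\S*(?:password|secret|token)\\S*=)\\S+".r

def redact(key: String, value: String): (String, String) =
  if (keyPattern.findFirstIn(key).isDefined) {
    (key, "*********(redacted)")
  } else {
    (key, valuePattern.replaceAllIn(value, m => m.group(1) + "*********(redacted)"))
  }

println(redact("sun.java.command",
  "org.apache.spark.deploy.SparkSubmit --conf " +
    "spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password --class Foo"))
// only the secret_password token is replaced; the rest of the command is kept
{code}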



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20430) Throws a NullPointerException in range when wholeStage is off

2017-04-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20430.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.2.0

> Throws a NullPointerException in range when wholeStage is off
> -
>
> Key: SPARK-20430
> URL: https://issues.apache.org/jira/browse/SPARK-20430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
> Fix For: 2.2.0
>
>
> I hit an exception below in master;
> {code}
> sql("SET spark.sql.codegen.wholeStage=false")
> sql("SELECT * FROM range(1)").show
> 17/04/20 17:11:05 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:54)
> at 
> org.apache.spark.sql.execution.RangeExec.numSlices(basicPhysicalOperators.scala:343)
> at 
> org.apache.spark.sql.execution.RangeExec$$anonfun$20.apply(basicPhysicalOperators.scala:506)
> at 
> org.apache.spark.sql.execution.RangeExec$$anonfun$20.apply(basicPhysicalOperators.scala:505)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:320)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20437) R wrappers for rollup and cube

2017-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20437:


Assignee: Apache Spark

> R wrappers for rollup and cube
> --
>
> Key: SPARK-20437
> URL: https://issues.apache.org/jira/browse/SPARK-20437
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Minor
>
> Add SparkR wrappers for {{Dataset.cube}} and {{Dataset.rollup}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20437) R wrappers for rollup and cube

2017-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20437:


Assignee: (was: Apache Spark)

> R wrappers for rollup and cube
> --
>
> Key: SPARK-20437
> URL: https://issues.apache.org/jira/browse/SPARK-20437
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Add SparkR wrappers for {{Dataset.cube}} and {{Dataset.rollup}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20437) R wrappers for rollup and cube

2017-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979949#comment-15979949
 ] 

Apache Spark commented on SPARK-20437:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/17728

> R wrappers for rollup and cube
> --
>
> Key: SPARK-20437
> URL: https://issues.apache.org/jira/browse/SPARK-20437
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Add SparkR wrappers for {{Dataset.cube}} and {{Dataset.rollup}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20437) R wrappers for rollup and cube

2017-04-22 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-20437:
--

 Summary: R wrappers for rollup and cube
 Key: SPARK-20437
 URL: https://issues.apache.org/jira/browse/SPARK-20437
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Maciej Szymkiewicz
Priority: Minor


Add SparkR wrappers for {{Dataset.cube}} and {{Dataset.rollup}}.
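
For reference, a quick sketch of the underlying Scala API being wrapped 
(assuming an existing SparkSession named spark):

{code}
import org.apache.spark.sql.functions.sum

val df = spark.createDataFrame(Seq(
  ("a", "x", 1), ("a", "y", 2), ("b", "x", 3)
)).toDF("k1", "k2", "v")

// rollup produces hierarchical subtotals (k1 and k2, k1 only, grand total);
// cube produces subtotals for every combination of the grouping columns.
df.rollup("k1", "k2").agg(sum("v")).show()
df.cube("k1", "k2").agg(sum("v")).show()
{code}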



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20386) The log info "Added %s in memory on %s (size: %s, free: %s)" in function "org.apache.spark.storage.BlockManagerInfo.updateBlockInfo" is not accurate if the block exist

2017-04-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20386:
-

Assignee: eaton

> The log info "Added %s in memory on %s (size: %s, free: %s)"  in function 
> "org.apache.spark.storage.BlockManagerInfo.updateBlockInfo" is not accurate 
> if the block exists on the slave already
> --
>
> Key: SPARK-20386
> URL: https://issues.apache.org/jira/browse/SPARK-20386
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: eaton
>Assignee: eaton
>Priority: Trivial
> Fix For: 2.2.0
>
>
> The log info"Added %s in memory on %s (size: %s, free: %s)"  in function 
> "org.apache.spark.storage.BlockManagerInfo.updateBlockInfo" is not accurate 
> if the block exists on the slave already;
> the current code is:
> if (storageLevel.useMemory) {
> blockStatus = BlockStatus(storageLevel, memSize = memSize, diskSize = 
> 0)
> _blocks.put(blockId, blockStatus)
> _remainingMem -= memSize
> logInfo("Added %s in memory on %s (size: %s, free: %s)".format(
>   blockId, blockManagerId.hostPort, Utils.bytesToString(memSize),
>   Utils.bytesToString(_remainingMem)))
>   }
> If  the block exists on the slave already, the added memory should be memSize 
> - originalMemSize, the originalMemSize is _blocks.get(blockId).memSize
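
A standalone sketch of the delta-based accounting described above (simplified 
names for illustration, not the actual BlockManagerInfo code or the patch):

{code}
import scala.collection.mutable

object BlockMemoryAccountingSketch {
  private val blocks = mutable.Map.empty[String, Long]  // blockId -> memSize
  private var remainingMem: Long = 1024L * 1024 * 1024  // assume 1 GB for the sketch

  def updateBlockInfo(blockId: String, memSize: Long): Unit = {
    // Only the difference to the previously recorded size is newly added.
    val originalMemSize = blocks.getOrElse(blockId, 0L)
    val addedMem = memSize - originalMemSize
    blocks.put(blockId, memSize)
    remainingMem -= addedMem
    println(s"Added $blockId in memory (added: $addedMem bytes, free: $remainingMem bytes)")
  }

  def main(args: Array[String]): Unit = {
    updateBlockInfo("rdd_0_0", 100L)
    updateBlockInfo("rdd_0_0", 150L)  // re-registered: only 50 bytes newly added
  }
}
{code}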



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20386) The log info "Added %s in memory on %s (size: %s, free: %s)" in function "org.apache.spark.storage.BlockManagerInfo.updateBlockInfo" is not accurate if the block exist

2017-04-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20386.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17683
[https://github.com/apache/spark/pull/17683]

> The log info "Added %s in memory on %s (size: %s, free: %s)"  in function 
> "org.apache.spark.storage.BlockManagerInfo.updateBlockInfo" is not accurate 
> if the block exists on the slave already
> --
>
> Key: SPARK-20386
> URL: https://issues.apache.org/jira/browse/SPARK-20386
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: eaton
>Priority: Trivial
> Fix For: 2.2.0
>
>
> The log info"Added %s in memory on %s (size: %s, free: %s)"  in function 
> "org.apache.spark.storage.BlockManagerInfo.updateBlockInfo" is not accurate 
> if the block exists on the slave already;
> the current code is:
> if (storageLevel.useMemory) {
> blockStatus = BlockStatus(storageLevel, memSize = memSize, diskSize = 
> 0)
> _blocks.put(blockId, blockStatus)
> _remainingMem -= memSize
> logInfo("Added %s in memory on %s (size: %s, free: %s)".format(
>   blockId, blockManagerId.hostPort, Utils.bytesToString(memSize),
>   Utils.bytesToString(_remainingMem)))
>   }
> If  the block exists on the slave already, the added memory should be memSize 
> - originalMemSize, the originalMemSize is _blocks.get(blockId).memSize



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20436) NullPointerException when restart from checkpoint file

2017-04-22 Thread fangfengbin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979820#comment-15979820
 ] 

fangfengbin commented on SPARK-20436:
-

Other people have hit this problem too:
http://stackoverflow.com/questions/39039157/check-pointing-several-filestreams-in-my-spark-streaming-context


> NullPointerException when restart from checkpoint file
> --
>
> Key: SPARK-20436
> URL: https://issues.apache.org/jira/browse/SPARK-20436
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.5.0
>Reporter: fangfengbin
>
> I have written a Spark Streaming application which has two DStreams.
> The code is:
> object KafkaTwoInkfk {
>   def main(args: Array[String]) {
> val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args
> val ssc = StreamingContext.getOrCreate(checkPointDir, () => 
> createContext(args))
> ssc.start()
> ssc.awaitTermination()
>   }
>   def createContext(args : Array[String]) : StreamingContext = {
> val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args
> val sparkConf = new SparkConf().setAppName("KafkaWordCount")
> val ssc = new StreamingContext(sparkConf, Seconds(batchSize.toLong))
> ssc.checkpoint(checkPointDir)
> val topicArr1 = topic1.split(",")
> val topicSet1 = topicArr1.toSet
> val topicArr2 = topic2.split(",")
> val topicSet2 = topicArr2.toSet
> val kafkaParams = Map[String, String](
>   "metadata.broker.list" -> brokers
> )
> val lines1 = KafkaUtils.createDirectStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topicSet1)
> val words1 = lines1.map(_._2).flatMap(_.split(" "))
> val wordCounts1 = words1.map(x => {
>   (x, 1L)}).reduceByKey(_ + _)
> wordCounts1.print()
> val lines2 = KafkaUtils.createDirectStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topicSet2)
> val words2 = lines1.map(_._2).flatMap(_.split(" "))
> val wordCounts2 = words2.map(x => {
>   (x, 1L)}).reduceByKey(_ + _)
> wordCounts2.print()
> return ssc
>   }
> }
> When restarting from the checkpoint file, it throws a NullPointerException:
> java.lang.NullPointerException
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply$mcV$sp(DStreamCheckpointData.scala:126)
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124)
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291)
>   at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.writeObject(DStreamCheckpointData.scala:124)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:528)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291)
>   at 
> org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:523)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at 

[jira] [Created] (SPARK-20436) NullPointerException when restart from checkpoint file

2017-04-22 Thread fangfengbin (JIRA)
fangfengbin created SPARK-20436:
---

 Summary: NullPointerException when restart from checkpoint file
 Key: SPARK-20436
 URL: https://issues.apache.org/jira/browse/SPARK-20436
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 1.5.0
Reporter: fangfengbin


I have written a Spark Streaming application which has two DStreams.
The code is:

object KafkaTwoInkfk {
  def main(args: Array[String]) {
    val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args
    val ssc = StreamingContext.getOrCreate(checkPointDir, () => createContext(args))

    ssc.start()
    ssc.awaitTermination()
  }

  def createContext(args: Array[String]): StreamingContext = {
    val Array(checkPointDir, brokers, topic1, topic2, batchSize) = args
    val sparkConf = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(batchSize.toLong))

    ssc.checkpoint(checkPointDir)
    val topicArr1 = topic1.split(",")
    val topicSet1 = topicArr1.toSet
    val topicArr2 = topic2.split(",")
    val topicSet2 = topicArr2.toSet

    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers
    )

    val lines1 = KafkaUtils.createDirectStream[String, String, StringDecoder,
      StringDecoder](ssc, kafkaParams, topicSet1)
    val words1 = lines1.map(_._2).flatMap(_.split(" "))
    val wordCounts1 = words1.map(x => (x, 1L)).reduceByKey(_ + _)
    wordCounts1.print()

    val lines2 = KafkaUtils.createDirectStream[String, String, StringDecoder,
      StringDecoder](ssc, kafkaParams, topicSet2)
    val words2 = lines1.map(_._2).flatMap(_.split(" "))
    val wordCounts2 = words2.map(x => (x, 1L)).reduceByKey(_ + _)
    wordCounts2.print()

    return ssc
  }
}

When restarting from the checkpoint file, it throws a NullPointerException:
java.lang.NullPointerException
at 
org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply$mcV$sp(DStreamCheckpointData.scala:126)
at 
org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124)
at 
org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$writeObject$1.apply(DStreamCheckpointData.scala:124)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291)
at 
org.apache.spark.streaming.dstream.DStreamCheckpointData.writeObject(DStreamCheckpointData.scala:124)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:528)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply(DStream.scala:523)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1291)
at 
org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:523)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at 

[jira] [Assigned] (SPARK-17928) No driver.memoryOverhead setting for mesos cluster mode

2017-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17928:


Assignee: Apache Spark

> No driver.memoryOverhead setting for mesos cluster mode
> ---
>
> Key: SPARK-17928
> URL: https://issues.apache.org/jira/browse/SPARK-17928
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.1
>Reporter: Drew Robb
>Assignee: Apache Spark
>
> Mesos cluster mode does not have a configuration setting for the driver's 
> memory overhead. This makes scheduling long running drivers on mesos using 
> dispatcher very unreliable. There is an equivalent setting for YARN: 
> spark.yarn.driver.memoryOverhead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17928) No driver.memoryOverhead setting for mesos cluster mode

2017-04-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17928:


Assignee: (was: Apache Spark)

> No driver.memoryOverhead setting for mesos cluster mode
> ---
>
> Key: SPARK-17928
> URL: https://issues.apache.org/jira/browse/SPARK-17928
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.1
>Reporter: Drew Robb
>
> Mesos cluster mode does not have a configuration setting for the driver's 
> memory overhead. This makes scheduling long running drivers on mesos using 
> dispatcher very unreliable. There is an equivalent setting for YARN: 
> spark.yarn.driver.memoryOverhead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17928) No driver.memoryOverhead setting for mesos cluster mode

2017-04-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979818#comment-15979818
 ] 

Apache Spark commented on SPARK-17928:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/17726

> No driver.memoryOverhead setting for mesos cluster mode
> ---
>
> Key: SPARK-17928
> URL: https://issues.apache.org/jira/browse/SPARK-17928
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.1
>Reporter: Drew Robb
>
> Mesos cluster mode does not have a configuration setting for the driver's 
> memory overhead. This makes scheduling long running drivers on mesos using 
> dispatcher very unreliable. There is an equivalent setting for YARN: 
> spark.yarn.driver.memoryOverhead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19897) Unable to access History server web UI

2017-04-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19897.
--
Resolution: Invalid

Questions should be asked on the mailing list. I am resolving this.

> Unable to access History server web UI
> --
>
> Key: SPARK-19897
> URL: https://issues.apache.org/jira/browse/SPARK-19897
> Project: Spark
>  Issue Type: Question
>  Components: Web UI
>Affects Versions: 2.1.0
> Environment: centos-7
>Reporter: Eduardo Rodrigues
>
> Can't access the history server web UI, although the log after executing 
> ./sbin/start-history-server.sh says: 
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 17/03/10 11:53:37 INFO HistoryServer: Started daemon with process name: 
> 25390@centos-1
> 17/03/10 11:53:37 INFO SignalUtils: Registered signal handler for TERM
> 17/03/10 11:53:37 INFO SignalUtils: Registered signal handler for HUP
> 17/03/10 11:53:37 INFO SignalUtils: Registered signal handler for INT
> 17/03/10 11:53:37 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/03/10 11:53:37 INFO SecurityManager: Changing view acls to: centos
> 17/03/10 11:53:37 INFO SecurityManager: Changing modify acls to: centos
> 17/03/10 11:53:37 INFO SecurityManager: Changing view acls groups to:
> 17/03/10 11:53:37 INFO SecurityManager: Changing modify acls groups to:
> 17/03/10 11:53:37 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(centos); groups 
> with view permissions: Set(); users  with modify perm$
> 17/03/10 11:53:38 INFO FsHistoryProvider: Replaying log path: 
> file:/home/centos/spark-2.1.0-bin-hadoop2.7/work/app-20170310115034-0017
> 17/03/10 11:53:38 INFO Utils: Successfully started service on port 18080.
> 17/03/10 11:53:38 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://10.1.0.185:18080
> 17/03/10 11:53:38 INFO FsHistoryProvider: Replaying log path: 
> file:/home/centos/spark-2.1.0-bin-hadoop2.7/work/app-20170310113453-0016
> 17/03/10 11:53:38 INFO FsHistoryProvider: Replaying log path: 
> file:/home/centos/spark-2.1.0-bin-hadoop2.7/work/app-20170310105045-0010
> 17/03/10 11:53:38 INFO FsHistoryProvider: Replaying log path: 
> file:/home/centos/spark-2.1.0-bin-hadoop2.7/work/app-20170310104828-0009
> 17/03/10 11:53:38 INFO FsHistoryProvider: Replaying log path: 
> file:/home/centos/spark-2.1.0-bin-hadoop2.7/work/app-20170310104352-0007
> I changed spark-defaults to: 
> spark.eventLog.enabled true
> spark.eventLog.dir /home/centos/spark-2.1.0-bin-hadoop2.7/work/
> spark.history.fs.logDirectory 
> file:/home/centos/spark-2.1.0-bin-hadoop2.7/work/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-22 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979788#comment-15979788
 ] 

Helena Edelson commented on SPARK-18057:


Confirming that https://issues.apache.org/jira/browse/KAFKA-4879 - 
KafkaConsumer.position may hang forever when deleting a topic - is the only 
blocker. I upgraded in my fork with some minor code changes and the 
delete-related tests in spark-sql-kafka-0-10 hang. I can submit this as a PR as 
soon as that is resolved.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org