[jira] [Created] (SPARK-15814) Aggregator can return null result

2016-06-07 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-15814:
---

 Summary: Aggregator can return null result
 Key: SPARK-15814
 URL: https://issues.apache.org/jira/browse/SPARK-15814
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
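
The ticket above carries no description in this archive. For context only, a minimal sketch of the org.apache.spark.sql.expressions.Aggregator API the summary refers to; the names and types below are illustrative and not taken from the report.

{code}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Illustrative typed Aggregator; nothing below is taken from the ticket.
object SumAgg extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L                              // empty buffer
  def reduce(buf: Long, in: Long): Long = buf + in // fold one input into the buffer
  def merge(b1: Long, b2: Long): Long = b1 + b2    // combine partial buffers
  def finish(buf: Long): Long = buf                // result the caller sees
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}
// Used e.g. as ds.select(SumAgg.toColumn)
{code}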



[jira] [Commented] (SPARK-9623) RandomForestRegressor: provide variance of predictions

2016-06-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15320036#comment-15320036
 ] 

Yanbo Liang commented on SPARK-9623:


[~MechCoder] I'm not working on this, please feel free to take over.

> RandomForestRegressor: provide variance of predictions
> --
>
> Key: SPARK-9623
> URL: https://issues.apache.org/jira/browse/SPARK-9623
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Variance of predicted value, as estimated from training data.
> Analogous to class probabilities for classification.
> See [SPARK-3727] for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
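
A rough workaround sketch for the feature being requested, using the older spark.mllib forest model rather than the proposed spark.ml API (which does not exist yet): estimate the variance of a prediction from the spread of the individual trees.

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

// Rough workaround sketch, not the requested API: the sample variance of the
// per-tree predictions for a single feature vector.
def predictionVariance(model: RandomForestModel, features: Vector): Double = {
  val preds = model.trees.map(_.predict(features))
  val mean  = preds.sum / preds.length
  preds.map(p => math.pow(p - mean, 2)).sum / preds.length
}
{code}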



[jira] [Commented] (SPARK-15369) Investigate selectively using Jython for parts of PySpark

2016-06-07 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319966#comment-15319966
 ] 

holdenk commented on SPARK-15369:
-

WIP design document 
https://docs.google.com/document/d/1L-F12nVWSLEOW72sqOn6Mt1C0bcPFP9ck7gEMH2_IXE/edit?usp=sharing

> Investigate selectively using Jython for parts of PySpark
> -
>
> Key: SPARK-15369
> URL: https://issues.apache.org/jira/browse/SPARK-15369
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Minor
>
> Transferring data from the JVM to the Python executor can be a substantial 
> bottleneck. While Jython is not suitable for all UDFs or map functions, it 
> may be suitable for some simple ones. We should investigate the option of 
> using Jython to accelerate these small functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
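
For illustration only, a small sketch of the kind of JVM-side evaluation the proposal is about: running a trivial Python expression through Jython inside Scala code instead of shipping rows to a separate Python worker. The snippet and names are assumptions, not part of the design document.

{code}
import org.python.util.PythonInterpreter

// Illustration only: evaluate a trivial Python expression on the JVM via
// Jython, so no data has to cross to a separate Python worker process.
object JythonEval {
  // One interpreter per JVM; PythonInterpreter is not serializable.
  @transient lazy val interp = new PythonInterpreter()

  def upper(s: String): String = {
    interp.set("s", s)
    interp.exec("result = s.upper()")
    interp.get("result", classOf[String])
  }
}
// Could be registered as a UDF, e.g. spark.udf.register("jython_upper", JythonEval.upper _)
{code}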



[jira] [Commented] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading

2016-06-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319930#comment-15319930
 ] 

Apache Spark commented on SPARK-15813:
--

User 'peterableda' has created a pull request for this issue:
https://github.com/apache/spark/pull/13552

> Spark Dyn Allocation Cancel log message misleading
> --
>
> Key: SPARK-15813
> URL: https://issues.apache.org/jira/browse/SPARK-15813
> Project: Spark
>  Issue Type: Bug
>Reporter: Peter Ableda
>Priority: Trivial
>
> *Driver requested* message is logged before the *Canceling* message but has 
> the updated executor number. The messages are misleading.
> See log snippet:
> {code}
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
> 619 executor(s).
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor 
> containers
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 
> 0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 
> 0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 
> 0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 
> 0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 
> 0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 
> 0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 
> 0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 
> 0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 
> 0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 
> 0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
> 614 executor(s).
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor 
> containers
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 
> 0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 
> bytes)
> {code}
> The easy solution is to update the message to use past tense. This is 
> consistent with the other messages there.
> *Canceled requests for 5 executor container(s).*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
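
A minimal sketch of the proposed past-tense wording; the variable name is an assumption, not taken from YarnAllocator.

{code}
// Sketch of the suggested message only; `numToCancel` is an assumed name.
object AllocatorMessages {
  def canceled(numToCancel: Int): String =
    s"Canceled requests for $numToCancel executor container(s)."
}
{code}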



[jira] [Assigned] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading

2016-06-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15813:


Assignee: (was: Apache Spark)

> Spark Dyn Allocation Cancel log message misleading
> --
>
> Key: SPARK-15813
> URL: https://issues.apache.org/jira/browse/SPARK-15813
> Project: Spark
>  Issue Type: Bug
>Reporter: Peter Ableda
>Priority: Trivial
>
> *Driver requested* message is logged before the *Canceling* message but has 
> the updated executor number. The messages are misleading.
> See log snippet:
> {code}
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
> 619 executor(s).
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor 
> containers
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 
> 0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 
> 0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 
> 0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 
> 0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 
> 0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 
> 0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 
> 0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 
> 0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 
> 0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 
> 0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
> 614 executor(s).
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor 
> containers
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 
> 0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 
> bytes)
> {code}
> The easy solution is to update the message to use past tense. This is 
> consistent with the other messages there.
> *Canceled requests for 5 executor container(s).*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading

2016-06-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15813:


Assignee: Apache Spark

> Spark Dyn Allocation Cancel log message misleading
> --
>
> Key: SPARK-15813
> URL: https://issues.apache.org/jira/browse/SPARK-15813
> Project: Spark
>  Issue Type: Bug
>Reporter: Peter Ableda
>Assignee: Apache Spark
>Priority: Trivial
>
> *Driver requested* message is logged before the *Canceling* message but has 
> the updated executor number. The messages are misleading.
> See log snippet:
> {code}
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
> 619 executor(s).
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor 
> containers
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 
> 0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 
> 0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 
> 0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 
> 0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 
> 0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 
> 0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 
> 0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 
> 0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 
> 0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000)
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 
> 0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 
> bytes)
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
> 614 executor(s).
> 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor 
> containers
> 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 
> 0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 
> bytes)
> {code}
> The easy solution is to update the message to use past tense. This is 
> consistent with the other messages there.
> *Canceled requests for 5 executor container(s).*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading

2016-06-07 Thread Peter Ableda (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Ableda updated SPARK-15813:
-
Description: 
*Driver requested* message is logged before the *Canceling* message but has the 
updated executor number. The messages are misleading.

See log snippet:
{code}
16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
619 executor(s).
16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor 
containers
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 
0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 
0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 
0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 
0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 
0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 
0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 
0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 
0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 
0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 
0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
614 executor(s).
16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor 
containers
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 
0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 bytes)
{code}

The easy solution is to update the message to use past sentence. This is 
consistent with the other messages there.

*Canceled requests for 5 executor container(s).*

  was:
Driver requested message is logged before the *Canceling* message but has the 
updated executor number. The messages are misleading.

See log snippet:
{code}
16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
619 executor(s).
16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor 
containers
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 
0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 
0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 
0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 
0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 
0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 
0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 
0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 
0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 
0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 
0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
614 executor(s).
16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor 
containers
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 
0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 bytes)
{code}

The easy solution is to update the message to use past sentence. This is 
consistent with the other messages there.

*Canceled requests for 5 executor container(s).*


> Spark Dyn Allocation Cancel log message misleading
> 

[jira] [Updated] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading

2016-06-07 Thread Peter Ableda (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Ableda updated SPARK-15813:
-
Description: 
*Driver requested* message is logged before the *Canceling* message but has the 
updated executor number. The messages are misleading.

See log snippet:
{code}
16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
619 executor(s).
16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor 
containers
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 
0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 
0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 
0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 
0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 
0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 
0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 
0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 
0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 
0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 
0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
614 executor(s).
16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor 
containers
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 
0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 bytes)
{code}

The easy solution is to update the message to use past tense. This is 
consistent with the other messages there.

*Canceled requests for 5 executor container(s).*

  was:
*Driver requested* message is logged before the *Canceling* message but has the 
updated executor number. The messages are misleading.

See log snippet:
{code}
16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
619 executor(s).
16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor 
containers
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 
0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 
0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 
0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 
0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 
0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 
0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 
0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 
0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 
0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 
0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
614 executor(s).
16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor 
containers
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 
0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 bytes)
{code}

The easy solution is to update the message to use past sentence. This is 
consistent with the other messages there.

*Canceled requests for 5 executor container(s).*


> Spark Dyn Allocation Cancel log message misleading
> 

[jira] [Created] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading

2016-06-07 Thread Peter Ableda (JIRA)
Peter Ableda created SPARK-15813:


 Summary: Spark Dyn Allocation Cancel log message misleading
 Key: SPARK-15813
 URL: https://issues.apache.org/jira/browse/SPARK-15813
 Project: Spark
  Issue Type: Bug
Reporter: Peter Ableda
Priority: Trivial


Driver requested message is logged before the *Canceling* message but has the 
updated executor number. The messages are misleading.

See log snippet:
{code}
16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
619 executor(s).
16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor 
containers
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 
0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 
0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 
0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 
0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 
0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 
0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 
0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 
0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 
0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000)
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 
0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 bytes)
16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 
614 executor(s).
16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor 
containers
16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 
0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 bytes)
{code}

The easy solution is to update the message to use past sentence. This is 
consistent with the other messages there.

*Canceled requests for 5 executor container(s).*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15755) java.lang.NullPointerException when run spark 2.0 setting spark.serializer=org.apache.spark.serializer.KryoSerializer

2016-06-07 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell closed SPARK-15755.
-
Resolution: Duplicate

> java.lang.NullPointerException when run spark 2.0 setting 
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> -
>
> Key: SPARK-15755
> URL: https://issues.apache.org/jira/browse/SPARK-15755
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: marymwu
>
> java.lang.NullPointerException when run spark 2.0 setting 
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:627)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:329)
>   at java.util.PriorityQueue.add(PriorityQueue.java:306)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:711)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
>   ... 15 more
> 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> 
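
A minimal configuration sketch for the setup described above (not taken from the ticket): Kryo is enabled as in the report, and the stack trace points at a BoundedPriorityQueue-backed task result being deserialized on the driver, e.g. from a top-K style action.

{code}
import org.apache.spark.sql.SparkSession

// Configuration sketch only, not a confirmed reproduction of the ticket.
val spark = SparkSession.builder()
  .appName("kryo-npe-sketch")
  .master("local[2]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// takeOrdered is one action whose results are merged on the driver through
// org.apache.spark.util.BoundedPriorityQueue, as in the stack trace above.
val top5 = spark.sparkContext.parallelize(1 to 1000).takeOrdered(5)
{code}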

[jira] [Commented] (SPARK-15802) SparkSQL connection fail using shell command "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default""

2016-06-07 Thread marymwu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319908#comment-15319908
 ] 

marymwu commented on SPARK-15802:
-

Looking forward to your reply, thanks.

> SparkSQL connection fail using shell command "bin/beeline -u 
> "jdbc:hive2://*.*.*.*:1/default""
> --
>
> Key: SPARK-15802
> URL: https://issues.apache.org/jira/browse/SPARK-15802
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: marymwu
>
> Steps to reproduce:
> 1. Execute shell "sbin/start-thriftserver.sh --master yarn";
> 2. Execute shell "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default"";
> Actual result:
> The SparkSQL connection fails and the log shows the following:
> 16/06/07 14:49:18 WARN HttpParser: Illegal character 0x1 in state=START for 
> buffer 
> HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type:
>  application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00}
> 16/06/07 14:49:18 WARN HttpParser: badMessage: 400 Illegal character 0x1 for 
> HttpChannelOverHttp@718db102{r=0,c=false,a=IDLE,uri=}
> 16/06/07 14:49:19 WARN HttpParser: Illegal character 0x1 in state=START for 
> buffer 
> HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type:
>  application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00}
> 16/06/07 14:49:19 WARN HttpParser: badMessage: 400 Illegal character 0x1 for 
> HttpChannelOverHttp@195db217{r=0,c=false,a=IDLE,uri=}
> Note:
> The SparkSQL connection succeeds when using the shell command "bin/beeline -u 
> "jdbc:hive2://*.*.*.*:10000/default;transportMode=http;httpPath=cliservice"",
> i.e. with two parameters (transportMode and httpPath) added to the URL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
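
For reference, a hedged sketch of the same HTTP-mode connection made through plain JDBC from Scala. Host, credentials and the query are placeholders, and the Hive JDBC driver is assumed to be on the classpath.

{code}
import java.sql.DriverManager

// Sketch only: connect with the HTTP-mode URL from the note above.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val url  = "jdbc:hive2://host:10000/default;transportMode=http;httpPath=cliservice"
val conn = DriverManager.getConnection(url, "anonymous", "")
val rs   = conn.createStatement().executeQuery("SELECT 1")
while (rs.next()) println(rs.getInt(1))
conn.close()
{code}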



[jira] [Commented] (SPARK-15802) SparkSQL connection fail using shell command "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default""

2016-06-07 Thread marymwu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319906#comment-15319906
 ] 

marymwu commented on SPARK-15802:
-

What's the right protocol, and how do I specify it?

> SparkSQL connection fail using shell command "bin/beeline -u 
> "jdbc:hive2://*.*.*.*:1/default""
> --
>
> Key: SPARK-15802
> URL: https://issues.apache.org/jira/browse/SPARK-15802
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: marymwu
>
> Steps to reproduce:
> 1. Execute shell "sbin/start-thriftserver.sh --master yarn";
> 2. Execute shell "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default"";
> Actual result:
> The SparkSQL connection fails and the log shows the following:
> 16/06/07 14:49:18 WARN HttpParser: Illegal character 0x1 in state=START for 
> buffer 
> HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type:
>  application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00}
> 16/06/07 14:49:18 WARN HttpParser: badMessage: 400 Illegal character 0x1 for 
> HttpChannelOverHttp@718db102{r=0,c=false,a=IDLE,uri=}
> 16/06/07 14:49:19 WARN HttpParser: Illegal character 0x1 in state=START for 
> buffer 
> HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type:
>  application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00}
> 16/06/07 14:49:19 WARN HttpParser: badMessage: 400 Illegal character 0x1 for 
> HttpChannelOverHttp@195db217{r=0,c=false,a=IDLE,uri=}
> Note:
> The SparkSQL connection succeeds when using the shell command "bin/beeline -u 
> "jdbc:hive2://*.*.*.*:10000/default;transportMode=http;httpPath=cliservice"",
> i.e. with two parameters (transportMode and httpPath) added to the URL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14485) Task finished cause fetch failure when its executor has already been removed by driver

2016-06-07 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-14485:
--

Assignee: iward

> Task finished cause fetch failure when its executor has already been removed 
> by driver 
> ---
>
> Key: SPARK-14485
> URL: https://issues.apache.org/jira/browse/SPARK-14485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.5.2
>Reporter: iward
>Assignee: iward
> Fix For: 2.0.0
>
>
> Currently, when an executor is removed by the driver because its heartbeats 
> timed out, the driver re-queues the tasks from that executor and sends a kill 
> command to the cluster to kill the executor.
> However, a running task on that executor may finish and return its result to 
> the driver before the executor is actually killed by that command. In that 
> case the driver accepts the task-finished event and ignores the speculative 
> and re-queued copies of the task. But since the executor has already been 
> removed, the result of the finished task cannot be registered on the driver, 
> because its *BlockManagerId* has also been removed from *BlockManagerMaster*. 
> The result data of the stage is therefore incomplete, which then causes a 
> fetch failure.
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
> executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
> 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
> 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
> tasks for 322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 
> 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
> ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
> 322 (epoch 11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
> Trying to remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 
> 322 successfully in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
> 229.0 in stage 107.0 (TID 10384) in 272315 ms on 
> BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring 
> task-finished event for 229.1 in stage 107.0 because task 229 has already 
> completed successfully
> {noformat}
> {noformat}
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 
> missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at 
> mapPartitions at Exchange.scala:137)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task 
> set 107.1 with 3 tasks
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 
> 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> {noformat}
> The driver detects that the stage's result is incomplete and submits the 
> missing tasks, but by then the next stage has already started, because the 
> previous stage was considered finished once all of its tasks had completed, 
> even though its result was incomplete.
> {noformat}
> 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 
> 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): 
> FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message=
> 2015-12-31 04:40:13 INFO 
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle 11
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> 

[jira] [Assigned] (SPARK-15755) java.lang.NullPointerException when run spark 2.0 setting spark.serializer=org.apache.spark.serializer.KryoSerializer

2016-06-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15755:


Assignee: (was: Apache Spark)

> java.lang.NullPointerException when run spark 2.0 setting 
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> -
>
> Key: SPARK-15755
> URL: https://issues.apache.org/jira/browse/SPARK-15755
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: marymwu
>
> java.lang.NullPointerException when run spark 2.0 setting 
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:627)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:329)
>   at java.util.PriorityQueue.add(PriorityQueue.java:306)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:711)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
>   ... 15 more
> 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> 

[jira] [Assigned] (SPARK-15755) java.lang.NullPointerException when run spark 2.0 setting spark.serializer=org.apache.spark.serializer.KryoSerializer

2016-06-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15755:


Assignee: Apache Spark

> java.lang.NullPointerException when run spark 2.0 setting 
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> -
>
> Key: SPARK-15755
> URL: https://issues.apache.org/jira/browse/SPARK-15755
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: marymwu
>Assignee: Apache Spark
>
> java.lang.NullPointerException when run spark 2.0 setting 
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:627)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:329)
>   at java.util.PriorityQueue.add(PriorityQueue.java:306)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:711)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
>   ... 15 more
> 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 

[jira] [Commented] (SPARK-15755) java.lang.NullPointerException when run spark 2.0 setting spark.serializer=org.apache.spark.serializer.KryoSerializer

2016-06-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319866#comment-15319866
 ] 

Apache Spark commented on SPARK-15755:
--

User 'marymwu' has created a pull request for this issue:
https://github.com/apache/spark/pull/13550

> java.lang.NullPointerException when run spark 2.0 setting 
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> -
>
> Key: SPARK-15755
> URL: https://issues.apache.org/jira/browse/SPARK-15755
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: marymwu
>
> java.lang.NullPointerException when run spark 2.0 setting 
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:627)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:329)
>   at java.util.PriorityQueue.add(PriorityQueue.java:306)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:711)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
>   ... 15 more
> 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
>   at 
> 

[jira] [Created] (SPARK-15812) Allow sorting on aggregated streaming dataframe when the output mode is Complete

2016-06-07 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-15812:
-

 Summary: Allow sorting on aggregated streaming dataframe when the 
output mode is Complete
 Key: SPARK-15812
 URL: https://issues.apache.org/jira/browse/SPARK-15812
 Project: Spark
  Issue Type: Sub-task
Reporter: Tathagata Das
Assignee: Tathagata Das


When the output mode is complete, then the output of a streaming aggregation 
essentially will contain the complete aggregates every time. So this is not 
different from a batch dataset within an incremental execution. Other 
non-streaming operations should be supported on this dataset. In this JIRA, we 
are just adding support for sorting, as it is a commonly useful operation. 
Support for other operations will come later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
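
A sketch of the use case being enabled above; the socket source, host/port and column names are assumptions, not taken from the ticket.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("complete-mode-sort").getOrCreate()
import spark.implicits._

// Illustrative streaming word count over a socket source.
val lines  = spark.readStream.format("socket")
  .option("host", "localhost").option("port", "9999").load()
val counts = lines.groupBy("value").count().orderBy($"count".desc)

// With Complete output mode the full aggregate is re-emitted every trigger,
// which is what makes the sort well defined.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
{code}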



[jira] [Resolved] (SPARK-15517) Add support for complete output mode

2016-06-07 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-15517.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Add support for complete output mode 
> -
>
> Key: SPARK-15517
> URL: https://issues.apache.org/jira/browse/SPARK-15517
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.0
>
>
> Currently structured streaming only supports append output mode. This task is 
> to do the following. 
> - Add support for complete output mode in the planner
> - Add public API for users to specify output mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15046) When running hive-thriftserver with yarn on a secure cluster the workers fail with java.lang.NumberFormatException

2016-06-07 Thread Jie Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319817#comment-15319817
 ] 

Jie Huang commented on SPARK-15046:
---

OK, I see. Thanks [~tleftwich]. If so, it seems we'd better use the new 
config API, for example:

 {code:borderStyle=solid}
sparkConf.get(TOKEN_RENEWAL_INTERVAL, (24 hours).toMillis)
{code}



> When running hive-thriftserver with yarn on a secure cluster the workers fail 
> with java.lang.NumberFormatException
> --
>
> Key: SPARK-15046
> URL: https://issues.apache.org/jira/browse/SPARK-15046
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Trystan Leftwich
>
> When running hive-thriftserver with yarn on a secure cluster 
> (spark.yarn.principal and spark.yarn.keytab are set) the workers fail with 
> the following error.
> {code}
> 16/04/30 22:40:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
> java.lang.NumberFormatException: For input string: "86400079ms"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:441)
>   at java.lang.Long.parseLong(Long.java:483)
>   at 
> scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
>   at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.SparkConf.getLong(SparkConf.scala:380)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.getTimeFromNowToRenewal(SparkHadoopUtil.scala:289)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.org$apache$spark$deploy$yarn$AMDelegationTokenRenewer$$scheduleRenewal$1(AMDelegationTokenRenewer.scala:89)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.scheduleLoginFromKeytab(AMDelegationTokenRenewer.scala:121)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:723)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:721)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:748)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15789) Allow reserved keywords in most places

2016-06-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15789:

Assignee: Herman van Hovell

> Allow reserved keywords in most places
> --
>
> Key: SPARK-15789
> URL: https://issues.apache.org/jira/browse/SPARK-15789
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>
> The current parser doesn't allow a number of SQL keywords to be used as 
> identifiers (for tables and fields). We should allow this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15789) Allow reserved keywords in most places

2016-06-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15789.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13534
[https://github.com/apache/spark/pull/13534]

> Allow reserved keywords in most places
> --
>
> Key: SPARK-15789
> URL: https://issues.apache.org/jira/browse/SPARK-15789
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
> Fix For: 2.0.0
>
>
> The current parser doesn't allow a number of SQL keywords to be used as 
> identifiers (for tables and fields). We should allow this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10

2016-06-07 Thread Franklyn Dsouza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franklyn Dsouza updated SPARK-15811:

Description: 
I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
{code}
./dev/change-version-to-2.10.sh
./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
-Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
{code}
and then ran the following code in a pyspark shell
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType
from pyspark.sql.functions import udf
from pyspark.sql.types import Row
spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
add_one = udf(lambda x: x + 1, IntegerType())
schema = StructType([StructField('a', IntegerType(), False)])
df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
df.select(add_one(df.a).alias('incremented')).collect()
{code}

This never returns with a result. 


  was:
I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
{code}
./dev/change-version-to-2.10.sh
./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
-Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
{code}
and then ran the following code in a pyspark shell
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType
from pyspark.sql.functions import udf
from pyspark.sql.types import Row
spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
add_one = udf(lambda x: x + 1, IntegerType())
schema = StructType([StructField('a', IntegerType(), False)])
df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
df.select(add_one(df.a).alias('incremented')).collect()
{code:xml}
This never returns with a result. 



> UDFs do not work in Spark 2.0-preview built with scala 2.10
> ---
>
> Key: SPARK-15811
> URL: https://issues.apache.org/jira/browse/SPARK-15811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
> {code}
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
> {code}
> and then ran the following code in a pyspark shell
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 
> DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()
> {code}
> This never returns with a result. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10

2016-06-07 Thread Franklyn Dsouza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franklyn Dsouza updated SPARK-15811:

Description: 
I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
{code}
./dev/change-version-to-2.10.sh
./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
-Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
{code}
and then ran the following code in a pyspark shell
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType
from pyspark.sql.functions import udf
from pyspark.sql.types import Row
spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
add_one = udf(lambda x: x + 1, IntegerType())
schema = StructType([StructField('a', IntegerType(), False)])
df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
df.select(add_one(df.a).alias('incremented')).collect()
{code:xml}
This never returns with a result. 


  was:
I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following

./dev/change-version-to-2.10.sh
./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
-Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive

and then ran the following code in a pyspark shell

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType
from pyspark.sql.functions import udf
from pyspark.sql.types import Row
spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
add_one = udf(lambda x: x + 1, IntegerType())
schema = StructType([StructField('a', IntegerType(), False)])
df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
df.select(add_one(df.a).alias('incremented')).collect()

This never returns with a result. 



> UDFs do not work in Spark 2.0-preview built with scala 2.10
> ---
>
> Key: SPARK-15811
> URL: https://issues.apache.org/jira/browse/SPARK-15811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
> {code}
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
> {code}
> and then ran the following code in a pyspark shell
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 
> DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()
> {code:xml}
> This never returns with a result. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10

2016-06-07 Thread Franklyn Dsouza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franklyn Dsouza updated SPARK-15811:

Description: 
I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following

{code}
./dev/change-version-to-2.10.sh
./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
-Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
{code}

and then ran the following code in a pyspark shell

{code}
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType
from pyspark.sql.functions import udf
from pyspark.sql.types import Row
spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
add_one = udf(lambda x: x + 1, IntegerType())
schema = StructType([StructField('a', IntegerType(), False)])
df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
df.select(add_one(df.a).alias('incremented')).collect()
{code}

This never returns with a result. 


  was:
I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
{code}
./dev/change-version-to-2.10.sh
./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
-Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
{code}
and then ran the following code in a pyspark shell
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType
from pyspark.sql.functions import udf
from pyspark.sql.types import Row
spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
add_one = udf(lambda x: x + 1, IntegerType())
schema = StructType([StructField('a', IntegerType(), False)])
df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
df.select(add_one(df.a).alias('incremented')).collect()
{code}

This never returns with a result. 



> UDFs do not work in Spark 2.0-preview built with scala 2.10
> ---
>
> Key: SPARK-15811
> URL: https://issues.apache.org/jira/browse/SPARK-15811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Franklyn Dsouza
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following
> {code}
> ./dev/change-version-to-2.10.sh
> ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
> -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive
> {code}
> and then ran the following code in a pyspark shell
> {code}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import IntegerType, StructField, StructType
> from pyspark.sql.functions import udf
> from pyspark.sql.types import Row
> spark = SparkSession.builder.master('local[4]').appName('2.0 
> DF').getOrCreate()
> add_one = udf(lambda x: x + 1, IntegerType())
> schema = StructType([StructField('a', IntegerType(), False)])
> df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
> df.select(add_one(df.a).alias('incremented')).collect()
> {code}
> This never returns with a result. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10

2016-06-07 Thread Franklyn Dsouza (JIRA)
Franklyn Dsouza created SPARK-15811:
---

 Summary: UDFs do not work in Spark 2.0-preview built with scala 
2.10
 Key: SPARK-15811
 URL: https://issues.apache.org/jira/browse/SPARK-15811
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Franklyn Dsouza
Priority: Blocker
 Fix For: 2.0.0


I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following

./dev/change-version-to-2.10.sh
./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 
-Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6  -Pyarn -Phive

and then ran the following code in a pyspark shell

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType
from pyspark.sql.functions import udf
from pyspark.sql.types import Row
spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate()
add_one = udf(lambda x: x + 1, IntegerType())
schema = StructType([StructField('a', IntegerType(), False)])
df = spark.createDataFrame([Row(a=1),Row(a=2)], schema)
df.select(add_one(df.a).alias('incremented')).collect()

This never returns with a result. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15804) Manually added metadata not saving with parquet

2016-06-07 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319706#comment-15319706
 ] 

kevin yu commented on SPARK-15804:
--

I will submit a PR soon. Thanks.

> Manually added metadata not saving with parquet
> ---
>
> Key: SPARK-15804
> URL: https://issues.apache.org/jira/browse/SPARK-15804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>
> Adding metadata with col().as(_, metadata) and then saving the resulting 
> dataframe does not persist the metadata, and no error is thrown. The schema 
> contains the metadata before saving, but no longer contains it after the 
> dataframe is saved and loaded back. This was working fine with 1.6.1.
> {code}
> case class TestRow(a: String, b: Int)
> val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil
> val df = spark.createDataFrame(rows)
> import org.apache.spark.sql.types.MetadataBuilder
> val md = new MetadataBuilder().putString("key", "value").build()
> val dfWithMeta = df.select(col("a"), col("b").as("b", md))
> println(dfWithMeta.schema.json)
> dfWithMeta.write.parquet("dfWithMeta")
> val dfWithMeta2 = spark.read.parquet("dfWithMeta")
> println(dfWithMeta2.schema.json)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15580) Add ContinuousQueryInfo to make ContinuousQueryListener events serializable

2016-06-07 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-15580.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13335
[https://github.com/apache/spark/pull/13335]

> Add ContinuousQueryInfo to make ContinuousQueryListener events serializable
> ---
>
> Key: SPARK-15580
> URL: https://issues.apache.org/jira/browse/SPARK-15580
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14485) Task finished cause fetch failure when its executor has already been removed by driver

2016-06-07 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-14485.

   Resolution: Fixed
Fix Version/s: 2.0.0

> Task finished cause fetch failure when its executor has already been removed 
> by driver 
> ---
>
> Key: SPARK-14485
> URL: https://issues.apache.org/jira/browse/SPARK-14485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.5.2
>Reporter: iward
> Fix For: 2.0.0
>
>
> Now, when an executor is removed by the driver because of a heartbeat timeout, 
> the driver re-queues the tasks that were running on that executor and sends a 
> kill command to the cluster to kill the executor.
> However, a task running on that executor may finish and return its result to 
> the driver before the executor is actually killed by the driver's kill 
> command. In that situation the driver accepts the task-finished event and 
> ignores the speculative and re-queued copies of the task. But since the 
> executor has already been removed by the driver, the result of the finished 
> task cannot be registered on the driver, because the *BlockManagerId* has also 
> been removed from *BlockManagerMaster*. As a result, the output of this stage 
> is incomplete, which later causes a fetch failure.
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
> executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
> 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
> 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
> tasks for 322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 
> 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
> ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
> 322 (epoch 11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
> Trying to remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 
> 322 successfully in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
> 229.0 in stage 107.0 (TID 10384) in 272315 ms on 
> BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring 
> task-finished event for 229.1 in stage 107.0 because task 229 has already 
> completed successfully
> {noformat}
> {noformat}
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 
> missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at 
> mapPartitions at Exchange.scala:137)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task 
> set 107.1 with 3 tasks
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 
> 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> {noformat}
> The driver detects that the stage's result is incomplete and submits the 
> missing tasks, but by this time the next stage has already started running, 
> because the previous stage was marked finished once all of its tasks had 
> completed, even though its output was incomplete.
> {noformat}
> 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 
> 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): 
> FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message=
> 2015-12-31 04:40:13 INFO 
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle 11
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> 

[jira] [Commented] (SPARK-11106) Should ML Models contains single models or Pipelines?

2016-06-07 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319685#comment-15319685
 ] 

Xusen Yin commented on SPARK-11106:
---

RFormula is easy to use, but it may not always do the right thing. For example, 
RFormula encodes categorical features with OneHotEncoder, but in some scenarios 
(like RandomForest) a plain StringIndexer is better, as in the sketch below.
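
A minimal sketch of the two feature-handling choices (column names here are 
hypothetical):

{code}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// For tree ensembles such as RandomForest, keeping the category as a single
// indexed column is usually preferable:
val indexer = new StringIndexer()
  .setInputCol("category")        // hypothetical input column
  .setOutputCol("categoryIndex")

// What an RFormula-style transformation would effectively add on top of the
// indexing, which mainly benefits linear models:
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
{code}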

> Should ML Models contains single models or Pipelines?
> -
>
> Key: SPARK-11106
> URL: https://issues.apache.org/jira/browse/SPARK-11106
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> This JIRA is for discussing whether an ML Estimators should do feature 
> processing.
> h2. Issue
> Currently, almost all ML Estimators require strict input types.  E.g., 
> DecisionTreeClassifier requires that the label column be Double type and have 
> metadata indicating the number of classes.
> This requires users to know how to preprocess data.
> h2. Ideal workflow
> A user should be able to pass any reasonable data to a Transformer or 
> Estimator and have it "do the right thing."
> E.g.:
> * If DecisionTreeClassifier is given a String column for labels, it should 
> know to index the Strings.
> * See [SPARK-10513] for a similar issue with OneHotEncoder.
> h2. Possible solutions
> There are a few solutions I have thought of.  Please comment with feedback or 
> alternative ideas!
> h3. Leave as is
> Pro: The current setup is good in that it forces the user to be very aware of 
> what they are doing.  Feature transformations will not happen silently.
> Con: The user has to write boilerplate code for transformations.  The API is 
> not what some users would expect; e.g., coming from R, a user might expect 
> some automatic transformations.
> h3. All Transformers can contain PipelineModels
> We could allow all Transformers and Models to contain arbitrary 
> PipelineModels.  E.g., if a DecisionTreeClassifier were given a String label 
> column, it might return a Model which contains a simple fitted PipelineModel 
> containing StringIndexer + DecisionTreeClassificationModel.
> The API could present this to the user, or it could be hidden from the user.  
> Ideally, it would be hidden from the beginner user, but accessible for 
> experts.
> The main problem is that we might have to break APIs.  E.g., OneHotEncoder 
> may need to do indexing if given a String input column.  This means it should 
> no longer be a Transformer; it should be an Estimator.
> h3. All Estimators should use RFormula
> The best option I have thought of is to make RFormula be the primary method 
> for automatic feature transformation.  We could start adding an RFormula 
> Param to all Estimators, and it could handle most of these feature 
> transformation issues.
> We could maintain old APIs:
> * If a user sets the input column names, then those can be used in the 
> traditional (no automatic transformation) way.
> * If a user sets the RFormula Param, then it can be used instead.  (This 
> should probably take precedence over the old API.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15780) Support mapValues on KeyValueGroupedDataset

2016-06-07 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319594#comment-15319594
 ] 

koert kuipers edited comment on SPARK-15780 at 6/7/16 10:34 PM:


also see this discussion:
https://www.mail-archive.com/user@spark.apache.org/msg51915.html


was (Author: koert):
also see this discussion:
https://mail.google.com/mail/u/0/#label/Active/1552c23b293b1ac8

> Support mapValues on KeyValueGroupedDataset
> ---
>
> Key: SPARK-15780
> URL: https://issues.apache.org/jira/browse/SPARK-15780
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: koert kuipers
>Priority: Minor
>
> Currently when doing groupByKey on a Dataset the key ends up in the values 
> which can be clumsy:
> {noformat}
> val ds: Dataset[(K, V)] = ...
> val grouped: KeyValueGroupedDataset[(K, (K, V))] = ds.groupByKey(_._1)
> {noformat}
> With mapValues one can create something more similar to PairRDDFunctions[K, 
> V]:
> {noformat}
> val ds: Dataset[(K, V)] = ...
> val grouped: KeyValueGroupedDataset[(K, V)] = 
> ds.groupByKey(_._1).mapValues(_._2)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15780) Support mapValues on KeyValueGroupedDataset

2016-06-07 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319594#comment-15319594
 ] 

koert kuipers commented on SPARK-15780:
---

also see this discussion:
https://mail.google.com/mail/u/0/#label/Active/1552c23b293b1ac8

> Support mapValues on KeyValueGroupedDataset
> ---
>
> Key: SPARK-15780
> URL: https://issues.apache.org/jira/browse/SPARK-15780
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: koert kuipers
>Priority: Minor
>
> Currently when doing groupByKey on a Dataset the key ends up in the values 
> which can be clumsy:
> {noformat}
> val ds: Dataset[(K, V)] = ...
> val grouped: KeyValueGroupedDataset[(K, (K, V))] = ds.groupByKey(_._1)
> {noformat}
> With mapValues one can create something more similar to PairRDDFunctions[K, 
> V]:
> {noformat}
> val ds: Dataset[(K, V)] = ...
> val grouped: KeyValueGroupedDataset[(K, V)] = 
> ds.groupByKey(_._1).mapValues(_._2)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14816) Update MLlib, GraphX, SparkR websites for 2.0

2016-06-07 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-14816:
---

Assignee: Yanbo Liang

> Update MLlib, GraphX, SparkR websites for 2.0
> -
>
> Key: SPARK-14816
> URL: https://issues.apache.org/jira/browse/SPARK-14816
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Update the sub-projects' websites to include new features in this release.
> For MLlib, make it clear that the DataFrame-based API is the primary one now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13590) Document the behavior of spark.ml logistic regression and AFT survival regression when there are constant features

2016-06-07 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-13590.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Document the behavior of spark.ml logistic regression and AFT survival 
> regression when there are constant features
> --
>
> Key: SPARK-13590
> URL: https://issues.apache.org/jira/browse/SPARK-13590
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> As discussed in SPARK-13029, we decided to keep the current behavior that 
> sets all coefficients associated with constant feature columns to zero, 
> regardless of intercept, regularization, and standardization settings. This 
> is the same behavior as in glmnet. Since this is different from LIBSVM, we 
> should document the behavior correctly, add tests, and generate warning 
> messages if there are constant columns and `addIntercept` is false.
> cc [~coderxiang] [~dbtsai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14816) Update MLlib, GraphX, SparkR websites for 2.0

2016-06-07 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-14816:

Assignee: (was: Yanbo Liang)

> Update MLlib, GraphX, SparkR websites for 2.0
> -
>
> Key: SPARK-14816
> URL: https://issues.apache.org/jira/browse/SPARK-14816
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Update the sub-projects' websites to include new features in this release.
> For MLlib, make it clear that the DataFrame-based API is the primary one now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15674) Deprecates "CREATE TEMPORARY TABLE USING...", use "CREATE TEMPORARY VIEW USING..." instead.

2016-06-07 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-15674.
---
Resolution: Resolved
  Assignee: Sean Zhong

> Deprecates "CREATE TEMPORARY TABLE USING...", use "CREATE TEMPORARY VIEW 
> USING..." instead.
> ---
>
> Key: SPARK-15674
> URL: https://issues.apache.org/jira/browse/SPARK-15674
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
>
> The current implementation of "CREATE TEMPORARY TABLE USING..." actually 
> creates a temporary VIEW behind the scenes.
> We probably should just use "CREATE TEMPORARY VIEW USING..." instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15810) Aggregator doesn't play nice with Option

2016-06-07 Thread koert kuipers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koert kuipers updated SPARK-15810:
--
Description: 
{noformat}
  val ds1 = List(("a", 1), ("a", 2), ("a", 3)).toDS
  val ds2 = ds1.map{ case (k, v) => (k, if (v > 1) Some(v) else None) }
  val ds3 = ds2.groupByKey(_._1).agg(new Aggregator[(String, Option[Int]), 
Option[Int], Option[Int]]{
def zero: Option[Int] = None

def reduce(b: Option[Int], a: (String, Option[Int])): Option[Int] = 
b.map(bv => a._2.map(av => bv + av).getOrElse(bv)).orElse(a._2)

def merge(b1: Option[Int], b2: Option[Int]): Option[Int] = b1.map(b1v 
=> b2.map(b2v => b1v + b2v).getOrElse(b1v)).orElse(b2)

def finish(reduction: Option[Int]): Option[Int] = reduction

def bufferEncoder: Encoder[Option[Int]] = 
implicitly[Encoder[Option[Int]]]

def outputEncoder: Encoder[Option[Int]] = 
implicitly[Encoder[Option[Int]]]
  }.toColumn)
  ds3.printSchema
  ds3.show
{noformat}

I get a somewhat odd-looking schema as output, and after that the program just 
hangs, pinning one CPU at 100%. The data never shows.
output:
{noformat}
root
 |-- value: string (nullable = true)
 |-- $anon$1(scala.Tuple2): struct (nullable = true)
 ||-- value: integer (nullable = true)
{noformat}


  was:
{noformat}
  val ds1 = List(("a", 1), ("a", 2), ("a", 3)).toDS
  val df1 = ds1.map{ case (k, v) => (k, if (v > 1) Some(v) else None) 
}.toDF("k", "v")
  val df2 = df1.groupBy("k").agg(new Aggregator[(String, Option[Int]), 
Option[Int], Option[Int]]{
def zero: Option[Int] = None

def reduce(b: Option[Int], a: (String, Option[Int])): Option[Int] = 
b.map(bv => a._2.map(av => bv + av).getOrElse(bv)).orElse(a._2)

def merge(b1: Option[Int], b2: Option[Int]): Option[Int] = b1.map(b1v 
=> b2.map(b2v => b1v + b2v).getOrElse(b1v)).orElse(b2)

def finish(reduction: Option[Int]): Option[Int] = reduction

def bufferEncoder: Encoder[Option[Int]] = 
implicitly[Encoder[Option[Int]]]

def outputEncoder: Encoder[Option[Int]] = 
implicitly[Encoder[Option[Int]]]
  }.toColumn)
  df2.printSchema
  df2.show
{noformat}

I get a somewhat odd-looking schema as output, and after that the program just 
hangs, pinning one CPU at 100%. The data never shows.
output:
{noformat}
root
 |-- k: string (nullable = true)
 |-- $anon$1(org.apache.spark.sql.Row): struct (nullable = true)
 ||-- value: integer (nullable = true)
{noformat}



> Aggregator doesn't play nice with Option
> 
>
> Key: SPARK-15810
> URL: https://issues.apache.org/jira/browse/SPARK-15810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: spark 2.0.0-SNAPSHOT
>Reporter: koert kuipers
>
> {noformat}
>   val ds1 = List(("a", 1), ("a", 2), ("a", 3)).toDS
>   val ds2 = ds1.map{ case (k, v) => (k, if (v > 1) Some(v) else None) }
>   val ds3 = ds2.groupByKey(_._1).agg(new Aggregator[(String, 
> Option[Int]), Option[Int], Option[Int]]{
> def zero: Option[Int] = None
> def reduce(b: Option[Int], a: (String, Option[Int])): Option[Int] = 
> b.map(bv => a._2.map(av => bv + av).getOrElse(bv)).orElse(a._2)
> def merge(b1: Option[Int], b2: Option[Int]): Option[Int] = b1.map(b1v 
> => b2.map(b2v => b1v + b2v).getOrElse(b1v)).orElse(b2)
> def finish(reduction: Option[Int]): Option[Int] = reduction
> def bufferEncoder: Encoder[Option[Int]] = 
> implicitly[Encoder[Option[Int]]]
> def outputEncoder: Encoder[Option[Int]] = 
> implicitly[Encoder[Option[Int]]]
>   }.toColumn)
>   ds3.printSchema
>   ds3.show
> {noformat}
> I get a somewhat odd-looking schema as output, and after that the program 
> just hangs, pinning one CPU at 100%. The data never shows.
> output:
> {noformat}
> root
>  |-- value: string (nullable = true)
>  |-- $anon$1(scala.Tuple2): struct (nullable = true)
>  ||-- value: integer (nullable = true)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15810) Aggregator doesn't play nice with Option

2016-06-07 Thread koert kuipers (JIRA)
koert kuipers created SPARK-15810:
-

 Summary: Aggregator doesn't play nice with Option
 Key: SPARK-15810
 URL: https://issues.apache.org/jira/browse/SPARK-15810
 Project: Spark
  Issue Type: Bug
  Components: SQL
 Environment: spark 2.0.0-SNAPSHOT
Reporter: koert kuipers


{noformat}
  val ds1 = List(("a", 1), ("a", 2), ("a", 3)).toDS
  val df1 = ds1.map{ case (k, v) => (k, if (v > 1) Some(v) else None) 
}.toDF("k", "v")
  val df2 = df1.groupBy("k").agg(new Aggregator[(String, Option[Int]), 
Option[Int], Option[Int]]{
def zero: Option[Int] = None

def reduce(b: Option[Int], a: (String, Option[Int])): Option[Int] = 
b.map(bv => a._2.map(av => bv + av).getOrElse(bv)).orElse(a._2)

def merge(b1: Option[Int], b2: Option[Int]): Option[Int] = b1.map(b1v 
=> b2.map(b2v => b1v + b2v).getOrElse(b1v)).orElse(b2)

def finish(reduction: Option[Int]): Option[Int] = reduction

def bufferEncoder: Encoder[Option[Int]] = 
implicitly[Encoder[Option[Int]]]

def outputEncoder: Encoder[Option[Int]] = 
implicitly[Encoder[Option[Int]]]
  }.toColumn)
  df2.printSchema
  df2.show
{noformat}

I get a somewhat odd-looking schema as output, and after that the program just 
hangs, pinning one CPU at 100%. The data never shows.
output:
{noformat}
root
 |-- k: string (nullable = true)
 |-- $anon$1(org.apache.spark.sql.Row): struct (nullable = true)
 ||-- value: integer (nullable = true)
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9623) RandomForestRegressor: provide variance of predictions

2016-06-07 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319450#comment-15319450
 ] 

Manoj Kumar commented on SPARK-9623:


[~yanboliang] Are you still working on this? Would you mind if I take over?

> RandomForestRegressor: provide variance of predictions
> --
>
> Key: SPARK-9623
> URL: https://issues.apache.org/jira/browse/SPARK-9623
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Variance of predicted value, as estimated from training data.
> Analogous to class probabilities for classification.
> See [SPARK-3727] for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14279) Improve the spark build to pick the version information from the pom file and add git commit information

2016-06-07 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-14279:
---
Fix Version/s: (was: 2.1.0)
   2.0.0

> Improve the spark build to pick the version information from the pom file and 
> add git commit information
> 
>
> Key: SPARK-14279
> URL: https://issues.apache.org/jira/browse/SPARK-14279
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Sanket Reddy
>Assignee: Dhruve Ashar
>Priority: Minor
> Fix For: 2.0.0
>
>
> Right now spark-submit --version and other parts of the code pick up the 
> version information from a static SPARK_VERSION. We want to pick the version 
> up from pom.version instead, probably stored inside a properties file. It 
> would also be nice for spark-submit --version to show other details such as 
> the branch and build information.
> Note, the motivation is to more easily tie this to automated continuous 
> integration and deployment and to have traceability.
> Part of this is that right now you have to manually change a Java file to 
> change the version that comes out when you run spark-submit --version. With 
> continuous integration the build numbers could be something like 1.6.1.X 
> (where X increments on each change), and I want to see the exact version 
> easily. Having to manually change a Java file makes that hard. It should also 
> make Apache Spark releases easier, since this file no longer has to be edited 
> by hand.
> The other important part for me is the git information, which lets me trace a 
> build back to exact commits. We have a multi-tenant YARN cluster and users can 
> run many different versions at once; I want to be able to see exactly which 
> version they are running. Reasons for knowing the exact version range from 
> debugging a problem to making sure someone didn't hack something into Spark to 
> cause bad things (generally they should use an approved version), etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15809) PySpark SQL UDF default returnType

2016-06-07 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-15809:
-

 Summary: PySpark SQL UDF default returnType
 Key: SPARK-15809
 URL: https://issues.apache.org/jira/browse/SPARK-15809
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Vladimir Feinberg
Priority: Minor


The current signature for the pyspark UDF creation function is:

{code:python}
pyspark.sql.functions.udf(f, returnType=StringType)
{code}

Is there a reason that there's a default value for {{returnType}}? 
Returning a string doesn't strike me as so much more frequent a use case than, 
say, returning an integer that it merits being the default.

In fact, it seems the only reason that the default was chosen is that if we 
*had to choose* a default type, it would be a {{StringType}} because that's 
what we can implicitly convert everything to.

But this only seems to do two things to me: (1) cause unintentional, annoying 
conversions to strings for new users and (2) make call sites less consistent 
(if people drop the type specification to actually use the default).




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15808) Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats

2016-06-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15808:


Assignee: Apache Spark

> Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to 
> Mismatched File Formats
> ---
>
> Key: SPARK-15808
> URL: https://issues.apache.org/jira/browse/SPARK-15808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Example 1: PARQUET -> CSV
> {noformat}
> createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc")
> {noformat}
> Error we got: 
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): 
> java.lang.RuntimeException: 
> file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc
>  is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but 
> found [79, 82, 67, 23]
> {noformat}
> Example 2: Json -> CSV
> {noformat}
> createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV")
> {noformat}
> No exception, but wrong results:
> {noformat}
> +++
> |  c1|  c2|
> +++
> |null|null|
> |null|null|
> |null|null|
> |null|null|
> |   0|str0|
> |   1|str1|
> |   2|str2|
> |   3|str3|
> |   4|str4|
> |   5|str5|
> |   6|str6|
> |   7|str7|
> |   8|str8|
> |   9|str9|
> +++
> {noformat}
> Example 3: Json -> Text
> {noformat}
> createDF(0, 9).write.format("json").saveAsTable("appendJsonToText")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText")
> {noformat}
> Error we got: 
> {noformat}
> Text data source supports only a single column, and you have 2 columns.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15808) Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats

2016-06-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319156#comment-15319156
 ] 

Apache Spark commented on SPARK-15808:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/13546

> Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to 
> Mismatched File Formats
> ---
>
> Key: SPARK-15808
> URL: https://issues.apache.org/jira/browse/SPARK-15808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Example 1: PARQUET -> CSV
> {noformat}
> createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc")
> {noformat}
> Error we got: 
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): 
> java.lang.RuntimeException: 
> file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc
>  is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but 
> found [79, 82, 67, 23]
> {noformat}
> Example 2: Json -> CSV
> {noformat}
> createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV")
> {noformat}
> No exception, but wrong results:
> {noformat}
> +++
> |  c1|  c2|
> +++
> |null|null|
> |null|null|
> |null|null|
> |null|null|
> |   0|str0|
> |   1|str1|
> |   2|str2|
> |   3|str3|
> |   4|str4|
> |   5|str5|
> |   6|str6|
> |   7|str7|
> |   8|str8|
> |   9|str9|
> +++
> {noformat}
> Example 3: Json -> Text
> {noformat}
> createDF(0, 9).write.format("json").saveAsTable("appendJsonToText")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText")
> {noformat}
> Error we got: 
> {noformat}
> Text data source supports only a single column, and you have 2 columns.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15808) Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats

2016-06-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15808:


Assignee: (was: Apache Spark)

> Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to 
> Mismatched File Formats
> ---
>
> Key: SPARK-15808
> URL: https://issues.apache.org/jira/browse/SPARK-15808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Example 1: PARQUET -> CSV
> {noformat}
> createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc")
> {noformat}
> Error we got: 
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): 
> java.lang.RuntimeException: 
> file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc
>  is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but 
> found [79, 82, 67, 23]
> {noformat}
> Example 2: Json -> CSV
> {noformat}
> createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV")
> {noformat}
> No exception, but wrong results:
> {noformat}
> +++
> |  c1|  c2|
> +++
> |null|null|
> |null|null|
> |null|null|
> |null|null|
> |   0|str0|
> |   1|str1|
> |   2|str2|
> |   3|str3|
> |   4|str4|
> |   5|str5|
> |   6|str6|
> |   7|str7|
> |   8|str8|
> |   9|str9|
> +++
> {noformat}
> Example 3: Json -> Text
> {noformat}
> createDF(0, 9).write.format("json").saveAsTable("appendJsonToText")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText")
> {noformat}
> Error we got: 
> {noformat}
> Text data source supports only a single column, and you have 2 columns.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15808) Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats

2016-06-07 Thread Xiao Li (JIRA)
Xiao Li created SPARK-15808:
---

 Summary: Wrong Results or Strange Errors In Append-mode DataFrame 
Writing Due to Mismatched File Formats
 Key: SPARK-15808
 URL: https://issues.apache.org/jira/browse/SPARK-15808
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


Example 1: PARQUET -> CSV

{noformat}
createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc")
createDF(10, 
19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc")
{noformat}

Error we got: 
{noformat}
Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most 
recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): 
java.lang.RuntimeException: 
file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc
 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but 
found [79, 82, 67, 23]
{noformat}

Example 2: Json -> CSV

createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV")
createDF(10, 
19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV")

No exception, but wrong results:
{noformat}
+++
|  c1|  c2|
+++
|null|null|
|null|null|
|null|null|
|null|null|
|   0|str0|
|   1|str1|
|   2|str2|
|   3|str3|
|   4|str4|
|   5|str5|
|   6|str6|
|   7|str7|
|   8|str8|
|   9|str9|
+++
{noformat}


Example 3: Json -> Text

createDF(0, 9).write.format("json").saveAsTable("appendJsonToText")
createDF(10, 
19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText")

Error we got: 
{noformat}
Text data source supports only a single column, and you have 2 columns.
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15808) Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats

2016-06-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-15808:

Description: 
Example 1: PARQUET -> CSV

{noformat}
createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc")
createDF(10, 
19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc")
{noformat}

Error we got: 
{noformat}
Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most 
recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): 
java.lang.RuntimeException: 
file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc
 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but 
found [79, 82, 67, 23]
{noformat}

Example 2: Json -> CSV

createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV")
createDF(10, 
19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV")

No exception, but wrong results:
{noformat}
+++
|  c1|  c2|
+++
|null|null|
|null|null|
|null|null|
|null|null|
|   0|str0|
|   1|str1|
|   2|str2|
|   3|str3|
|   4|str4|
|   5|str5|
|   6|str6|
|   7|str7|
|   8|str8|
|   9|str9|
+++
{noformat}


Example 3: Json -> Text
{noformat}
createDF(0, 9).write.format("json").saveAsTable("appendJsonToText")
createDF(10, 
19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText")
{noformat}

Error we got: 
{noformat}
Text data source supports only a single column, and you have 2 columns.
{noformat}


  was:
Example 1: PARQUET -> CSV

{noformat}
createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc")
createDF(10, 
19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc")
{noformat}

Error we got: 
{noformat}
Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most 
recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): 
java.lang.RuntimeException: 
file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc
 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but 
found [79, 82, 67, 23]
{noformat}

Example 2: Json -> CSV

createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV")
createDF(10, 
19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV")

No exception, but wrong results:
{noformat}
+++
|  c1|  c2|
+++
|null|null|
|null|null|
|null|null|
|null|null|
|   0|str0|
|   1|str1|
|   2|str2|
|   3|str3|
|   4|str4|
|   5|str5|
|   6|str6|
|   7|str7|
|   8|str8|
|   9|str9|
+++
{noformat}


Example 3: Json -> Text

createDF(0, 9).write.format("json").saveAsTable("appendJsonToText")
createDF(10, 
19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText")

Error we got: 
{noformat}
Text data source supports only a single column, and you have 2 columns.
{noformat}



> Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to 
> Mismatched File Formats
> ---
>
> Key: SPARK-15808
> URL: https://issues.apache.org/jira/browse/SPARK-15808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Example 1: PARQUET -> ORC
> {noformat}
> createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc")
> {noformat}
> Error we got: 
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): 
> java.lang.RuntimeException: 
> file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc
>  is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but 
> found [79, 82, 67, 23]
> {noformat}
> Example 2: Json -> Parquet
> createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV")
> No exception, but wrong results:
> {noformat}
> +++
> |  c1|  c2|
> +++
> |null|null|
> |null|null|
> |null|null|
> |null|null|
> |   0|str0|
> |   1|str1|
> |   2|str2|
> |   3|str3|
> |   4|str4|
> |   5|str5|
> |   6|str6|
> |   7|str7|
> |   8|str8|
> |   9|str9|
> +++
> {noformat}
> Example 3: Json -> Text
> {noformat}
> createDF(0, 9).write.format("json").saveAsTable("appendJsonToText")
> createDF(10, 

[jira] [Updated] (SPARK-15808) Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats

2016-06-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-15808:

Description: 
Example 1: PARQUET -> ORC

{noformat}
createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc")
createDF(10, 
19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc")
{noformat}

Error we got: 
{noformat}
Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most 
recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): 
java.lang.RuntimeException: 
file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc
 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but 
found [79, 82, 67, 23]
{noformat}

Example 2: Json -> Parquet
{noformat}
createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV")
createDF(10, 
19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV")
{noformat}

No exception, but wrong results:
{noformat}
+++
|  c1|  c2|
+++
|null|null|
|null|null|
|null|null|
|null|null|
|   0|str0|
|   1|str1|
|   2|str2|
|   3|str3|
|   4|str4|
|   5|str5|
|   6|str6|
|   7|str7|
|   8|str8|
|   9|str9|
+++
{noformat}


Example 3: Json -> Text
{noformat}
createDF(0, 9).write.format("json").saveAsTable("appendJsonToText")
createDF(10, 
19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText")
{noformat}

Error we got: 
{noformat}
Text data source supports only a single column, and you have 2 columns.
{noformat}


  was:
Example 1: PARQUET -> ORC

{noformat}
createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc")
createDF(10, 
19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc")
{noformat}

Error we got: 
{noformat}
Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most 
recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): 
java.lang.RuntimeException: 
file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc
 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but 
found [79, 82, 67, 23]
{noformat}

Example 2: Json -> Parquet

createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV")
createDF(10, 
19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV")

No exception, but wrong results:
{noformat}
+++
|  c1|  c2|
+++
|null|null|
|null|null|
|null|null|
|null|null|
|   0|str0|
|   1|str1|
|   2|str2|
|   3|str3|
|   4|str4|
|   5|str5|
|   6|str6|
|   7|str7|
|   8|str8|
|   9|str9|
+++
{noformat}


Example 3: Json -> Text
{noformat}
createDF(0, 9).write.format("json").saveAsTable("appendJsonToText")
createDF(10, 
19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText")
{noformat}

Error we got: 
{noformat}
Text data source supports only a single column, and you have 2 columns.
{noformat}



> Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to 
> Mismatched File Formats
> ---
>
> Key: SPARK-15808
> URL: https://issues.apache.org/jira/browse/SPARK-15808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Example 1: PARQUET -> ORC
> {noformat}
> createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc")
> {noformat}
> Error we got: 
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): 
> java.lang.RuntimeException: 
> file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc
>  is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but 
> found [79, 82, 67, 23]
> {noformat}
> Example 2: Json -> Parquet
> {noformat}
> createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV")
> createDF(10, 
> 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV")
> {noformat}
> No exception, but wrong results:
> {noformat}
> +++
> |  c1|  c2|
> +++
> |null|null|
> |null|null|
> |null|null|
> |null|null|
> |   0|str0|
> |   1|str1|
> |   2|str2|
> |   3|str3|
> |   4|str4|
> |   5|str5|
> |   6|str6|
> |   7|str7|
> |   8|str8|
> |   9|str9|
> +++
> {noformat}
> Example 3: Json -> Text
> {noformat}
> createDF(0, 

[jira] [Updated] (SPARK-15804) Manually added metadata not saving with parquet

2016-06-07 Thread Charlie Evans (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charlie Evans updated SPARK-15804:
--
Description: 
Adding metadata with col().as(_, metadata) and then saving the resulting dataframe 
does not save the metadata. No error is thrown. The schema contains the metadata 
before saving, but no longer contains it after saving and reloading the dataframe. 
This was working fine with 1.6.1.

{code}
case class TestRow(a: String, b: Int)
val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil
val df = spark.createDataFrame(rows)
import org.apache.spark.sql.types.MetadataBuilder
val md = new MetadataBuilder().putString("key", "value").build()
val dfWithMeta = df.select(col("a"), col("b").as("b", md))
println(dfWithMeta.schema.json)
dfWithMeta.write.parquet("dfWithMeta")

val dfWithMeta2 = spark.read.parquet("dfWithMeta")
println(dfWithMeta2.schema.json)
{code}

  was:
Adding metadata with col().as(_, metadata) and then saving the resulting dataframe 
does not save the metadata. No error is thrown. The schema contains the metadata 
before saving, but no longer contains it after saving and reloading the dataframe.

{code}
case class TestRow(a: String, b: Int)
val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil
val df = spark.createDataFrame(rows)
import org.apache.spark.sql.types.MetadataBuilder
val md = new MetadataBuilder().putString("key", "value").build()
val dfWithMeta = df.select(col("a"), col("b").as("b", md))
println(dfWithMeta.schema.json)
dfWithMeta.write.parquet("dfWithMeta")

val dfWithMeta2 = spark.read.parquet("dfWithMeta")
println(dfWithMeta2.schema.json)
{code}


> Manually added metadata not saving with parquet
> ---
>
> Key: SPARK-15804
> URL: https://issues.apache.org/jira/browse/SPARK-15804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>
> Adding metadata with col().as(_, metadata) and then saving the resulting 
> dataframe does not save the metadata. No error is thrown. The schema contains 
> the metadata before saving, but no longer contains it after saving and 
> reloading the dataframe. This was working fine with 1.6.1.
> {code}
> case class TestRow(a: String, b: Int)
> val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil
> val df = spark.createDataFrame(rows)
> import org.apache.spark.sql.types.MetadataBuilder
> val md = new MetadataBuilder().putString("key", "value").build()
> val dfWithMeta = df.select(col("a"), col("b").as("b", md))
> println(dfWithMeta.schema.json)
> dfWithMeta.write.parquet("dfWithMeta")
> val dfWithMeta2 = spark.read.parquet("dfWithMeta")
> println(dfWithMeta2.schema.json)
> {code}
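A quick way to confirm the regression described above is to assert on the column metadata after the round trip. The following is only a sketch; it assumes the repro code from the description has already been run in the same Spark session, so `spark` and the "dfWithMeta" directory exist:

{code}
// Assumes the repro above was run in the same session, so `spark` and the
// "dfWithMeta" parquet directory already exist.
val loaded = spark.read.parquet("dfWithMeta")
val bField = loaded.schema("b")
// On 1.6.1 this passes; the report is that on 2.0.0 the metadata is dropped.
assert(bField.metadata.contains("key") && bField.metadata.getString("key") == "value",
  s"column metadata was lost on write/read: ${bField.metadata.json}")
{code}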



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15785) Add initialModel param to Gaussian Mixture Model (GMM) in spark.ml

2016-06-07 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319100#comment-15319100
 ] 

Gayathri Murali commented on SPARK-15785:
-

I will work on this. Thanks!

> Add initialModel param to Gaussian Mixture Model (GMM) in spark.ml
> --
>
> Key: SPARK-15785
> URL: https://issues.apache.org/jira/browse/SPARK-15785
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this param is needed for SPARK-4591: algorithm/model parity for 
> spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15807) Support varargs for distinct/dropDuplicates in Dataset/DataFrame

2016-06-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15807:


Assignee: Apache Spark

> Support varargs for distinct/dropDuplicates in Dataset/DataFrame
> 
>
> Key: SPARK-15807
> URL: https://issues.apache.org/jira/browse/SPARK-15807
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue adds `varargs`-typed `distinct/dropDuplicates` functions to 
> `Dataset/DataFrame`. Currently, `distinct` does not take arguments, and 
> `dropDuplicates` supports only a `Seq` or `Array`.
> {code}
> scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2)))
> ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
> scala> ds.dropDuplicates(Seq("_1", "_2"))
> res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, 
> _2: int]
> scala> ds.dropDuplicates("_1", "_2")
> <console>:26: error: overloaded method value dropDuplicates with alternatives:
>   (colNames: Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
>   (colNames: Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
>   ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
>  cannot be applied to (String, String)
>        ds.dropDuplicates("_1", "_2")
>           ^
> scala> ds.distinct("_1", "_2")
> <console>:26: error: too many arguments for method distinct: ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
>        ds.distinct("_1", "_2")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15807) Support varargs for distinct/dropDuplicates in Dataset/DataFrame

2016-06-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15807:


Assignee: (was: Apache Spark)

> Support varargs for distinct/dropDuplicates in Dataset/DataFrame
> 
>
> Key: SPARK-15807
> URL: https://issues.apache.org/jira/browse/SPARK-15807
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> This issue adds `varargs`-typed `distinct/dropDuplicates` functions to 
> `Dataset/DataFrame`. Currently, `distinct` does not take arguments, and 
> `dropDuplicates` supports only a `Seq` or `Array`.
> {code}
> scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2)))
> ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
> scala> ds.dropDuplicates(Seq("_1", "_2"))
> res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, 
> _2: int]
> scala> ds.dropDuplicates("_1", "_2")
> <console>:26: error: overloaded method value dropDuplicates with alternatives:
>   (colNames: Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
>   (colNames: Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
>   ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
>  cannot be applied to (String, String)
>        ds.dropDuplicates("_1", "_2")
>           ^
> scala> ds.distinct("_1", "_2")
> <console>:26: error: too many arguments for method distinct: ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
>        ds.distinct("_1", "_2")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15807) Support varargs for distinct/dropDuplicates in Dataset/DataFrame

2016-06-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319090#comment-15319090
 ] 

Apache Spark commented on SPARK-15807:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/13545

> Support varargs for distinct/dropDuplicates in Dataset/DataFrame
> 
>
> Key: SPARK-15807
> URL: https://issues.apache.org/jira/browse/SPARK-15807
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>
> This issue adds `varargs`-typed `distinct/dropDuplicates` functions to 
> `Dataset/DataFrame`. Currently, `distinct` does not take arguments, and 
> `dropDuplicates` supports only a `Seq` or `Array`.
> {code}
> scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2)))
> ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
> scala> ds.dropDuplicates(Seq("_1", "_2"))
> res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, 
> _2: int]
> scala> ds.dropDuplicates("_1", "_2")
> <console>:26: error: overloaded method value dropDuplicates with alternatives:
>   (colNames: Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
>   (colNames: Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
>   ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
>  cannot be applied to (String, String)
>        ds.dropDuplicates("_1", "_2")
>           ^
> scala> ds.distinct("_1", "_2")
> <console>:26: error: too many arguments for method distinct: ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
>        ds.distinct("_1", "_2")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15807) Support varargs for distinct/dropDuplicates in Dataset/DataFrame

2016-06-07 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-15807:
-

 Summary: Support varargs for distinct/dropDuplicates in 
Dataset/DataFrame
 Key: SPARK-15807
 URL: https://issues.apache.org/jira/browse/SPARK-15807
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Dongjoon Hyun


This issue adds `varargs`-typed `distinct/dropDuplicates` functions to 
`Dataset/DataFrame`. Currently, `distinct` does not take arguments, and 
`dropDuplicates` supports only a `Seq` or `Array`.

{code}
scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2)))
ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int]

scala> ds.dropDuplicates(Seq("_1", "_2"))
res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, _2: 
int]

scala> ds.dropDuplicates("_1", "_2")
<console>:26: error: overloaded method value dropDuplicates with alternatives:
  (colNames: Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  (colNames: Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and>
  ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 cannot be applied to (String, String)
       ds.dropDuplicates("_1", "_2")
          ^

scala> ds.distinct("_1", "_2")
<console>:26: error: too many arguments for method distinct: ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
       ds.distinct("_1", "_2")
{code}
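For illustration only, here is a self-contained sketch of the call pattern being proposed, written as an external helper on top of the existing `Seq`-based `dropDuplicates`. The helper name `dropDups` and the local-mode setup are assumptions for the sketch, not the proposed API:

{code}
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropDuplicatesVarargsSketch {
  // Hypothetical helper that mimics the proposed varargs overload by
  // delegating to the existing Seq-based dropDuplicates.
  def dropDups(df: DataFrame, col1: String, cols: String*): DataFrame =
    df.dropDuplicates(col1 +: cols)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("varargs-sketch").getOrCreate()
    val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2)))
    dropDups(ds, "_1", "_2").show()   // de-duplicates on both columns
    spark.stop()
  }
}
{code}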



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15804) Manually added metadata not saving with parquet

2016-06-07 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318897#comment-15318897
 ] 

Takeshi Yamamuro commented on SPARK-15804:
--

`MetadataBuilder` is one of the developer APIs, so is this functionality useful for 
developers? Is there a concrete scenario where it is needed?
Anyway, this is related not only to `parquet` but also to other formats such as 
orc, csv, json...

> Manually added metadata not saving with parquet
> ---
>
> Key: SPARK-15804
> URL: https://issues.apache.org/jira/browse/SPARK-15804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>
> Adding metadata with col().as(_, metadata) and then saving the resulting 
> dataframe does not save the metadata. No error is thrown. The schema contains 
> the metadata before saving, but no longer contains it after saving and 
> reloading the dataframe.
> {code}
> case class TestRow(a: String, b: Int)
> val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil
> val df = spark.createDataFrame(rows)
> import org.apache.spark.sql.types.MetadataBuilder
> val md = new MetadataBuilder().putString("key", "value").build()
> val dfWithMeta = df.select(col("a"), col("b").as("b", md))
> println(dfWithMeta.schema.json)
> dfWithMeta.write.parquet("dfWithMeta")
> val dfWithMeta2 = spark.read.parquet("dfWithMeta")
> println(dfWithMeta2.schema.json)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15760) Documentation missing for package-related config options

2016-06-07 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-15760.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.0.0

> Documentation missing for package-related config options
> 
>
> Key: SPARK-15760
> URL: https://issues.apache.org/jira/browse/SPARK-15760
> Project: Spark
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.0.0
>
>
> There's no documentation about the config options that correlate to the 
> "--packages" (and friends) arguments of spark-submit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15684) Not mask startsWith and endsWith in R

2016-06-07 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-15684.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13476
[https://github.com/apache/spark/pull/13476]

> Not mask startsWith and endsWith in R
> -
>
> Key: SPARK-15684
> URL: https://issues.apache.org/jira/browse/SPARK-15684
> Project: Spark
>  Issue Type: Improvement
>Reporter: Miao Wang
> Fix For: 2.0.0
>
>
> R 3.3.0 has startsWith and endsWith. We should not mask these two 
> methods in Spark. Actually, SparkR has startsWith and endsWith working for 
> columns, but making them work for both columns and strings is not easy. I 
> created this JIRA for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15684) Not mask startsWith and endsWith in R

2016-06-07 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-15684:
--
Assignee: Miao Wang

> Not mask startsWith and endsWith in R
> -
>
> Key: SPARK-15684
> URL: https://issues.apache.org/jira/browse/SPARK-15684
> Project: Spark
>  Issue Type: Improvement
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.0.0
>
>
> R 3.3.0 has startsWith and endsWith. We should not mask these two 
> methods in Spark. Actually, SparkR has startsWith and endsWith working for 
> columns, but making them work for both columns and strings is not easy. I 
> created this JIRA for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2016-06-07 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318769#comment-15318769
 ] 

Shivaram Venkataraman commented on SPARK-15799:
---

I don't think there are any license issues, and at least before we merged SparkR 
into Apache the package passed all the CRAN checks. The only problem is that we 
might need to ship the entire Spark assembly JAR (or all the jars that we have 
with the new release structure) to make the package work without additional 
downloads. Some other minor things might make it challenging to use SparkR 
directly from CRAN:
1. Matching client and cluster versions of Spark. This is still a requirement 
today, but the main difference is that people might upgrade CRAN packages 
separately from their Spark clusters.
2. Figuring out where to put scripts like spark-submit that can be used to 
submit batch jobs. This isn't something normal R packages offer, so I'm not sure 
there are existing practices we can follow here.

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15805) update the whole sql programming guide

2016-06-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15805:


Assignee: (was: Apache Spark)

> update the whole sql programming guide
> --
>
> Key: SPARK-15805
> URL: https://issues.apache.org/jira/browse/SPARK-15805
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The SQL programming guide of Spark is out of date in many places. For example, it
> should use `SparkSession` instead of `SQLContext`,
> should use `SparkSession.builder.enableHiveSupport` instead of `HiveContext`,
> should use `dataFrame.write.saveAsTable` instead of `dataFrame.saveAsTable`,
> should use `sparkSession.catalog.cacheTable/uncacheTable` instead of 
> `SQLContext.cacheTable/uncacheTable`,
> and so on...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15805) update the whole sql programming guide

2016-06-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318766#comment-15318766
 ] 

Apache Spark commented on SPARK-15805:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/13544

> update the whole sql programming guide
> --
>
> Key: SPARK-15805
> URL: https://issues.apache.org/jira/browse/SPARK-15805
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The SQL programming guide of Spark is out of date in many places. For example, it
> should use `SparkSession` instead of `SQLContext`,
> should use `SparkSession.builder.enableHiveSupport` instead of `HiveContext`,
> should use `dataFrame.write.saveAsTable` instead of `dataFrame.saveAsTable`,
> should use `sparkSession.catalog.cacheTable/uncacheTable` instead of 
> `SQLContext.cacheTable/uncacheTable`,
> and so on...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15805) update the whole sql programming guide

2016-06-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15805:


Assignee: Apache Spark

> update the whole sql programming guide
> --
>
> Key: SPARK-15805
> URL: https://issues.apache.org/jira/browse/SPARK-15805
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The SQL programming guide of Spark is out of date in many places. For example, it
> should use `SparkSession` instead of `SQLContext`,
> should use `SparkSession.builder.enableHiveSupport` instead of `HiveContext`,
> should use `dataFrame.write.saveAsTable` instead of `dataFrame.saveAsTable`,
> should use `sparkSession.catalog.cacheTable/uncacheTable` instead of 
> `SQLContext.cacheTable/uncacheTable`,
> and so on...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15806) Update doc for SPARK_MASTER_IP

2016-06-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15806:


Assignee: (was: Apache Spark)

> Update doc for SPARK_MASTER_IP
> --
>
> Key: SPARK-15806
> URL: https://issues.apache.org/jira/browse/SPARK-15806
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Bo Meng
>Priority: Minor
>
> SPARK_MASTER_IP is a deprecated environment variable. It is replaced by 
> SPARK_MASTER_HOST according to MasterArguments.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15806) Update doc for SPARK_MASTER_IP

2016-06-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318761#comment-15318761
 ] 

Apache Spark commented on SPARK-15806:
--

User 'bomeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/13543

> Update doc for SPARK_MASTER_IP
> --
>
> Key: SPARK-15806
> URL: https://issues.apache.org/jira/browse/SPARK-15806
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Bo Meng
>Priority: Minor
>
> SPARK_MASTER_IP is a deprecated environment variable. It is replaced by 
> SPARK_MASTER_HOST according to MasterArguments.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15806) Update doc for SPARK_MASTER_IP

2016-06-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15806:


Assignee: Apache Spark

> Update doc for SPARK_MASTER_IP
> --
>
> Key: SPARK-15806
> URL: https://issues.apache.org/jira/browse/SPARK-15806
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Bo Meng
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK_MASTER_IP is a deprecated environment variable. It is replaced by 
> SPARK_MASTER_HOST according to MasterArguments.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15806) Update doc for SPARK_MASTER_IP

2016-06-07 Thread Bo Meng (JIRA)
Bo Meng created SPARK-15806:
---

 Summary: Update doc for SPARK_MASTER_IP
 Key: SPARK-15806
 URL: https://issues.apache.org/jira/browse/SPARK-15806
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Bo Meng
Priority: Minor


SPARK_MASTER_IP is a deprecated environment variable. It is replaced by 
SPARK_MASTER_HOST according to MasterArguments.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15805) update the whole sql programming guide

2016-06-07 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-15805:
--

 Summary: update the whole sql programming guide
 Key: SPARK-15805
 URL: https://issues.apache.org/jira/browse/SPARK-15805
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 2.0.0
Reporter: Weichen Xu


The SQL programming guide of Spark is out of date in many places. For example, it

should use `SparkSession` instead of `SQLContext`
should use `SparkSession.builder.enableHiveSupport` instead of `HiveContext`
should use `dataFrame.write.saveAsTable` instead of `dataFrame.saveAsTable`
should use `sparkSession.catalog.cacheTable/uncacheTable` instead of 
`SQLContext.cacheTable/uncacheTable`

and so on...
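
For illustration, a minimal sketch of the 2.0-style entry point the updated guide would use instead of `SQLContext`/`HiveContext`; the app name, table name, and input path below are placeholders, not text from the guide:

{code}
import org.apache.spark.sql.SparkSession

// 2.0-style entry point that replaces SQLContext / HiveContext.
val spark = SparkSession.builder()
  .appName("sql-guide-example")
  .enableHiveSupport()
  .getOrCreate()

// The path below is the sample file shipped with Spark; any JSON file works.
val df = spark.read.json("examples/src/main/resources/people.json")

df.write.saveAsTable("people")         // instead of df.saveAsTable("people")
spark.catalog.cacheTable("people")     // instead of sqlContext.cacheTable("people")
spark.catalog.uncacheTable("people")
spark.stop()
{code}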




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15801) spark-submit --num-executors switch also works without YARN

2016-06-07 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318721#comment-15318721
 ] 

Marcelo Vanzin commented on SPARK-15801:


I'm not really sure of how standalone works these days after all the changes 
for dynamic allocation. [~andrewor14] might be a better person to ask.

> spark-submit --num-executors switch also works without YARN
> ---
>
> Key: SPARK-15801
> URL: https://issues.apache.org/jira/browse/SPARK-15801
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> Based on this [issue|https://issues.apache.org/jira/browse/SPARK-15781] 
> regarding the SPARK_WORKER_INSTANCES property, I also found that the 
> {{--num-executors}} switch documented in the spark-submit help is partially 
> incorrect. 
> Here's one part of the output (produced by {{spark-submit --help}}): 
> {code}
> YARN-only:
>   --driver-cores NUM  Number of cores used by the driver, only in 
> cluster mode
>   (Default: 1).
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --num-executors NUM Number of executors to launch (Default: 2).
> {code}
> Correct me if I am wrong, but the num-executors switch also works in Spark 
> standalone mode *without YARN*.
> I tried this by launching only a master and a worker with 4 executors specified, 
> and they were all successfully spawned. The --master switch pointed to the 
> master's URL, not to the yarn value. 
> Here's the exact command: {{spark-submit --master spark://[local 
> machine]:7077 --num-executors 4 --executor-cores 2}}
> By default there is *1* executor per worker in Spark standalone mode without 
> YARN, but this option makes it possible to specify the number of executors (per 
> worker?) if, and only if, the {{--executor-cores}} switch is also set. I believe 
> it defaults to 2 in YARN mode. 
> I would propose moving the option from the *YARN-only* section to the *Spark 
> standalone and YARN only* section.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15652) Missing org.apache.spark.launcher.SparkAppHandle.Listener notification if SparkSubmit JVM shutsdown

2016-06-07 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318711#comment-15318711
 ] 

Marcelo Vanzin commented on SPARK-15652:


I'm a little worried about that because it touches a public API, even though 
it's just adding something that shouldn't cause issues. I also haven't seen 
much activity towards a new 1.6 point release... let me think about it.

> Missing org.apache.spark.launcher.SparkAppHandle.Listener notification if 
> SparkSubmit JVM shutsdown
> ---
>
> Key: SPARK-15652
> URL: https://issues.apache.org/jira/browse/SPARK-15652
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Subroto Sanyal
>Assignee: Subroto Sanyal
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: SPARK-15652-1.patch, spark-launcher-client-hang.jar
>
>
> h6. Problem
> If the SparkSubmit JVM goes down even before sending the job-complete 
> notification, the _org.apache.spark.launcher.SparkAppHandle.Listener_ will 
> not receive any notification, which may leave the client using SparkLauncher 
> hanging indefinitely.
> h6. Root Cause
> There is no proper exception handling in 
> org.apache.spark.launcher.LauncherConnection#run when an EOFException is 
> encountered while reading over the socket stream. Typically an EOFException will 
> be thrown at the suggested 
> point(_org.apache.spark.launcher.LauncherConnection.run(LauncherConnection.java:58)_)
>  if the SparkSubmit JVM is shut down. 
> It was probably assumed that the SparkSubmit JVM can shut down only after a 
> normal, healthy completion, but there are scenarios where this is not the case:
> # The OS kills the SparkSubmit process via the OOM killer.
> # An exception occurs while SparkSubmit submits the job, even before it starts 
> monitoring the application. This can happen if SparkLauncher is not 
> configured properly. There might be no exception handling in 
> org.apache.spark.deploy.yarn.Client#submitApplication(), so any 
> exception/throwable at this point can shut down the JVM without 
> proper finalisation.
> h6. Possible Solutions
> # In case of an EOFException or any other exception, notify the listeners that 
> the job has failed
> # Better exception handling on the SparkSubmit JVM side (though this may not 
> resolve the problem completely)
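
For context, a minimal sketch of the kind of launcher client described above; the jar path, main class, and master value are placeholders. If the SparkSubmit JVM dies before reporting a final state, the `await()` call below never returns, which is the reported hang:

{code}
import java.util.concurrent.CountDownLatch
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LauncherClientSketch {
  def main(args: Array[String]): Unit = {
    val done = new CountDownLatch(1)
    new SparkLauncher()
      .setAppResource("/path/to/app.jar")   // placeholder
      .setMainClass("com.example.Main")     // placeholder
      .setMaster("yarn-client")             // placeholder
      .startApplication(new SparkAppHandle.Listener {
        // Release the latch once the application reaches a terminal state.
        override def stateChanged(handle: SparkAppHandle): Unit =
          if (handle.getState.isFinal) done.countDown()
        override def infoChanged(handle: SparkAppHandle): Unit = ()
      })
    // Blocks forever if no terminal-state notification ever arrives.
    done.await()
  }
}
{code}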



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15755) java.lang.NullPointerException when run spark 2.0 setting spark.serializer=org.apache.spark.serializer.KryoSerializer

2016-06-07 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318692#comment-15318692
 ] 

Bo Meng commented on SPARK-15755:
-

Could you provide a test case to reproduce the issue?

> java.lang.NullPointerException when run spark 2.0 setting 
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> -
>
> Key: SPARK-15755
> URL: https://issues.apache.org/jira/browse/SPARK-15755
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: marymwu
>
> java.lang.NullPointerException when run spark 2.0 setting 
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:627)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:329)
>   at java.util.PriorityQueue.add(PriorityQueue.java:306)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:711)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
>   ... 15 more
> 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
>   at 
> 

[jira] [Commented] (SPARK-15652) Missing org.apache.spark.launcher.SparkAppHandle.Listener notification if SparkSubmit JVM shutsdown

2016-06-07 Thread Subroto Sanyal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318683#comment-15318683
 ] 

Subroto Sanyal commented on SPARK-15652:


hi [~vanzin]
Can this be merged to the 1.6 branch?

> Missing org.apache.spark.launcher.SparkAppHandle.Listener notification if 
> SparkSubmit JVM shutsdown
> ---
>
> Key: SPARK-15652
> URL: https://issues.apache.org/jira/browse/SPARK-15652
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Subroto Sanyal
>Assignee: Subroto Sanyal
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: SPARK-15652-1.patch, spark-launcher-client-hang.jar
>
>
> h6. Problem
> If the SparkSubmit JVM goes down even before sending the job-complete 
> notification, the _org.apache.spark.launcher.SparkAppHandle.Listener_ will 
> not receive any notification, which may leave the client using SparkLauncher 
> hanging indefinitely.
> h6. Root Cause
> There is no proper exception handling in 
> org.apache.spark.launcher.LauncherConnection#run when an EOFException is 
> encountered while reading over the socket stream. Typically an EOFException will 
> be thrown at the suggested 
> point(_org.apache.spark.launcher.LauncherConnection.run(LauncherConnection.java:58)_)
>  if the SparkSubmit JVM is shut down. 
> It was probably assumed that the SparkSubmit JVM can shut down only after a 
> normal, healthy completion, but there are scenarios where this is not the case:
> # The OS kills the SparkSubmit process via the OOM killer.
> # An exception occurs while SparkSubmit submits the job, even before it starts 
> monitoring the application. This can happen if SparkLauncher is not 
> configured properly. There might be no exception handling in 
> org.apache.spark.deploy.yarn.Client#submitApplication(), so any 
> exception/throwable at this point can shut down the JVM without 
> proper finalisation.
> h6. Possible Solutions
> # In case of an EOFException or any other exception, notify the listeners that 
> the job has failed
> # Better exception handling on the SparkSubmit JVM side (though this may not 
> resolve the problem completely)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15730) [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take effect in spark-sql session

2016-06-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15730:


Assignee: (was: Apache Spark)

> [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take 
> effect in spark-sql session
> -
>
> Key: SPARK-15730
> URL: https://issues.apache.org/jira/browse/SPARK-15730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> /usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g 
> --executor-cores 5 --num-executors 31 --master yarn-client --conf 
> spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01
> spark-sql> use test;
> 16/06/02 21:36:15 INFO execution.SparkSqlParser: Parsing command: use test
> 16/06/02 21:36:15 INFO spark.SparkContext: Starting job: processCmd at 
> CliDriver.java:376
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Got job 2 (processCmd at 
> CliDriver.java:376) with 1 output partitions
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 
> (processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Missing parents: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting ResultStage 2 
> (MapPartitionsRDD[8] at processCmd at CliDriver.java:376), which has no 
> missing parents
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2 stored as values 
> in memory (estimated size 3.2 KB, free 2.4 GB)
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as 
> bytes in memory (estimated size 1964.0 B, free 2.4 GB)
> 16/06/02 21:36:15 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on 192.168.3.11:36189 (size: 1964.0 B, free: 2.4 GB)
> 16/06/02 21:36:15 INFO spark.SparkContext: Created broadcast 2 from broadcast 
> at DAGScheduler.scala:1012
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting 1 missing tasks 
> from ResultStage 2 (MapPartitionsRDD[8] at processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO cluster.YarnScheduler: Adding task set 2.0 with 1 tasks
> 16/06/02 21:36:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 
> 2.0 (TID 2, 192.168.3.13, partition 0, PROCESS_LOCAL, 5362 bytes)
> 16/06/02 21:36:15 INFO cluster.YarnClientSchedulerBackend: Launching task 2 
> on executor id: 10 hostname: 192.168.3.13.
> 16/06/02 21:36:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on hw-node3:45924 (size: 1964.0 B, free: 4.4 GB)
> 16/06/02 21:36:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 
> 2.0 (TID 2) in 1934 ms on 192.168.3.13 (1/1)
> 16/06/02 21:36:17 INFO cluster.YarnScheduler: Removed TaskSet 2.0, whose 
> tasks have all completed, from pool
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: ResultStage 2 (processCmd at 
> CliDriver.java:376) finished in 1.937 s
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: Job 2 finished: processCmd at 
> CliDriver.java:376, took 1.962631 s
> Time taken: 2.027 seconds
> 16/06/02 21:36:17 INFO CliDriver: Time taken: 2.027 seconds
> spark-sql> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
> 16/06/02 21:36:36 INFO execution.SparkSqlParser: Parsing command: DROP TABLE 
> IF EXISTS ${hiveconf:RESULT_TABLE}
> Error in query:
> mismatched input '$' expecting {'ADD', 'AS', 'ALL', 'GROUP', 'BY', 
> 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'LIMIT', 'AT', 'IN', 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'OUTER', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 
> 'RANGE', 'ROWS', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 
> 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 
> 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'TO', 
> 'TABLESAMPLE', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 
> 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'IF', 
> 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 
> 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 
> 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 
> 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'EXTENDED', 
> 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 
> 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 
> 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 
> 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 

[jira] [Commented] (SPARK-15730) [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take effect in spark-sql session

2016-06-07 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318654#comment-15318654
 ] 

Cheng Hao commented on SPARK-15730:
---

[~jameszhouyi], can you please verify this fix?

> [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take 
> effect in spark-sql session
> -
>
> Key: SPARK-15730
> URL: https://issues.apache.org/jira/browse/SPARK-15730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> /usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g 
> --executor-cores 5 --num-executors 31 --master yarn-client --conf 
> spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01
> spark-sql> use test;
> 16/06/02 21:36:15 INFO execution.SparkSqlParser: Parsing command: use test
> 16/06/02 21:36:15 INFO spark.SparkContext: Starting job: processCmd at 
> CliDriver.java:376
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Got job 2 (processCmd at 
> CliDriver.java:376) with 1 output partitions
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 
> (processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Missing parents: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting ResultStage 2 
> (MapPartitionsRDD[8] at processCmd at CliDriver.java:376), which has no 
> missing parents
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2 stored as values 
> in memory (estimated size 3.2 KB, free 2.4 GB)
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as 
> bytes in memory (estimated size 1964.0 B, free 2.4 GB)
> 16/06/02 21:36:15 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on 192.168.3.11:36189 (size: 1964.0 B, free: 2.4 GB)
> 16/06/02 21:36:15 INFO spark.SparkContext: Created broadcast 2 from broadcast 
> at DAGScheduler.scala:1012
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting 1 missing tasks 
> from ResultStage 2 (MapPartitionsRDD[8] at processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO cluster.YarnScheduler: Adding task set 2.0 with 1 tasks
> 16/06/02 21:36:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 
> 2.0 (TID 2, 192.168.3.13, partition 0, PROCESS_LOCAL, 5362 bytes)
> 16/06/02 21:36:15 INFO cluster.YarnClientSchedulerBackend: Launching task 2 
> on executor id: 10 hostname: 192.168.3.13.
> 16/06/02 21:36:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on hw-node3:45924 (size: 1964.0 B, free: 4.4 GB)
> 16/06/02 21:36:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 
> 2.0 (TID 2) in 1934 ms on 192.168.3.13 (1/1)
> 16/06/02 21:36:17 INFO cluster.YarnScheduler: Removed TaskSet 2.0, whose 
> tasks have all completed, from pool
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: ResultStage 2 (processCmd at 
> CliDriver.java:376) finished in 1.937 s
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: Job 2 finished: processCmd at 
> CliDriver.java:376, took 1.962631 s
> Time taken: 2.027 seconds
> 16/06/02 21:36:17 INFO CliDriver: Time taken: 2.027 seconds
> spark-sql> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
> 16/06/02 21:36:36 INFO execution.SparkSqlParser: Parsing command: DROP TABLE 
> IF EXISTS ${hiveconf:RESULT_TABLE}
> Error in query:
> mismatched input '$' expecting {'ADD', 'AS', 'ALL', 'GROUP', 'BY', 
> 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'LIMIT', 'AT', 'IN', 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'OUTER', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 
> 'RANGE', 'ROWS', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 
> 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 
> 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'TO', 
> 'TABLESAMPLE', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 
> 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'IF', 
> 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 
> 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 
> 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 
> 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'EXTENDED', 
> 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 
> 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 
> 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 
> 'FILEFORMAT', 'TOUCH', 

[jira] [Commented] (SPARK-15730) [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take effect in spark-sql session

2016-06-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318648#comment-15318648
 ] 

Apache Spark commented on SPARK-15730:
--

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/13542

> [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take 
> effect in spark-sql session
> -
>
> Key: SPARK-15730
> URL: https://issues.apache.org/jira/browse/SPARK-15730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> /usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g 
> --executor-cores 5 --num-executors 31 --master yarn-client --conf 
> spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01
> spark-sql> use test;
> 16/06/02 21:36:15 INFO execution.SparkSqlParser: Parsing command: use test
> 16/06/02 21:36:15 INFO spark.SparkContext: Starting job: processCmd at 
> CliDriver.java:376
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Got job 2 (processCmd at 
> CliDriver.java:376) with 1 output partitions
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 
> (processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Missing parents: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting ResultStage 2 
> (MapPartitionsRDD[8] at processCmd at CliDriver.java:376), which has no 
> missing parents
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2 stored as values 
> in memory (estimated size 3.2 KB, free 2.4 GB)
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as 
> bytes in memory (estimated size 1964.0 B, free 2.4 GB)
> 16/06/02 21:36:15 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on 192.168.3.11:36189 (size: 1964.0 B, free: 2.4 GB)
> 16/06/02 21:36:15 INFO spark.SparkContext: Created broadcast 2 from broadcast 
> at DAGScheduler.scala:1012
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting 1 missing tasks 
> from ResultStage 2 (MapPartitionsRDD[8] at processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO cluster.YarnScheduler: Adding task set 2.0 with 1 tasks
> 16/06/02 21:36:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 
> 2.0 (TID 2, 192.168.3.13, partition 0, PROCESS_LOCAL, 5362 bytes)
> 16/06/02 21:36:15 INFO cluster.YarnClientSchedulerBackend: Launching task 2 
> on executor id: 10 hostname: 192.168.3.13.
> 16/06/02 21:36:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on hw-node3:45924 (size: 1964.0 B, free: 4.4 GB)
> 16/06/02 21:36:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 
> 2.0 (TID 2) in 1934 ms on 192.168.3.13 (1/1)
> 16/06/02 21:36:17 INFO cluster.YarnScheduler: Removed TaskSet 2.0, whose 
> tasks have all completed, from pool
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: ResultStage 2 (processCmd at 
> CliDriver.java:376) finished in 1.937 s
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: Job 2 finished: processCmd at 
> CliDriver.java:376, took 1.962631 s
> Time taken: 2.027 seconds
> 16/06/02 21:36:17 INFO CliDriver: Time taken: 2.027 seconds
> spark-sql> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
> 16/06/02 21:36:36 INFO execution.SparkSqlParser: Parsing command: DROP TABLE 
> IF EXISTS ${hiveconf:RESULT_TABLE}
> Error in query:
> mismatched input '$' expecting {'ADD', 'AS', 'ALL', 'GROUP', 'BY', 
> 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'LIMIT', 'AT', 'IN', 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'OUTER', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 
> 'RANGE', 'ROWS', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 
> 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 
> 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'TO', 
> 'TABLESAMPLE', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 
> 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'IF', 
> 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 
> 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 
> 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 
> 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'EXTENDED', 
> 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 
> 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 
> 'STORED', 'DIRECTORIES', 'LOCATION', 

[jira] [Assigned] (SPARK-15730) [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take effect in spark-sql session

2016-06-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15730:


Assignee: Apache Spark

> [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take 
> effect in spark-sql session
> -
>
> Key: SPARK-15730
> URL: https://issues.apache.org/jira/browse/SPARK-15730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Assignee: Apache Spark
>Priority: Critical
>
> /usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g 
> --executor-cores 5 --num-executors 31 --master yarn-client --conf 
> spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01
> spark-sql> use test;
> 16/06/02 21:36:15 INFO execution.SparkSqlParser: Parsing command: use test
> 16/06/02 21:36:15 INFO spark.SparkContext: Starting job: processCmd at 
> CliDriver.java:376
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Got job 2 (processCmd at 
> CliDriver.java:376) with 1 output partitions
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 
> (processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Missing parents: List()
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting ResultStage 2 
> (MapPartitionsRDD[8] at processCmd at CliDriver.java:376), which has no 
> missing parents
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2 stored as values 
> in memory (estimated size 3.2 KB, free 2.4 GB)
> 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as 
> bytes in memory (estimated size 1964.0 B, free 2.4 GB)
> 16/06/02 21:36:15 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on 192.168.3.11:36189 (size: 1964.0 B, free: 2.4 GB)
> 16/06/02 21:36:15 INFO spark.SparkContext: Created broadcast 2 from broadcast 
> at DAGScheduler.scala:1012
> 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting 1 missing tasks 
> from ResultStage 2 (MapPartitionsRDD[8] at processCmd at CliDriver.java:376)
> 16/06/02 21:36:15 INFO cluster.YarnScheduler: Adding task set 2.0 with 1 tasks
> 16/06/02 21:36:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 
> 2.0 (TID 2, 192.168.3.13, partition 0, PROCESS_LOCAL, 5362 bytes)
> 16/06/02 21:36:15 INFO cluster.YarnClientSchedulerBackend: Launching task 2 
> on executor id: 10 hostname: 192.168.3.13.
> 16/06/02 21:36:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on hw-node3:45924 (size: 1964.0 B, free: 4.4 GB)
> 16/06/02 21:36:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 
> 2.0 (TID 2) in 1934 ms on 192.168.3.13 (1/1)
> 16/06/02 21:36:17 INFO cluster.YarnScheduler: Removed TaskSet 2.0, whose 
> tasks have all completed, from pool
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: ResultStage 2 (processCmd at 
> CliDriver.java:376) finished in 1.937 s
> 16/06/02 21:36:17 INFO scheduler.DAGScheduler: Job 2 finished: processCmd at 
> CliDriver.java:376, took 1.962631 s
> Time taken: 2.027 seconds
> 16/06/02 21:36:17 INFO CliDriver: Time taken: 2.027 seconds
> spark-sql> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
> 16/06/02 21:36:36 INFO execution.SparkSqlParser: Parsing command: DROP TABLE 
> IF EXISTS ${hiveconf:RESULT_TABLE}
> Error in query:
> mismatched input '$' expecting {'ADD', 'AS', 'ALL', 'GROUP', 'BY', 
> 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'LIMIT', 'AT', 'IN', 'NO', 
> 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 
> 'ASC', 'DESC', 'FOR', 'OUTER', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 
> 'RANGE', 'ROWS', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 
> 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 
> 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'SHOW', 'TABLES', 
> 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'TO', 
> 'TABLESAMPLE', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 
> 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'IF', 
> 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 
> 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 
> 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 
> 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'EXTENDED', 
> 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 
> 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 
> 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 
> 'FILEFORMAT', 'TOUCH', 'COMPACT', 

[jira] [Resolved] (SPARK-13570) pyspark save with partitionBy is very slow

2016-06-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13570.
---
Resolution: Incomplete

> pyspark save with partitionBy is very slow
> --
>
> Key: SPARK-13570
> URL: https://issues.apache.org/jira/browse/SPARK-13570
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Shubhanshu Mishra
>  Labels: dataframe, partitioning, pyspark, save
>
> Running the following code to store data from each year and pos in a separate 
> folder for a very large dataframe is taking a huge amount of time. (>37 hours 
> for 60% of the work)
> {code}
> ## IPYTHON was started using the following command: 
> # IPYTHON=1 "$SPARK_HOME/bin/pyspark" --driver-memory 50g 
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SQLContext, Row
> from pyspark.sql.types import *
> conf = SparkConf()
> conf.setMaster("local[30]")
> conf.setAppName("analysis")
> conf.set("spark.local.dir", "./tmp")
> conf.set("spark.executor.memory", "50g")
> conf.set("spark.driver.maxResultSize", "5g")
> sc = SparkContext(conf=conf)
> sqlContext = SQLContext(sc)
> df = sqlContext.read.format("csv").options(header=False, inferschema=True, 
> delimiter="\t").load("out/new_features")
> df = df.selectExpr(*("%s as %s" % (df.columns[i], k) for i,k in 
> enumerate(columns)))
> # year can take values from [1902,2015]
> # pos takes integer values from [-1,0,1,2]
> # df is a dataframe with 20 columns and 1 billion rows
> # Running on  Machine with 32 cores and 500 GB RAM
> df.write.save("out/model_input_partitioned", format="csv", 
> partitionBy=["year", "pos"], delimiter="\t")
> {code}
> Currently, the code is at: 
> [Stage 12:==>(1367 + 30) / 
> 2290]
> And it has already been more than 37 hours. A single sweep on this data for 
> filter by value takes less than 6.5 minutes. 
> The spark web interface shows the following lines for the 2 stages of the job:
> ||Stage||Description||Submitted||Duration||Tasks: succeeded/total||Input||Output||Shuffle Read||Shuffle Write||
> |11|load at NativeMethodAccessorImpl.java:-2|2016/02/27 23:07:04|6.5 min|2290/2290|66.8 GB| | | |
> |12|save at NativeMethodAccessorImpl.java:-2|2016/02/27 23:15:59|37.1 h|1370/2290|40.9 GB| | | |
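
Not from this ticket, but a commonly suggested mitigation for slow partitioned 
writes is to repartition on the partition columns first, so each task writes to 
only a few (year, pos) directories instead of keeping a writer open for every 
combination it sees. A hedged Scala sketch under that assumption (Spark 2.x 
DataFrame API, with toy data standing in for the reporter's DataFrame):
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PartitionedWriteSketch {
  def main(args: Array[String]): Unit = {
    // Local master only so the sketch runs standalone.
    val spark = SparkSession.builder()
      .appName("partitioned-write-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for the reporter's DataFrame; column names follow the report.
    val df = Seq((1902, -1, "a"), (2015, 2, "b")).toDF("year", "pos", "value")

    // Repartition on the partition columns so each task writes to only a few
    // (year, pos) directories instead of one writer per combination per task.
    df.repartition(col("year"), col("pos"))
      .write
      .option("sep", "\t")
      .partitionBy("year", "pos")
      .csv("out/model_input_partitioned")

    spark.stop()
  }
}
{code}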



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching

2016-06-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318558#comment-15318558
 ] 

Sean Owen commented on SPARK-15796:
---

I'm not sure what you mean about storing RDDs that don't fit in memory, but 
that's perfectly fine. I am suggesting that it's not surprising that you need 
to do some tuning to use nearly all the heap, since GC time increases a lot as 
you get close to this limit and the JVM needs some extra help to work 
efficiently. This is what it boils down to: the settings are causing Spark to 
misuse the new generation, and it's expensive to keep GCing the long-lived 
objects there that never die.

But this isn't an exotic use case and really ought not to happen out of the 
box. I agree that it doesn't make sense to let Spark cache (inherently 
long-lived objects) more memory than is available in the old gen (the place for 
long-lived objects that don't need much GC attention). I think the resolution 
is to change the defaults accordingly.
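
For reference, a sketch of the workaround discussed in this thread, applied 
programmatically rather than via spark-submit flags (the 0.6 value is the one 
proposed here, chosen to stay below the old generation's ~0.66 share of the 
heap with the default NewRatio=2):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object GcFriendlyCacheConf {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("gc-friendly-cache")
      .setMaster("local[2]")               // local master only for this sketch
      .set("spark.memory.fraction", "0.6") // value proposed in this thread
    val sc = new SparkContext(conf)

    // The same kind of cache-heavy workload as the repro in the description.
    val cached = sc.parallelize(1 to 1000000, 100).map(x => (x.toLong, x.toLong))
    cached.persist(StorageLevel.MEMORY_ONLY)
    cached.foreach(_ => ())

    sc.stop()
  }
}
{code}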

> Spark 1.6 default memory settings can cause heavy GC when caching
> -
>
> Key: SPARK-15796
> URL: https://issues.apache.org/jira/browse/SPARK-15796
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gabor Feher
>Priority: Minor
>
> While debugging performance issues in a Spark program, I've found a simple 
> way to slow down Spark 1.6 significantly by filling the RDD memory cache. 
> This seems to be a regression, because setting 
> "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is 
> just a simple program that fills the memory cache of Spark using a 
> MEMORY_ONLY cached RDD (but of course this comes up in more complex 
> situations, too):
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.storage.StorageLevel
> object CacheDemoApp { 
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("Cache Demo Application")   
> 
> val sc = new SparkContext(conf)
> val startTime = System.currentTimeMillis()
>   
> 
> val cacheFiller = sc.parallelize(1 to 5, 1000)
> 
>   .mapPartitionsWithIndex {
> case (ix, it) =>
>   println(s"CREATE DATA PARTITION ${ix}") 
> 
>   val r = new scala.util.Random(ix)
>   it.map(x => (r.nextLong, r.nextLong))
>   }
> cacheFiller.persist(StorageLevel.MEMORY_ONLY)
> cacheFiller.foreach(identity)
> val finishTime = System.currentTimeMillis()
> val elapsedTime = (finishTime - startTime) / 1000
> println(s"TIME= $elapsedTime s")
>   }
> }
> {code}
> If I call it the following way, it completes in around 5 minutes on my 
> Laptop, while often stopping for slow Full GC cycles. I can also see with 
> jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled.
> {code}
> sbt package
> ~/spark-1.6.0/bin/spark-submit \
>   --class "CacheDemoApp" \
>   --master "local[2]" \
>   --driver-memory 3g \
>   --driver-java-options "-XX:+PrintGCDetails" \
>   target/scala-2.10/simple-project_2.10-1.0.jar
> {code}
> If I add any one of the below flags, then the run-time drops to around 40-50 
> seconds and the difference is coming from the drop in GC times:
>   --conf "spark.memory.fraction=0.6"
> OR
>   --conf "spark.memory.useLegacyMode=true"
> OR
>   --driver-java-options "-XX:NewRatio=3"
> All the other cache types except for DISK_ONLY produce similar symptoms. It 
> looks like that the problem is that the amount of data Spark wants to store 
> long-term ends up being larger than the old generation size in the JVM and 
> this triggers Full GC repeatedly.
> I did some research:
> * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It 
> defaults to 0.75.
> * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache 
> size. It defaults to 0.6 and...
> * http://spark.apache.org/docs/1.5.2/configuration.html even says that it 
> shouldn't be bigger than the size of the old generation.
> * On the other hand, OpenJDK's default NewRatio is 2, which means an old 
> generation size of 66%. Hence the default value in Spark 1.6 contradicts this 
> advice.
> http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old 
> generation is running close to full, then setting 
> spark.memory.storageFraction to a lower value should help. I have tried with 
> spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is 
> not a surprise: 

[jira] [Comment Edited] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching

2016-06-07 Thread Gabor Feher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318530#comment-15318530
 ] 

Gabor Feher edited comment on SPARK-15796 at 6/7/16 2:15 PM:
-

MEMORY_ONLY caching works in a way that when a partition doesn't fit into the 
memory, then it won't save it in the memory cache region. It prints stuff like 
this:
{code}
16/06/07 06:35:27 INFO MemoryStore: Will not store rdd_1_464 as it would 
require dropping another block from the same RDD
16/06/07 06:35:27 WARN MemoryStore: Not enough space to cache rdd_1_464 in 
memory! (computed 5.5 MB so far)
{code}
MEMORY_AND_DISK caching works in a way that if a partition doesn't fit into the 
memory, then it saves it to the disk. It prints stuff like this:
{code}
16/06/07 06:46:39 WARN CacheManager: Persisting partition rdd_1_99 to disk 
instead.
{code}

In the MEMORY_ONLY case, if I shouldn't expect it to work with too much data as 
you suggest, then why Spark even bothers dropping the blocks from memory? If 
it's a non-goal to store oversized RDDs, then it would be much simpler to just 
throw an OOM.
In the MEMORY_AND_DISK case,  I can see the exact same GC issue with 
MEMORY_ONLY. But there the whole point should be that we are caching RDDs that 
don't fit into the memory, no?

So, these two behaviors made me assume that Spark will work even if I try to 
cache too big stuff. I understand if you say that this is a JVM-implementation 
dependent issue, I have no idea how many people are using other JVMs than 
OpenJDK. But this raises the question: are there any situations when it makes 
sense to raise "spark.memory.fraction" above the old generation size? With 
caching I can say it doesn't make sense, but maybe execution could use it 
meaningfully?

Maybe it is worth mentioning that my use case is not that exotic: we are 
developing a program based on Spark that works with user-provided data: so 
there is no way to say at implementation time whether a particular RDD will fit 
into memory or not.

Speaking of storageFraction, I was not trying to say that there is a problem 
with it. But the following sentence in 
http://spark.apache.org/docs/1.6.1/tuning.html is not correct, if I understand 
correctly:
{quote}
In the GC stats that are printed, if the OldGen is close to being full, reduce 
the amount of memory used for caching by lowering spark.memory.storageFraction; 
it is better to cache fewer objects than to slow down task execution!
{quote}
Because storageFraction will not actually reduce the amount of cache unless 
execution needs more memory.

Thanks for looking into the issue! To sum up, this is at least a bug in the 
documentation:
* tuning.html should have better advice for when OldGen is close to being full
* I'd prefer a mention of these GC issues somewhere near the cache docs, given 
that many people are using OpenJDK with default settings I believe.
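
A back-of-the-envelope illustration of why lowering 
{{spark.memory.storageFraction}} does not shrink the cache. The formulas below 
are assumed from the 1.6 configuration docs, not quoted from Spark's source: 
storageFraction only marks part of the unified region as immune to eviction, 
while storage may still grow to the whole region when execution is idle.
{code}
object UnifiedMemorySketch {
  private val ReservedBytes = 300L * 1024 * 1024 // reserved system memory

  // Returns (unified execution+storage region, eviction-immune storage slice).
  def regions(heapBytes: Long, fraction: Double, storageFraction: Double): (Long, Long) = {
    val unified = ((heapBytes - ReservedBytes) * fraction).toLong
    val evictionImmune = (unified * storageFraction).toLong
    (unified, evictionImmune)
  }

  def main(args: Array[String]): Unit = {
    // 3 GB heap as in the repro, fraction=0.75 (1.6 default), storageFraction=0.1.
    val (unified, immune) = regions(3L * 1024 * 1024 * 1024, 0.75, 0.1)
    println(s"cache can still grow to ~$unified bytes; only $immune bytes are eviction-immune")
  }
}
{code}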


was (Author: gfeher):
MEMORY_ONLY caching works in a way that when a partition doesn't fit into the 
memory, then it won't save it in the memory cache region. It prints stuff like 
this:
{code}
16/06/07 06:35:27 INFO MemoryStore: Will not store rdd_1_464 as it would 
require dropping another block from the same RDD
16/06/07 06:35:27 WARN MemoryStore: Not enough space to cache rdd_1_464 in 
memory! (computed 5.5 MB so far)
{code}
MEMORY_AND_DISK caching works in a way that if a partition doesn't fit into the 
memory, then it saves it to the disk. It prints stuff like this:
{code}
16/06/07 06:46:39 WARN CacheManager: Persisting partition rdd_1_99 to disk 
instead.
{code}

In the MEMORY_ONLY case, if I shouldn't expect it to work with too much data as 
you suggest, then why Spark even bothers dropping the blocks from memory? If 
it's a non-goal to store oversized RDDs, then it would be much simpler to just 
throw an OOM.
In the MEMORY_AND_DISK case,  I can see the exact same GC issue with 
MEMORY_ONLY. But there the whole point should be that we are caching RDDs that 
don't fit into the memory, no?

So, these two behaviors made me assume that Spark will work even if I try to 
cache too big stuff. I understand if you say that this is a JVM-implementation 
dependent issue, I have no idea how many people are using other JVMs than 
OpenJDK. But this raises the question: are there any situations when it makes 
sense to raise "spark.memory.fraction" above the old generation size? With 
caching I can say it doesn't make sense, but maybe execution could use it 
meaningfully?

Maybe it is worth mentioning that my use case is not that exotic: we are 
developing a program based on Spark that works with user-provided data: so 
there is no way to say at implementation time whether a particular RDD will fit 
into memory or not.

Speaking of storageFraction, I was not trying to say that there is a problem 
with it. But the following sentence in 

[jira] [Commented] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching

2016-06-07 Thread Gabor Feher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318530#comment-15318530
 ] 

Gabor Feher commented on SPARK-15796:
-

MEMORY_ONLY caching works in a way that when a partition doesn't fit into the 
memory, then it won't save it in the memory cache region. It prints stuff like 
this:
{code}
16/06/07 06:35:27 INFO MemoryStore: Will not store rdd_1_464 as it would 
require dropping another block from the same RDD
16/06/07 06:35:27 WARN MemoryStore: Not enough space to cache rdd_1_464 in 
memory! (computed 5.5 MB so far)
{code}
MEMORY_AND_DISK caching works in a way that if a partition doesn't fit into the 
memory, then it saves it to the disk. It prints stuff like this:
{code}
16/06/07 06:46:39 WARN CacheManager: Persisting partition rdd_1_99 to disk 
instead.
{code}

In the MEMORY_ONLY case, if I shouldn't expect it to work with too much data as 
you suggest, then why Spark even bothers dropping the blocks from memory? If 
it's a non-goal to store oversized RDDs, then it would be much simpler to just 
throw an OOM.
In the MEMORY_AND_DISK case,  I can see the exact same GC issue with 
MEMORY_ONLY. But there the whole point should be that we are caching RDDs that 
don't fit into the memory, no?

So, these two behaviors made me assume that Spark will work even if I try to 
cache too big stuff. I understand if you say that this is a JVM-implementation 
dependent issue, I have no idea how many people are using other JVMs than 
OpenJDK. But this raises the question: are there any situations when it makes 
sense to raise "spark.memory.fraction" above the old generation size? With 
caching I can say it doesn't make sense, but maybe execution could use it 
meaningfully?

Maybe it is worth mentioning that my use case is not that exotic: we are 
developing a program based on Spark that works with user-provided data: so 
there is no way to say at implementation time whether a particular RDD will fit 
into memory or not.

Speaking of storageFraction, I was not trying to say that there is a problem 
with it. But the following sentence in 
http://spark.apache.org/docs/1.6.1/tuning.html is not correct, if I understand 
correctly:
{quote}
In the GC stats that are printed, if the OldGen is close to being full, reduce 
the amount of memory used for caching by lowering spark.memory.storageFraction; 
it is better to cache fewer objects than to slow down task execution!
{quote}
Because storageFraction will not actually reduce the amount of cache unless 
execution needs more memory.

Thanks for looking into the issue! To sum up, this is at least a bug in the 
documentation:
* tuning.html should have better advice for when OldGen is close to being full
* I'd prefer a mention of these GC issues somewhere near the cache docs, given 
that many people are using OpenJDK with default settings I believe.

> Spark 1.6 default memory settings can cause heavy GC when caching
> -
>
> Key: SPARK-15796
> URL: https://issues.apache.org/jira/browse/SPARK-15796
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gabor Feher
>Priority: Minor
>
> While debugging performance issues in a Spark program, I've found a simple 
> way to slow down Spark 1.6 significantly by filling the RDD memory cache. 
> This seems to be a regression, because setting 
> "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is 
> just a simple program that fills the memory cache of Spark using a 
> MEMORY_ONLY cached RDD (but of course this comes up in more complex 
> situations, too):
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.storage.StorageLevel
> object CacheDemoApp { 
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("Cache Demo Application")   
> 
> val sc = new SparkContext(conf)
> val startTime = System.currentTimeMillis()
>   
> 
> val cacheFiller = sc.parallelize(1 to 5, 1000)
> 
>   .mapPartitionsWithIndex {
> case (ix, it) =>
>   println(s"CREATE DATA PARTITION ${ix}") 
> 
>   val r = new scala.util.Random(ix)
>   it.map(x => (r.nextLong, r.nextLong))
>   }
> cacheFiller.persist(StorageLevel.MEMORY_ONLY)
> cacheFiller.foreach(identity)
> val finishTime = System.currentTimeMillis()
> val elapsedTime = (finishTime - startTime) / 1000
> println(s"TIME= $elapsedTime s")
>   }
> }
> {code}
> If I call it the following way, it 

[jira] [Commented] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching

2016-06-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318526#comment-15318526
 ] 

Sean Owen commented on SPARK-15796:
---

To leave a little extra room and to match the old behavior -- yeah reasonable 
to me. CC [~andrewor14]?

> Spark 1.6 default memory settings can cause heavy GC when caching
> -
>
> Key: SPARK-15796
> URL: https://issues.apache.org/jira/browse/SPARK-15796
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gabor Feher
>Priority: Minor
>
> While debugging performance issues in a Spark program, I've found a simple 
> way to slow down Spark 1.6 significantly by filling the RDD memory cache. 
> This seems to be a regression, because setting 
> "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is 
> just a simple program that fills the memory cache of Spark using a 
> MEMORY_ONLY cached RDD (but of course this comes up in more complex 
> situations, too):
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.storage.StorageLevel
> object CacheDemoApp { 
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("Cache Demo Application")   
> 
> val sc = new SparkContext(conf)
> val startTime = System.currentTimeMillis()
>   
> 
> val cacheFiller = sc.parallelize(1 to 5, 1000)
> 
>   .mapPartitionsWithIndex {
> case (ix, it) =>
>   println(s"CREATE DATA PARTITION ${ix}") 
> 
>   val r = new scala.util.Random(ix)
>   it.map(x => (r.nextLong, r.nextLong))
>   }
> cacheFiller.persist(StorageLevel.MEMORY_ONLY)
> cacheFiller.foreach(identity)
> val finishTime = System.currentTimeMillis()
> val elapsedTime = (finishTime - startTime) / 1000
> println(s"TIME= $elapsedTime s")
>   }
> }
> {code}
> If I call it the following way, it completes in around 5 minutes on my 
> Laptop, while often stopping for slow Full GC cycles. I can also see with 
> jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled.
> {code}
> sbt package
> ~/spark-1.6.0/bin/spark-submit \
>   --class "CacheDemoApp" \
>   --master "local[2]" \
>   --driver-memory 3g \
>   --driver-java-options "-XX:+PrintGCDetails" \
>   target/scala-2.10/simple-project_2.10-1.0.jar
> {code}
> If I add any one of the below flags, then the run-time drops to around 40-50 
> seconds and the difference is coming from the drop in GC times:
>   --conf "spark.memory.fraction=0.6"
> OR
>   --conf "spark.memory.useLegacyMode=true"
> OR
>   --driver-java-options "-XX:NewRatio=3"
> All the other cache types except for DISK_ONLY produce similar symptoms. It 
> looks like that the problem is that the amount of data Spark wants to store 
> long-term ends up being larger than the old generation size in the JVM and 
> this triggers Full GC repeatedly.
> I did some research:
> * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It 
> defaults to 0.75.
> * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache 
> size. It defaults to 0.6 and...
> * http://spark.apache.org/docs/1.5.2/configuration.html even says that it 
> shouldn't be bigger than the size of the old generation.
> * On the other hand, OpenJDK's default NewRatio is 2, which means an old 
> generation size of 66%. Hence the default value in Spark 1.6 contradicts this 
> advice.
> http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old 
> generation is running close to full, then setting 
> spark.memory.storageFraction to a lower value should help. I have tried with 
> spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is 
> not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html 
> explains that storageFraction is not an upper-limit but a lower limit-like 
> thing on the size of Spark's cache. The real upper limit is 
> spark.memory.fraction.
> To sum up my questions/issues:
> * At least http://spark.apache.org/docs/1.6.1/tuning.html should be fixed. 
> Maybe the old generation size should also be mentioned in configuration.html 
> near spark.memory.fraction.
> * Is it a goal for Spark to support heavy caching with default parameters and 
> without GC breakdown? If so, then better default values are needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching

2016-06-07 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318523#comment-15318523
 ] 

Daniel Darabos commented on SPARK-15796:


> The only argument against it was that it's specific to the OpenJDK default.

I think Gabor has only tested with OpenJDK, but the default for {{NewRatio}} is 
the same in Oracle Java 8 Server JVM according to 
https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/sizing.html.

> I think this issue still exists even with the fraction set to 0.66, because 
> of course if you are using any memory at all for other stuff, some of that 
> can't fit in the old generation. There will always be some need to tune GC 
> params when that becomes the bottleneck.

Good point. Maybe 0.6 would be the best default? If everything fit in old-gen 
in 1.5, it would probably still fit in the old-gen that way.

> Spark 1.6 default memory settings can cause heavy GC when caching
> -
>
> Key: SPARK-15796
> URL: https://issues.apache.org/jira/browse/SPARK-15796
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gabor Feher
>Priority: Minor
>
> While debugging performance issues in a Spark program, I've found a simple 
> way to slow down Spark 1.6 significantly by filling the RDD memory cache. 
> This seems to be a regression, because setting 
> "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is 
> just a simple program that fills the memory cache of Spark using a 
> MEMORY_ONLY cached RDD (but of course this comes up in more complex 
> situations, too):
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.storage.StorageLevel
> object CacheDemoApp { 
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("Cache Demo Application")   
> 
> val sc = new SparkContext(conf)
> val startTime = System.currentTimeMillis()
>   
> 
> val cacheFiller = sc.parallelize(1 to 5, 1000)
> 
>   .mapPartitionsWithIndex {
> case (ix, it) =>
>   println(s"CREATE DATA PARTITION ${ix}") 
> 
>   val r = new scala.util.Random(ix)
>   it.map(x => (r.nextLong, r.nextLong))
>   }
> cacheFiller.persist(StorageLevel.MEMORY_ONLY)
> cacheFiller.foreach(identity)
> val finishTime = System.currentTimeMillis()
> val elapsedTime = (finishTime - startTime) / 1000
> println(s"TIME= $elapsedTime s")
>   }
> }
> {code}
> If I call it the following way, it completes in around 5 minutes on my 
> Laptop, while often stopping for slow Full GC cycles. I can also see with 
> jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled.
> {code}
> sbt package
> ~/spark-1.6.0/bin/spark-submit \
>   --class "CacheDemoApp" \
>   --master "local[2]" \
>   --driver-memory 3g \
>   --driver-java-options "-XX:+PrintGCDetails" \
>   target/scala-2.10/simple-project_2.10-1.0.jar
> {code}
> If I add any one of the below flags, then the run-time drops to around 40-50 
> seconds and the difference is coming from the drop in GC times:
>   --conf "spark.memory.fraction=0.6"
> OR
>   --conf "spark.memory.useLegacyMode=true"
> OR
>   --driver-java-options "-XX:NewRatio=3"
> All the other cache types except for DISK_ONLY produce similar symptoms. It 
> looks like that the problem is that the amount of data Spark wants to store 
> long-term ends up being larger than the old generation size in the JVM and 
> this triggers Full GC repeatedly.
> I did some research:
> * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It 
> defaults to 0.75.
> * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache 
> size. It defaults to 0.6 and...
> * http://spark.apache.org/docs/1.5.2/configuration.html even says that it 
> shouldn't be bigger than the size of the old generation.
> * On the other hand, OpenJDK's default NewRatio is 2, which means an old 
> generation size of 66%. Hence the default value in Spark 1.6 contradicts this 
> advice.
> http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old 
> generation is running close to full, then setting 
> spark.memory.storageFraction to a lower value should help. I have tried with 
> spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is 
> not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html 
> explains that storageFraction is not an upper-limit but a lower limit-like 
> thing on the size of Spark's 

[jira] [Commented] (SPARK-15564) App name is the main class name in Spark streaming jobs

2016-06-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318520#comment-15318520
 ] 

Sean Owen commented on SPARK-15564:
---

On further review, I don't see how there's a null appName here. There isn't a 
call to createNewSparkContext with a null app name.  The constructor you invoke 
in both cases preserves the provided conf object, which should have its 
spark.app.name already set. Are you sure there isn't something else at work in 
the code that's omitted here? I don't yet see how this could be a difference.
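
A minimal sketch of the check described above (local master and a 10-second 
batch window are assumptions for the sketch, not taken from the report): the 
StreamingContext constructor keeps the conf it is given, so the app name set on 
that conf should be what the context reports.
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object AppNameCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NDS Transform")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Expected to print "NDS Transform" rather than the main class name.
    println(ssc.sparkContext.getConf.get("spark.app.name"))
    ssc.stop()
  }
}
{code}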

> App name is the main class name in Spark streaming jobs
> ---
>
> Key: SPARK-15564
> URL: https://issues.apache.org/jira/browse/SPARK-15564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Steven Lowenthal
>Priority: Minor
>
> I've tried everything to set the app name to something other than the class 
> name of the job, but Spark reports the application name as the class. This 
> adversely affects the ability to monitor jobs; we can't have dots in the 
> reported app name. 
> {code:title=job.scala}
>   val defaultAppName = "NDS Transform"
>conf.setAppName(defaultAppName)
>println (s"App Name: ${conf.get("spark.app.name")}")
>   ...
>   val ssc = new StreamingContext(conf, streamingBatchWindow)
> {code}
> {code:title=output}
> App Name: NDS Transform
> {code}
> Application IDName
> app-20160526161230-0017 (kill)  com.gracenote.ongo.spark.NDSStreamAvro



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching

2016-06-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318496#comment-15318496
 ] 

Sean Owen commented on SPARK-15796:
---

Yeah, sounds like we should change the default min cache size so that it fits 
in, at least, OpenJDK's default old gen. The only argument against it was that 
it's specific to the OpenJDK default. I don't know if that's really a Spark 
problem; it just raises the JVM tuning issue that was always there. But not 
surprising people out of the box has value too. 

I think this issue still exists even with the fraction set to 0.66, because of 
course if you are using any memory at all for other stuff, some of that can't 
fit in the old generation. There will always be some need to tune GC params 
when that becomes the bottleneck.

> Spark 1.6 default memory settings can cause heavy GC when caching
> -
>
> Key: SPARK-15796
> URL: https://issues.apache.org/jira/browse/SPARK-15796
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gabor Feher
>Priority: Minor
>
> While debugging performance issues in a Spark program, I've found a simple 
> way to slow down Spark 1.6 significantly by filling the RDD memory cache. 
> This seems to be a regression, because setting 
> "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is 
> just a simple program that fills the memory cache of Spark using a 
> MEMORY_ONLY cached RDD (but of course this comes up in more complex 
> situations, too):
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.storage.StorageLevel
> object CacheDemoApp { 
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("Cache Demo Application")   
> 
> val sc = new SparkContext(conf)
> val startTime = System.currentTimeMillis()
>   
> 
> val cacheFiller = sc.parallelize(1 to 5, 1000)
> 
>   .mapPartitionsWithIndex {
> case (ix, it) =>
>   println(s"CREATE DATA PARTITION ${ix}") 
> 
>   val r = new scala.util.Random(ix)
>   it.map(x => (r.nextLong, r.nextLong))
>   }
> cacheFiller.persist(StorageLevel.MEMORY_ONLY)
> cacheFiller.foreach(identity)
> val finishTime = System.currentTimeMillis()
> val elapsedTime = (finishTime - startTime) / 1000
> println(s"TIME= $elapsedTime s")
>   }
> }
> {code}
> If I call it the following way, it completes in around 5 minutes on my 
> Laptop, while often stopping for slow Full GC cycles. I can also see with 
> jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled.
> {code}
> sbt package
> ~/spark-1.6.0/bin/spark-submit \
>   --class "CacheDemoApp" \
>   --master "local[2]" \
>   --driver-memory 3g \
>   --driver-java-options "-XX:+PrintGCDetails" \
>   target/scala-2.10/simple-project_2.10-1.0.jar
> {code}
> If I add any one of the below flags, then the run-time drops to around 40-50 
> seconds and the difference is coming from the drop in GC times:
>   --conf "spark.memory.fraction=0.6"
> OR
>   --conf "spark.memory.useLegacyMode=true"
> OR
>   --driver-java-options "-XX:NewRatio=3"
> All the other cache types except for DISK_ONLY produce similar symptoms. It 
> looks like that the problem is that the amount of data Spark wants to store 
> long-term ends up being larger than the old generation size in the JVM and 
> this triggers Full GC repeatedly.
> I did some research:
> * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It 
> defaults to 0.75.
> * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache 
> size. It defaults to 0.6 and...
> * http://spark.apache.org/docs/1.5.2/configuration.html even says that it 
> shouldn't be bigger than the size of the old generation.
> * On the other hand, OpenJDK's default NewRatio is 2, which means an old 
> generation size of 66%. Hence the default value in Spark 1.6 contradicts this 
> advice.
> http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old 
> generation is running close to full, then setting 
> spark.memory.storageFraction to a lower value should help. I have tried with 
> spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is 
> not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html 
> explains that storageFraction is not an upper-limit but a lower limit-like 
> thing on the size of Spark's cache. The real upper limit is 
> spark.memory.fraction.
> To sum up my questions/issues:
> * 

[jira] [Commented] (SPARK-15065) HiveSparkSubmitSuite's "set spark.sql.warehouse.dir" is flaky

2016-06-07 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318492#comment-15318492
 ] 

Pete Robbins commented on SPARK-15065:
--

I think this may be related to 
https://issues.apache.org/jira/browse/SPARK-15606 where there is a deadlock in 
executor shutdown. This test was consistently failing on our machine with only 
2 cores, but since my fix for SPARK-15606 it has passed every time.

> HiveSparkSubmitSuite's "set spark.sql.warehouse.dir" is flaky
> -
>
> Key: SPARK-15065
> URL: https://issues.apache.org/jira/browse/SPARK-15065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
>Priority: Critical
> Attachments: log.txt
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/861/testReport/junit/org.apache.spark.sql.hive/HiveSparkSubmitSuite/dir/
> There are several WARN messages like {{16/05/02 00:51:06 WARN Master: Got 
> status update for unknown executor app-20160502005054-/3}}, which are 
> suspicious. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching

2016-06-07 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318486#comment-15318486
 ] 

Daniel Darabos commented on SPARK-15796:


The example program takes less than a minute on Spark 1.5 and 5 minutes on 
Spark 1.6, using the default configuration in both cases. In neither case do we 
run out of memory.

The old generation size defaults to 66% and Spark caching in Spark 1.5 defaults 
to 60%, so with default settings the cache fits in the old generation in 1.5. 
But in 1.6 the default cache size is increased to 75% so it no longer fits in 
the old generation. This kills performance. (And the regression is very hard to 
debug. Kudos to Gabor Feher!)

The default settings have been changed in Spark 1.6 to give a 5x slowdown, and 
the documentation for the current settings does not make a note of this. Only 
the documentation for the deprecated {{spark.storage.memoryFraction}} mentions 
the issue, but its default value had been chosen so that the issue was not 
triggered by default. This also has to be documented for the new settings.

Unless someone never uses cache, they are going to hit this issue if they run 
with the default settings. I think this is bad enough to warrant changing the 
defaults. I propose defaulting {{spark.memory.fraction}} to 0.6. If someone 
wants to set {{spark.memory.fraction}} to 0.75 they need to also set 
{{-XX:NewRatio=3}} to avoid GC thrashing. (Another option is to set 
{{-XX:NewRatio=3}} by default, but I think it's a vendor-specific flag.)

What is the argument against defaulting {{spark.memory.fraction}} to 0.6?

> Spark 1.6 default memory settings can cause heavy GC when caching
> -
>
> Key: SPARK-15796
> URL: https://issues.apache.org/jira/browse/SPARK-15796
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gabor Feher
>Priority: Minor
>
> While debugging performance issues in a Spark program, I've found a simple 
> way to slow down Spark 1.6 significantly by filling the RDD memory cache. 
> This seems to be a regression, because setting 
> "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is 
> just a simple program that fills the memory cache of Spark using a 
> MEMORY_ONLY cached RDD (but of course this comes up in more complex 
> situations, too):
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.storage.StorageLevel
> object CacheDemoApp { 
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("Cache Demo Application")   
> 
> val sc = new SparkContext(conf)
> val startTime = System.currentTimeMillis()
>   
> 
> val cacheFiller = sc.parallelize(1 to 5, 1000)
> 
>   .mapPartitionsWithIndex {
> case (ix, it) =>
>   println(s"CREATE DATA PARTITION ${ix}") 
> 
>   val r = new scala.util.Random(ix)
>   it.map(x => (r.nextLong, r.nextLong))
>   }
> cacheFiller.persist(StorageLevel.MEMORY_ONLY)
> cacheFiller.foreach(identity)
> val finishTime = System.currentTimeMillis()
> val elapsedTime = (finishTime - startTime) / 1000
> println(s"TIME= $elapsedTime s")
>   }
> }
> {code}
> If I call it the following way, it completes in around 5 minutes on my 
> Laptop, while often stopping for slow Full GC cycles. I can also see with 
> jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled.
> {code}
> sbt package
> ~/spark-1.6.0/bin/spark-submit \
>   --class "CacheDemoApp" \
>   --master "local[2]" \
>   --driver-memory 3g \
>   --driver-java-options "-XX:+PrintGCDetails" \
>   target/scala-2.10/simple-project_2.10-1.0.jar
> {code}
> If I add any one of the below flags, then the run-time drops to around 40-50 
> seconds and the difference is coming from the drop in GC times:
>   --conf "spark.memory.fraction=0.6"
> OR
>   --conf "spark.memory.useLegacyMode=true"
> OR
>   --driver-java-options "-XX:NewRatio=3"
> All the other cache types except for DISK_ONLY produce similar symptoms. It 
> looks like that the problem is that the amount of data Spark wants to store 
> long-term ends up being larger than the old generation size in the JVM and 
> this triggers Full GC repeatedly.
> I did some research:
> * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It 
> defaults to 0.75.
> * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache 
> size. It defaults to 0.6 and...
> * 

[jira] [Resolved] (SPARK-15787) Display more helpful error messages for several invalid operations

2016-06-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15787.
---
   Resolution: Duplicate
Fix Version/s: (was: 1.2.1)

Please comment on the other JIRA with details, and if it's the same issue we 
can reopen it.

> Display more helpful error messages for several invalid operations
> --
>
> Key: SPARK-15787
> URL: https://issues.apache.org/jira/browse/SPARK-15787
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: nalin garg
>
> Referencing SPARK-5063. The issue has reappeared.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15801) spark-submit --num-executors switch also works without YARN

2016-06-07 Thread Jonathan Taws (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318447#comment-15318447
 ] 

Jonathan Taws commented on SPARK-15801:
---

Indeed, I am getting the same behavior. After quickly sifting through the code, 
it looks like the num-executors option isn't taken into account for the 
standalone mode, based on the 
{{[allocateWorkerResourceToExecutors|https://github.com/apache/spark/blob/d5911d1173fe0872f21cae6c47abf8ff479345a4/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L673]}}
 method. 
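
As a rough illustration of that allocation (a simplified assumption, not the 
actual Master code): when only {{--executor-cores}} is set, standalone mode 
keeps carving executors out of each worker's free cores, so the executor count 
per worker falls out of the division below rather than out of --num-executors.
{code}
object StandaloneExecutorMath {
  // Simplified: executors a single worker can host given an executor core size.
  def executorsPerWorker(workerFreeCores: Int, executorCores: Int): Int =
    workerFreeCores / executorCores

  def main(args: Array[String]): Unit = {
    // e.g. an 8-core worker with --executor-cores 2 ends up hosting 4 executors.
    println(executorsPerWorker(workerFreeCores = 8, executorCores = 2))
  }
}
{code}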

> spark-submit --num-executors switch also works without YARN
> ---
>
> Key: SPARK-15801
> URL: https://issues.apache.org/jira/browse/SPARK-15801
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> Based on this [issue|https://issues.apache.org/jira/browse/SPARK-15781] 
> regarding the SPARK_WORKER_INSTANCES property, I also found that the 
> {{--num-executors}} switch documented in the spark-submit help is partially 
> incorrect. 
> Here's one part of the output (produced by {{spark-submit --help}}): 
> {code}
> YARN-only:
>   --driver-cores NUM  Number of cores used by the driver, only in 
> cluster mode
>   (Default: 1).
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --num-executors NUM Number of executors to launch (Default: 2).
> {code}
> Correct me if I am wrong, but the num-executors switch also works in Spark 
> standalone mode *without YARN*.
> I tried by only launching a master and a worker with 4 executors specified, 
> and they were all successfully spawned. The master switch pointed to the 
> master's url, and not to the yarn value. 
> Here's the exact command : {{spark-submit --master spark://[local 
> machine]:7077 --num-executors 4 --executor-cores 2}}
> By default it is *1* executor per worker in Spark standalone mode without 
> YARN, but this option makes it possible to specify the number of executors (per 
> worker?) if, and only if, the {{--executor-cores}} switch is also set. I do believe 
> it defaults to 2 in YARN mode. 
> I would propose to move the option from the *YARN-only* section to the *Spark 
> standalone and YARN only* section.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15779) SQL context fails when Hive uses Tez as its default execution engine

2016-06-07 Thread Alexandre Linte (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318443#comment-15318443
 ] 

Alexandre Linte commented on SPARK-15779:
-

Thank you for your reply Zhang,

You're right, I'm using the same hive-site.xml for Hive and Spark (this is a 
symbolic link). I will try with a copy of the hive-site.xml for Spark. 

> SQL context fails when Hive uses Tez as its default execution engine
> 
>
> Key: SPARK-15779
> URL: https://issues.apache.org/jira/browse/SPARK-15779
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, SQL
>Affects Versions: 1.6.1
> Environment: Hadoop 2.7.2, Spark 1.6.1, Hive 2.0.1, Tez 0.8.3
>Reporter: Alexandre Linte
>
> By default, Hive uses MapReduce as its default execution engine. Since Hive 
> 2.0.0, MapReduce is deprecated.
> To avoid this deprecation, I decided to use Tez instead of MapReduce as the 
> default execution engine. Unfortunately, this choice had an impact on Spark.
> Now when I start Spark the SQL context fails with the following error:
> {noformat}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>   /_/
> Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_85)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:529)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:204)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
> at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
> at 
> org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:440)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
> at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
> at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at 
> org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
> at $iwC$$iwC.(:15)
> at $iwC.(:24)
> at (:26)
> at .(:30)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124)
> at 
> org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
> at 
> 

[jira] [Commented] (SPARK-15781) Misleading deprecated property in standalone cluster configuration documentation

2016-06-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318439#comment-15318439
 ] 

Sean Owen commented on SPARK-15781:
---

Yeah, I'd love for someone who really knows standalone to confirm that. If it's 
true, OK. Empirically that does look right.

> Misleading deprecated property in standalone cluster configuration 
> documentation
> 
>
> Key: SPARK-15781
> URL: https://issues.apache.org/jira/browse/SPARK-15781
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> I am unsure if this is regarded as an issue or not, but in the 
> [latest|http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts]
>  documentation for the configuration to launch Spark in stand-alone cluster 
> mode, the following property is documented :
> |SPARK_WORKER_INSTANCES|  Number of worker instances to run on each 
> machine (default: 1). You can make this more than 1 if you have very 
> large machines and would like multiple Spark worker processes. If you do set 
> this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores 
> per worker, or else each worker will try to use all the cores.| 
> However, once I launch Spark with the spark-submit utility and the property 
> {{SPARK_WORKER_INSTANCES}} set in my spark-env.sh file, I get the following 
> deprecated warning : 
> {code}
> 16/06/06 16:38:28 WARN SparkConf: 
> SPARK_WORKER_INSTANCES was detected (set to '4').
> This is deprecated in Spark 1.0+.
> Please instead use:
>  - ./spark-submit with --num-executors to specify the number of executors
>  - Or set SPARK_EXECUTOR_INSTANCES
>  - spark.executor.instances to configure the number of instances in the spark 
> config.
> {code}
> Is this regarded as normal practice to have deprecated fields documented in 
> the documentation ? 
> I would have preferred to directly know about the --num-executors property 
> than to have to submit my application and find a deprecated warning. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15792) [SQL] Allows operator to change the verbosity in explain output.

2016-06-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15792:
--
Assignee: Sean Zhong

> [SQL] Allows operator to change the verbosity in explain output.
> 
>
> Key: SPARK-15792
> URL: https://issues.apache.org/jira/browse/SPARK-15792
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>Priority: Minor
> Fix For: 2.0.0
>
>
> We should allow an operator (physical plan or logical plan) to change the 
> verbosity of the explain output.
> For example, we may not want to display {{output=[count(a)#48L]}} in 
> less-verbose mode.
> {code}
> scala> spark.sql("select count(a) from df").explain()
> == Physical Plan ==
> *HashAggregate(key=[], functions=[count(1)], output=[count(a)#48L])
> +- Exchange SinglePartition
>+- *HashAggregate(key=[], functions=[partial_count(1)], output=[count#50L])
>   +- LocalTableScan
> {code}
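
For context, {{explain()}} already exposes a coarse, whole-plan verbosity switch; the proposal here is finer grained, letting each operator decide which fields (such as {{output=[count(a)#48L]}}) it prints at a given verbosity level. A small illustration of the existing switch, assuming the same {{df}} table as in the snippet above:

{code}
// Existing, plan-level verbosity control on a Dataset/DataFrame:
spark.sql("select count(a) from df").explain()      // physical plan only
spark.sql("select count(a) from df").explain(true)  // parsed, analyzed, optimized and physical plans
{code}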



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15781) Misleading deprecated property in standalone cluster configuration documentation

2016-06-07 Thread Jonathan Taws (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318426#comment-15318426
 ] 

Jonathan Taws edited comment on SPARK-15781 at 6/7/16 1:00 PM:
---

Then a short sentence such as this one could do the trick after the end of [this 
section|http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications]:
If you are looking to run multiple executors on the same worker, you can pass 
the --executor-cores option, which will create as many executors with that many 
cores as there are cores available on the worker. 


was (Author: jonathantaws):
Then a short sentence such as this one could do the trick:
If you are looking to run multiple executors on the same worker, you can pass 
the --executor-cores option, which will create as many executors with that many 
cores as there are cores available on the worker. 

> Misleading deprecated property in standalone cluster configuration 
> documentation
> 
>
> Key: SPARK-15781
> URL: https://issues.apache.org/jira/browse/SPARK-15781
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> I am unsure if this is regarded as an issue or not, but in the 
> [latest|http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts]
>  documentation for the configuration to launch Spark in stand-alone cluster 
> mode, the following property is documented:
> |SPARK_WORKER_INSTANCES|  Number of worker instances to run on each 
> machine (default: 1). You can make this more than 1 if you have very 
> large machines and would like multiple Spark worker processes. If you do set 
> this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores 
> per worker, or else each worker will try to use all the cores.| 
> However, once I launch Spark with the spark-submit utility and the property 
> {{SPARK_WORKER_INSTANCES}} set in my spark-env.sh file, I get the following 
> deprecation warning: 
> {code}
> 16/06/06 16:38:28 WARN SparkConf: 
> SPARK_WORKER_INSTANCES was detected (set to '4').
> This is deprecated in Spark 1.0+.
> Please instead use:
>  - ./spark-submit with --num-executors to specify the number of executors
>  - Or set SPARK_EXECUTOR_INSTANCES
>  - spark.executor.instances to configure the number of instances in the spark 
> config.
> {code}
> Is it regarded as normal practice to document deprecated fields in the 
> documentation? 
> I would have preferred to learn about the --num-executors property directly, 
> rather than having to submit my application and find a deprecation warning. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15781) Misleading deprecated property in standalone cluster configuration documentation

2016-06-07 Thread Jonathan Taws (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318426#comment-15318426
 ] 

Jonathan Taws commented on SPARK-15781:
---

Then a short sentence such as this one could do the trick:
If you are looking to run multiple executors on the same worker, you can pass 
the --executor-cores option, which will create as many executors with that many 
cores as there are cores available on the worker. 
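
For what it's worth, the properties behind those switches can also be set programmatically ({{--num-executors}} maps to {{spark.executor.instances}} and {{--executor-cores}} to {{spark.executor.cores}}). A minimal sketch with a hypothetical master URL; whether {{spark.executor.instances}} is honoured outside of YARN is part of what this ticket asks the documentation to clarify:

{code}
// Sketch only: the configuration properties behind the command-line switches.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // hypothetical standalone master URL
  .setAppName("executor-config-example")
  .set("spark.executor.cores", "2")        // cores per executor
  .set("spark.executor.instances", "4")    // number of executors (YARN semantics)

val sc = new SparkContext(conf)
{code}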

> Misleading deprecated property in standalone cluster configuration 
> documentation
> 
>
> Key: SPARK-15781
> URL: https://issues.apache.org/jira/browse/SPARK-15781
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> I am unsure if this is regarded as an issue or not, but in the 
> [latest|http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts]
>  documentation for the configuration to launch Spark in stand-alone cluster 
> mode, the following property is documented:
> |SPARK_WORKER_INSTANCES|  Number of worker instances to run on each 
> machine (default: 1). You can make this more than 1 if you have very 
> large machines and would like multiple Spark worker processes. If you do set 
> this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores 
> per worker, or else each worker will try to use all the cores.| 
> However, once I launch Spark with the spark-submit utility and the property 
> {{SPARK_WORKER_INSTANCES}} set in my spark-env.sh file, I get the following 
> deprecation warning: 
> {code}
> 16/06/06 16:38:28 WARN SparkConf: 
> SPARK_WORKER_INSTANCES was detected (set to '4').
> This is deprecated in Spark 1.0+.
> Please instead use:
>  - ./spark-submit with --num-executors to specify the number of executors
>  - Or set SPARK_EXECUTOR_INSTANCES
>  - spark.executor.instances to configure the number of instances in the spark 
> config.
> {code}
> Is it regarded as normal practice to document deprecated fields in the 
> documentation? 
> I would have preferred to learn about the --num-executors property directly, 
> rather than having to submit my application and find a deprecation warning. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15804) Manually added metadata not saving with parquet

2016-06-07 Thread Charlie Evans (JIRA)
Charlie Evans created SPARK-15804:
-

 Summary: Manually added metadata not saving with parquet
 Key: SPARK-15804
 URL: https://issues.apache.org/jira/browse/SPARK-15804
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Charlie Evans


Adding metadata with col().as(_, metadata) and then saving the resulting dataframe 
does not persist the metadata. No error is thrown. The schema contains the metadata 
before saving, but no longer contains it after the dataframe is saved and loaded 
back.

{code}
case class TestRow(a: String, b: Int)
val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil
val df = spark.createDataFrame(rows)
import org.apache.spark.sql.types.MetadataBuilder
val md = new MetadataBuilder().putString("key", "value").build()
val dfWithMeta = df.select(col("a"), col("b").as("b", md))
println(dfWithMeta.schema.json)
dfWithMeta.write.parquet("dfWithMeta")

val dfWithMeta2 = spark.read.parquet("dfWithMeta")
println(dfWithMeta2.schema.json)
{code}
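
A possible workaround while this is open (a sketch, assuming the reading side still has access to the original schema, for example by storing {{dfWithMeta.schema.json}} alongside the data) is to re-apply the schema when loading, since a user-supplied schema, including its metadata, is taken as-is by the reader:

{code}
// Workaround sketch: re-apply the original schema (with metadata) on read.
val schemaWithMeta = dfWithMeta.schema
val dfWithMeta3 = spark.read.schema(schemaWithMeta).parquet("dfWithMeta")
println(dfWithMeta3.schema.json)  // metadata for column "b" should be present again
{code}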



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15801) spark-submit --num-executors switch also works without YARN

2016-06-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318392#comment-15318392
 ] 

Sean Owen commented on SPARK-15801:
---

I get the result you get _without_ {{--num-executors}}. I've kind of forgotten 
how standalone mode is supposed to work, so hopefully that is still expected 
behavior. But {{--num-executors}} doesn't seem to do anything. I get 4 
regardless of the value I set. CC [~vanzin] to see if that's supposed to 
generate a warning or whatever.

> spark-submit --num-executors switch also works without YARN
> ---
>
> Key: SPARK-15801
> URL: https://issues.apache.org/jira/browse/SPARK-15801
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> Based on this [issue|https://issues.apache.org/jira/browse/SPARK-15781] 
> regarding the SPARK_WORKER_INSTANCES property, I also found that the 
> {{--num-executors}} switch documented in the spark-submit help is partially 
> incorrect. 
> Here's one part of the output (produced by {{spark-submit --help}}): 
> {code}
> YARN-only:
>   --driver-cores NUM  Number of cores used by the driver, only in 
> cluster mode
>   (Default: 1).
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --num-executors NUM Number of executors to launch (Default: 2).
> {code}
> Correct me if I am wrong, but the num-executors switch also works in Spark 
> standalone mode *without YARN*.
> I tried by only launching a master and a worker with 4 executors specified, 
> and they were all successfully spawned. The master switch pointed to the 
> master's url, and not to the yarn value. 
> Here's the exact command: {{spark-submit --master spark://[local 
> machine]:7077 --num-executors 4 --executor-cores 2}}
> By default it is *1* executor per worker in Spark standalone mode without 
> YARN, but this option makes it possible to specify the number of executors (per 
> worker?) if, and only if, the {{--executor-cores}} switch is also set. I do believe 
> it defaults to 2 in YARN mode. 
> I would propose to move the option from the *YARN-only* section to the *Spark 
> standalone and YARN only* section.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-15802) SparkSQL connection fail using shell command "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default""

2016-06-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-15802:
---

oops, didn't yet mean to resolve

> SparkSQL connection fail using shell command "bin/beeline -u 
> "jdbc:hive2://*.*.*.*:1/default""
> --
>
> Key: SPARK-15802
> URL: https://issues.apache.org/jira/browse/SPARK-15802
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: marymwu
>
> reproduce steps:
> 1. execute shell "sbin/start-thriftserver.sh --master yarn";
> 2. execute shell "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default"";
> Actual result:
> SparkSQL connection failed and the log shows the following:
> 16/06/07 14:49:18 WARN HttpParser: Illegal character 0x1 in state=START for 
> buffer 
> HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type:
>  application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00}
> 16/06/07 14:49:18 WARN HttpParser: badMessage: 400 Illegal character 0x1 for 
> HttpChannelOverHttp@718db102{r=0,c=false,a=IDLE,uri=}
> 16/06/07 14:49:19 WARN HttpParser: Illegal character 0x1 in state=START for 
> buffer 
> HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type:
>  application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00}
> 16/06/07 14:49:19 WARN HttpParser: badMessage: 400 Illegal character 0x1 for 
> HttpChannelOverHttp@195db217{r=0,c=false,a=IDLE,uri=}
> note:
> SparkSQL connection succeeds if using the shell command "bin/beeline -u 
> "jdbc:hive2://*.*.*.*:10000/default;transportMode=http;httpPath=cliservice""
> Two parameters (transportMode and httpPath) have been added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15802) SparkSQL connection fail using shell command "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default""

2016-06-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15802.
---
Resolution: Fixed

Doesn't that just mean you used the wrong protocol, and when you specified the 
right protocol, it worked? I don't see a Spark problem there.

> SparkSQL connection fail using shell command "bin/beeline -u 
> "jdbc:hive2://*.*.*.*:1/default""
> --
>
> Key: SPARK-15802
> URL: https://issues.apache.org/jira/browse/SPARK-15802
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: marymwu
>
> reproduce steps:
> 1. execute shell "sbin/start-thriftserver.sh --master yarn";
> 2. execute shell "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default"";
> Actual result:
> SparkSQL connection failed and the log shows the following:
> 16/06/07 14:49:18 WARN HttpParser: Illegal character 0x1 in state=START for 
> buffer 
> HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type:
>  application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00}
> 16/06/07 14:49:18 WARN HttpParser: badMessage: 400 Illegal character 0x1 for 
> HttpChannelOverHttp@718db102{r=0,c=false,a=IDLE,uri=}
> 16/06/07 14:49:19 WARN HttpParser: Illegal character 0x1 in state=START for 
> buffer 
> HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type:
>  application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00}
> 16/06/07 14:49:19 WARN HttpParser: badMessage: 400 Illegal character 0x1 for 
> HttpChannelOverHttp@195db217{r=0,c=false,a=IDLE,uri=}
> note:
> SparkSQL connection succeeds if using the shell command "bin/beeline -u 
> "jdbc:hive2://*.*.*.*:10000/default;transportMode=http;httpPath=cliservice"""
> Two parameters (transportMode and httpPath) have been added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15781) Misleading deprecated property in standalone cluster configuration documentation

2016-06-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318379#comment-15318379
 ] 

Sean Owen commented on SPARK-15781:
---

These are reasonable ideas, though I think the idea is to move away from env 
variables entirely eventually. Hence I'd be fine just removing this deprecated 
one.

> Misleading deprecated property in standalone cluster configuration 
> documentation
> 
>
> Key: SPARK-15781
> URL: https://issues.apache.org/jira/browse/SPARK-15781
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> I am unsure if this is regarded as an issue or not, but in the 
> [latest|http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts]
>  documentation for the configuration to launch Spark in stand-alone cluster 
> mode, the following property is documented:
> |SPARK_WORKER_INSTANCES|  Number of worker instances to run on each 
> machine (default: 1). You can make this more than 1 if you have very 
> large machines and would like multiple Spark worker processes. If you do set 
> this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores 
> per worker, or else each worker will try to use all the cores.| 
> However, once I launch Spark with the spark-submit utility and the property 
> {{SPARK_WORKER_INSTANCES}} set in my spark-env.sh file, I get the following 
> deprecation warning: 
> {code}
> 16/06/06 16:38:28 WARN SparkConf: 
> SPARK_WORKER_INSTANCES was detected (set to '4').
> This is deprecated in Spark 1.0+.
> Please instead use:
>  - ./spark-submit with --num-executors to specify the number of executors
>  - Or set SPARK_EXECUTOR_INSTANCES
>  - spark.executor.instances to configure the number of instances in the spark 
> config.
> {code}
> Is it regarded as normal practice to document deprecated fields in the 
> documentation? 
> I would have preferred to learn about the --num-executors property directly, 
> rather than having to submit my application and find a deprecation warning. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15803) Support with statement syntax for SparkSession

2016-06-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15803:


Assignee: (was: Apache Spark)

> Support with statement syntax for SparkSession
> --
>
> Key: SPARK-15803
> URL: https://issues.apache.org/jira/browse/SPARK-15803
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Priority: Minor
>
> It would be nice to support the with statement syntax for SparkSession, like 
> the following
> {code}
> with SparkSession.builder.(...).getOrCreate() as session:
>   session.sql("show tables").show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15803) Support with statement syntax for SparkSession

2016-06-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318375#comment-15318375
 ] 

Apache Spark commented on SPARK-15803:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/13541

> Support with statement syntax for SparkSession
> --
>
> Key: SPARK-15803
> URL: https://issues.apache.org/jira/browse/SPARK-15803
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Priority: Minor
>
> It would be nice to support the with statement syntax for SparkSession, like 
> the following
> {code}
> with SparkSession.builder.(...).getOrCreate() as session:
>   session.sql("show tables").show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15803) Support with statement syntax for SparkSession

2016-06-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15803:


Assignee: Apache Spark

> Support with statement syntax for SparkSession
> --
>
> Key: SPARK-15803
> URL: https://issues.apache.org/jira/browse/SPARK-15803
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> It would be nice to support the with statement syntax for SparkSession, like 
> the following
> {code}
> with SparkSession.builder.(...).getOrCreate() as session:
>   session.sql("show tables").show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15803) Support with statement syntax for SparkSession

2016-06-07 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-15803:
--

 Summary: Support with statement syntax for SparkSession
 Key: SPARK-15803
 URL: https://issues.apache.org/jira/browse/SPARK-15803
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Jeff Zhang
Priority: Minor


It would be nice to support the with statement syntax for SparkSession, like 
the following
{code}
with SparkSession.builder.(...).getOrCreate() as session:
  session.sql("show tables").show()

{code}
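
This only requires the standard context-manager protocol on SparkSession. A hypothetical, stripped-down illustration of that protocol (not taken from the actual PySpark change):

{code}
# Hypothetical sketch: any object exposing __enter__/__exit__ works in a with
# statement, so SparkSession would only need these two methods, with __exit__
# calling stop().
class ClosingSession(object):
    def __init__(self, session):
        self.session = session

    def __enter__(self):
        # Whatever is returned here is what the `as` clause binds to.
        return self.session

    def __exit__(self, exc_type, exc_value, traceback):
        self.session.stop()  # always stop, even if the block raised
        return False         # do not suppress exceptions
{code}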



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15801) spark-submit --num-executors switch also works without YARN

2016-06-07 Thread Jonathan Taws (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318185#comment-15318185
 ] 

Jonathan Taws edited comment on SPARK-15801 at 6/7/16 10:13 AM:


It is mandatory to add the --executor-cores option for it to work; I added the 
exact command to the description.


was (Author: jonathantaws):
It is mandatory to add the --executor-cores option for it to work; I will add the 
exact command to the description.

> spark-submit --num-executors switch also works without YARN
> ---
>
> Key: SPARK-15801
> URL: https://issues.apache.org/jira/browse/SPARK-15801
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Submit
>Affects Versions: 1.6.1
>Reporter: Jonathan Taws
>Priority: Minor
>
> Based on this [issue|https://issues.apache.org/jira/browse/SPARK-15781] 
> regarding the SPARK_WORKER_INSTANCES property, I also found that the 
> {{--num-executors}} switch documented in the spark-submit help is partially 
> incorrect. 
> Here's one part of the output (produced by {{spark-submit --help}}): 
> {code}
> YARN-only:
>   --driver-cores NUM  Number of cores used by the driver, only in 
> cluster mode
>   (Default: 1).
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --num-executors NUM Number of executors to launch (Default: 2).
> {code}
> Correct me if I am wrong, but the num-executors switch also works in Spark 
> standalone mode *without YARN*.
> I tried by only launching a master and a worker with 4 executors specified, 
> and they were all successfully spawned. The master switch pointed to the 
> master's url, and not to the yarn value. 
> Here's the exact command: {{spark-submit --master spark://[local 
> machine]:7077 --num-executors 4 --executor-cores 2}}
> By default it is *1* executor per worker in Spark standalone mode without 
> YARN, but this option makes it possible to specify the number of executors (per 
> worker?) if, and only if, the {{--executor-cores}} switch is also set. I do believe 
> it defaults to 2 in YARN mode. 
> I would propose to move the option from the *YARN-only* section to the *Spark 
> standalone and YARN only* section.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


