[jira] [Created] (SPARK-15814) Aggregator can return null result
Wenchen Fan created SPARK-15814: --- Summary: Aggregator can return null result Key: SPARK-15814 URL: https://issues.apache.org/jira/browse/SPARK-15814 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9623) RandomForestRegressor: provide variance of predictions
[ https://issues.apache.org/jira/browse/SPARK-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15320036#comment-15320036 ] Yanbo Liang commented on SPARK-9623: [~MechCoder] I'm not working on this, please feel free to take over. > RandomForestRegressor: provide variance of predictions > -- > > Key: SPARK-9623 > URL: https://issues.apache.org/jira/browse/SPARK-9623 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Variance of predicted value, as estimated from training data. > Analogous to class probabilities for classification. > See [SPARK-3727] for discussion.
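The requested quantity can be estimated from the spread of the individual trees' predictions, analogous to how class probabilities average votes in classification. A minimal sketch (not Spark's API; `tree_variance` is a hypothetical helper name) of that estimate:

```python
# Hedged sketch: estimate a random forest's prediction variance for one
# input row as the variance of the per-tree predictions for that row.

def tree_variance(tree_predictions):
    """Mean and population variance of an ensemble's per-tree predictions."""
    n = len(tree_predictions)
    mean = sum(tree_predictions) / n
    # Population variance across the trees' outputs.
    var = sum((p - mean) ** 2 for p in tree_predictions) / n
    return mean, var

# Example: three trees predict 1.0, 2.0 and 3.0 for the same row.
mean, var = tree_variance([1.0, 2.0, 3.0])
```

A wide spread among trees signals low confidence in the ensemble's mean prediction, which is the information the ticket asks the regressor to expose.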
[jira] [Commented] (SPARK-15369) Investigate selectively using Jython for parts of PySpark
[ https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319966#comment-15319966 ] holdenk commented on SPARK-15369: - WIP design document https://docs.google.com/document/d/1L-F12nVWSLEOW72sqOn6Mt1C0bcPFP9ck7gEMH2_IXE/edit?usp=sharing > Investigate selectively using Jython for parts of PySpark > - > > Key: SPARK-15369 > URL: https://issues.apache.org/jira/browse/SPARK-15369 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: holdenk >Priority: Minor > > Transferring data from the JVM to the Python executor can be a substantial > bottleneck. While Jython is not suitable for all UDFs or map functions, it > may be suitable for some simple ones. We should investigate the option of > using Jython to accelerate these small functions.
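The bottleneck the ticket targets is the per-row serialize/deserialize hop between the JVM and the external Python worker, which evaluating a simple UDF in-process (as Jython would) avoids. A hedged illustration, not Spark code, with both paths simulated in plain Python:

```python
# Hedged illustration: compare a "Python executor" path, where every row
# crosses a serialization boundary in both directions, with an in-process
# "Jython-style" path where the UDF runs without any per-row copying.
import pickle

def python_worker_apply(f, serialized_rows):
    """Simulate the Python-executor path: deserialize each row, apply the
    UDF, and reserialize the result."""
    out = []
    for blob in serialized_rows:
        row = pickle.loads(blob)          # JVM -> Python copy
        out.append(pickle.dumps(f(row)))  # Python -> JVM copy
    return out

def in_process_apply(f, rows):
    """Simulate the Jython-style path: no serialization hop per row."""
    return [f(row) for row in rows]

double = lambda x: x * 2
rows = list(range(5))
blobs = [pickle.dumps(r) for r in rows]
roundtrip = [pickle.loads(b) for b in python_worker_apply(double, blobs)]
direct = in_process_apply(double, rows)
# Both paths compute the same result; only the copying overhead differs.
```

Small pure functions like `double` are exactly the "simple" UDFs the ticket proposes running via Jython, since they need no CPython-only libraries.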
[jira] [Commented] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading
[ https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319930#comment-15319930 ] Apache Spark commented on SPARK-15813: -- User 'peterableda' has created a pull request for this issue: https://github.com/apache/spark/pull/13552 > Spark Dyn Allocation Cancel log message misleading > -- > > Key: SPARK-15813 > URL: https://issues.apache.org/jira/browse/SPARK-15813 > Project: Spark > Issue Type: Bug >Reporter: Peter Ableda >Priority: Trivial > > *Driver requested* message is logged before the *Canceling* message but has > the updated executor number. The messages are misleading. > See log snippet: > {code} > 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of > 619 executor(s). > 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor > containers > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage > 0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage > 0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage > 0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage > 0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage > 0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage > 0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage > 0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 
386.0 in stage > 0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage > 0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000) > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage > 0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 > bytes) > 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of > 614 executor(s). > 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor > containers > 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage > 0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 > bytes) > {code} > The easy solution is to update the message to use past tense. This is > consistent with the other messages there. > *Canceled requests for 5 executor container(s).*
[jira] [Assigned] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading
[ https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15813: Assignee: (was: Apache Spark) > Spark Dyn Allocation Cancel log message misleading > -- > > Key: SPARK-15813 > URL: https://issues.apache.org/jira/browse/SPARK-15813 > Project: Spark > Issue Type: Bug >Reporter: Peter Ableda >Priority: Trivial
[jira] [Assigned] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading
[ https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15813: Assignee: Apache Spark > Spark Dyn Allocation Cancel log message misleading > -- > > Key: SPARK-15813 > URL: https://issues.apache.org/jira/browse/SPARK-15813 > Project: Spark > Issue Type: Bug >Reporter: Peter Ableda >Assignee: Apache Spark >Priority: Trivial
[jira] [Updated] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading
[ https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Ableda updated SPARK-15813: - Description: *Driver requested* message is logged before the *Canceling* message but has the updated executor number. The messages are misleading. See log snippet: {code} 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 619 executor(s). 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor containers 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 bytes) 
16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 614 executor(s). 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor containers 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 bytes) {code} The easy solution is to update the message to use past sentence. This is consistent with the other messages there. *Canceled requests for 5 executor container(s).* was: Driver requested message is logged before the *Canceling* message but has the updated executor number. The messages are misleading. See log snippet: {code} 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 619 executor(s). 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor containers 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 0.0 (TID 386, 
lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 614 executor(s). 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor containers 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 bytes) {code} The easy solution is to update the message to use past sentence. This is consistent with the other messages there. *Canceled requests for 5 executor container(s).* > Spark Dyn Allocation Cancel log message misleading >
[jira] [Updated] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading
[ https://issues.apache.org/jira/browse/SPARK-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Ableda updated SPARK-15813: - Description: *Driver requested* message is logged before the *Canceling* message but has the updated executor number. The messages are misleading. See log snippet: {code} 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 619 executor(s). 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor containers 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 bytes) 
16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 614 executor(s). 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor containers 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 bytes) {code} The easy solution is to update the message to use past tense. This is consistent with the other messages there. *Canceled requests for 5 executor container(s).* was: *Driver requested* message is logged before the *Canceling* message but has the updated executor number. The messages are misleading. See log snippet: {code} 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 619 executor(s). 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor containers 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 0.0 (TID 386, 
lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 0.0 (TID 387, lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 614 executor(s). 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor containers 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 bytes) {code} The easy solution is to update the message to use past sentence. This is consistent with the other messages there. *Canceled requests for 5 executor container(s).* > Spark Dyn Allocation Cancel log message misleading >
[jira] [Created] (SPARK-15813) Spark Dyn Allocation Cancel log message misleading
Peter Ableda created SPARK-15813: Summary: Spark Dyn Allocation Cancel log message misleading Key: SPARK-15813 URL: https://issues.apache.org/jira/browse/SPARK-15813 Project: Spark Issue Type: Bug Reporter: Peter Ableda Priority: Trivial Driver requested message is logged before the *Canceling* message but has the updated executor number. The messages are misleading. See log snippet: {code} 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 619 executor(s). 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 4 executor containers 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 382.0 in stage 0.0 (TID 382) in 22 ms on lava-2.vpc.cloudera.com (382/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 383.0 in stage 0.0 (TID 383, lava-2.vpc.cloudera.com, partition 383,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 383.0 in stage 0.0 (TID 383) in 24 ms on lava-2.vpc.cloudera.com (383/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 384.0 in stage 0.0 (TID 384, lava-2.vpc.cloudera.com, partition 384,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 384.0 in stage 0.0 (TID 384) in 19 ms on lava-2.vpc.cloudera.com (384/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 385.0 in stage 0.0 (TID 385, lava-2.vpc.cloudera.com, partition 385,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 385.0 in stage 0.0 (TID 385) in 22 ms on lava-2.vpc.cloudera.com (385/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 386.0 in stage 0.0 (TID 386, lava-2.vpc.cloudera.com, partition 386,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Finished task 386.0 in stage 0.0 (TID 386) in 20 ms on lava-2.vpc.cloudera.com (386/1000) 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 387.0 in stage 0.0 (TID 387, 
lava-2.vpc.cloudera.com, partition 387,PROCESS_LOCAL, 1980 bytes) 16/06/07 18:53:48 INFO yarn.YarnAllocator: Driver requested a total number of 614 executor(s). 16/06/07 18:53:48 INFO yarn.YarnAllocator: Canceling requests for 5 executor containers 16/06/07 18:53:48 INFO scheduler.TaskSetManager: Starting task 388.0 in stage 0.0 (TID 388, lava-4.vpc.cloudera.com, partition 388,PROCESS_LOCAL, 1980 bytes) {code} The easy solution is to update the message to use past sentence. This is consistent with the other messages there. *Canceled requests for 5 executor container(s).*
[jira] [Closed] (SPARK-15755) java.lang.NullPointerException when run spark 2.0 setting spark.serializer=org.apache.spark.serializer.KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-15755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell closed SPARK-15755. - Resolution: Duplicate > java.lang.NullPointerException when run spark 2.0 setting > spark.serializer=org.apache.spark.serializer.KryoSerializer > - > > Key: SPARK-15755 > URL: https://issues.apache.org/jira/browse/SPARK-15755 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: marymwu > > java.lang.NullPointerException when run spark 2.0 setting > spark.serializer=org.apache.spark.serializer.KryoSerializer > 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) > at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) > at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649) > at java.util.PriorityQueue.siftUp(PriorityQueue.java:627) > at java.util.PriorityQueue.offer(PriorityQueue.java:329) > at java.util.PriorityQueue.add(PriorityQueue.java:306) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31) > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:711) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125) > ... 
15 more > 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) > at >
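The serialization trace above centers on Spark's BoundedPriorityQueue, the bounded top-N structure backing takeOrdered/top. Kryo rebuilds it by re-adding elements through the comparator (the code-generated ordering), which is where the NullPointerException surfaces (PriorityQueueSerializer.read → PriorityQueue.add → LazilyGeneratedOrdering.compare). A minimal Python analogue of the data structure, as a hedged sketch rather than Spark's implementation:

```python
# Hedged sketch: a bounded priority queue that keeps only the N largest
# elements seen, analogous to org.apache.spark.util.BoundedPriorityQueue.
import heapq

class BoundedPriorityQueue:
    def __init__(self, bound):
        self.bound = bound
        self._heap = []  # min-heap holding the N largest elements so far

    def offer(self, item):
        if len(self._heap) < self.bound:
            heapq.heappush(self._heap, item)
        elif item > self._heap[0]:
            # New item beats the smallest retained element; swap it in.
            heapq.heapreplace(self._heap, item)

    def to_sorted_list(self):
        return sorted(self._heap, reverse=True)

q = BoundedPriorityQueue(3)
for v in [5, 1, 9, 3, 7]:
    q.offer(v)
# q.to_sorted_list() -> [9, 7, 5]
```

Every `offer` goes through the comparison, which is why a comparator that is null after deserialization breaks the rebuild of the whole queue.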
[jira] [Commented] (SPARK-15802) SparkSQL connection fail using shell command "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default""
[ https://issues.apache.org/jira/browse/SPARK-15802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319908#comment-15319908 ] marymwu commented on SPARK-15802: - looking forward to your reply, thanks > SparkSQL connection fail using shell command "bin/beeline -u > "jdbc:hive2://*.*.*.*:10000/default"" > -- > > Key: SPARK-15802 > URL: https://issues.apache.org/jira/browse/SPARK-15802 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: marymwu > > Reproduce steps: > 1. execute shell "sbin/start-thriftserver.sh --master yarn"; > 2. execute shell "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default""; > Actual result: > SparkSQL connection failed and the log shows as follows: > 16/06/07 14:49:18 WARN HttpParser: Illegal character 0x1 in state=START for > buffer > HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type: > application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00} > 16/06/07 14:49:18 WARN HttpParser: badMessage: 400 Illegal character 0x1 for > HttpChannelOverHttp@718db102{r=0,c=false,a=IDLE,uri=} > 16/06/07 14:49:19 WARN HttpParser: Illegal character 0x1 in state=START for > buffer > HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type: > application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00} > 16/06/07 14:49:19 WARN HttpParser: badMessage: 400 Illegal character 0x1 for > HttpChannelOverHttp@195db217{r=0,c=false,a=IDLE,uri=} > note: > SparkSQL connection succeeded, if using shell command "bin/beeline -u > "jdbc:hive2://*.*.*.*:10000/default;transportMode=http;httpPath=cliservice"" > Two parameters (transportMode, httpPath) have been added.
[jira] [Commented] (SPARK-15802) SparkSQL connection fail using shell command "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default""
[ https://issues.apache.org/jira/browse/SPARK-15802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319906#comment-15319906 ] marymwu commented on SPARK-15802: - What's the right protocol? How do we specify it? > SparkSQL connection fail using shell command "bin/beeline -u > "jdbc:hive2://*.*.*.*:10000/default"" > -- > > Key: SPARK-15802 > URL: https://issues.apache.org/jira/browse/SPARK-15802 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: marymwu > > Reproduce steps: > 1. execute shell "sbin/start-thriftserver.sh --master yarn"; > 2. execute shell "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default""; > Actual result: > SparkSQL connection failed and the log shows as follows: > 16/06/07 14:49:18 WARN HttpParser: Illegal character 0x1 in state=START for > buffer > HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type: > application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00} > 16/06/07 14:49:18 WARN HttpParser: badMessage: 400 Illegal character 0x1 for > HttpChannelOverHttp@718db102{r=0,c=false,a=IDLE,uri=} > 16/06/07 14:49:19 WARN HttpParser: Illegal character 0x1 in state=START for > buffer > HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type: > application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00} > 16/06/07 14:49:19 WARN HttpParser: badMessage: 400 Illegal character 0x1 for > HttpChannelOverHttp@195db217{r=0,c=false,a=IDLE,uri=} > note: > SparkSQL connection succeeded, if using shell command "bin/beeline -u > "jdbc:hive2://*.*.*.*:10000/default;transportMode=http;httpPath=cliservice"" > Two parameters (transportMode, httpPath) have been added.
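The "Illegal character 0x1" HttpParser warnings are consistent with a binary-mode Thrift handshake hitting an HTTP endpoint: the working URL adds `transportMode=http;httpPath=cliservice`, which suggests the server was started in HTTP transport mode. A hedged sketch of the corresponding server-side configuration, assuming the Hive `hive-site.xml` property names apply to this deployment:

```xml
<!-- Hedged sketch: if the Thrift server runs in HTTP transport mode,
     clients must connect with transportMode=http and the matching
     httpPath; a plain binary-mode JDBC URL sends raw Thrift bytes
     (starting with 0x1) to the HTTP port, matching the warnings above. -->
<property>
  <name>hive.server2.transport.mode</name>
  <value>http</value>
</property>
<property>
  <name>hive.server2.thrift.http.path</name>
  <value>cliservice</value>
</property>
```

Either the server is switched back to binary transport, or clients keep the two extra URL parameters; the two sides simply have to agree.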
[jira] [Assigned] (SPARK-14485) Task finished cause fetch failure when its executor has already been removed by driver
[ https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-14485: -- Assignee: iward > Task finished cause fetch failure when its executor has already been removed > by driver > --- > > Key: SPARK-14485 > URL: https://issues.apache.org/jira/browse/SPARK-14485 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.5.2 >Reporter: iward >Assignee: iward > Fix For: 2.0.0 > > > Currently, when an executor is removed by the driver because of a heartbeat > timeout, the driver re-queues the tasks that were running on that executor and sends a kill > command to the cluster to kill the executor. > However, a running task on that executor may finish and return its result to the driver > before the kill command takes effect. In that case, the driver accepts the task-finished > event and ignores the speculative and re-queued copies of the task. But since the executor > has already been removed, the result of the finished task cannot be saved by the driver, > because its *BlockManagerId* has also been removed from *BlockManagerMaster*. The result > data of this stage is therefore incomplete, which then causes a fetch failure. 
> For example, the following is the task log: > {noformat} > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing > executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor > 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after > 256015 ms > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing > tasks for 322 from TaskSet 107.0 > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task > 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): > ExecutorLostFailure (executor 322 lost) > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: > 322 (epoch 11) > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: > Trying to remove executor 322 from BlockManagerMaster. > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed > 322 successfully in removeExecutor > {noformat} > {noformat} > 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task > 229.0 in stage 107.0 (TID 10384) in 272315 ms on > BJHC-HERA-16168.hadoop.jd.local (579/700) > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring > task-finished event for 229.1 in stage 107.0 because task 229 has already > completed successfully > {noformat} > {noformat} > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 > missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at > mapPartitions at Exchange.scala:137) > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task > set 107.1 with 3 tasks > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task > 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, > PROCESS_LOCAL, 3745 bytes) > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task > 1.0 in stage 107.1 (TID 
10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, > 3745 bytes) > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task > 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, > PROCESS_LOCAL, 3745 bytes) > {noformat} > The driver then detects that the stage's result is incomplete and resubmits the missing > tasks. But by this time the next stage has already started, because the previous stage was > marked finished (all of its tasks completed) even though its map output is incomplete. > {noformat} > 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task > 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): > FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message= > 2015-12-31 04:40:13 INFO > org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output > location for shuffle 11 > 2015-12-31 04:40:13 INFO at > org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385) > 2015-12-31 04:40:13 INFO at > org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382) > 2015-12-31 04:40:13 INFO at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > 2015-12-31 04:40:13 INFO at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > 2015-12-31 04:40:13 INFO at >
[jira] [Assigned] (SPARK-15755) java.lang.NullPointerException when run spark 2.0 setting spark.serializer=org.apache.spark.serializer.KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-15755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15755: Assignee: (was: Apache Spark) > java.lang.NullPointerException when run spark 2.0 setting > spark.serializer=org.apache.spark.serializer.KryoSerializer > - > > Key: SPARK-15755 > URL: https://issues.apache.org/jira/browse/SPARK-15755 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: marymwu > > java.lang.NullPointerException when run spark 2.0 setting > spark.serializer=org.apache.spark.serializer.KryoSerializer > 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) > at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) > at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649) > at java.util.PriorityQueue.siftUp(PriorityQueue.java:627) > at java.util.PriorityQueue.offer(PriorityQueue.java:329) > at java.util.PriorityQueue.add(PriorityQueue.java:306) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31) > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:711) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125) > ... 
15 more > 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) > at >
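The trace above fails inside `PriorityQueue.add` during deserialization: Kryo rebuilds the queue by re-inserting its elements, and every insert calls the comparator, here a `LazilyGeneratedOrdering` whose generated code has not been restored yet, hence the NPE. A pure-Python sketch of that mechanism (hypothetical classes, not Spark's code; Python's `AttributeError` on `None` plays the role of the NPE):

```python
import heapq

# Sketch of the failure mode: deserializing a priority queue re-inserts its
# elements, and each insert invokes the ordering. If the ordering's
# generated/transient state was not restored first, the comparison
# dereferences null -> the NullPointerException in the trace above.

class LazilyGeneratedOrdering:
    def __init__(self):
        self.generated = None  # would normally be rebuilt lazily after deserialization

    def compare(self, a, b):
        # AttributeError here stands in for the JVM NullPointerException
        return self.generated.compare(a, b)

class _Keyed:
    """Wraps an element so heapq comparisons go through the ordering."""
    def __init__(self, value, ordering):
        self.value, self.ordering = value, ordering
    def __lt__(self, other):
        return self.ordering.compare(self.value, other.value) < 0

def deserialize_queue(elements, ordering):
    heap = []
    for e in elements:
        # mimics PriorityQueueSerializer.read -> PriorityQueue.add -> siftUp -> compare
        heapq.heappush(heap, _Keyed(e, ordering))
    return heap

try:
    deserialize_queue([3, 1], LazilyGeneratedOrdering())
except AttributeError as e:
    print("compare called before the ordering was initialized:", e)
```

Note that a single-element queue deserializes fine (no comparison is needed), which is why the bug only surfaces for results like `takeOrdered` that ship a `BoundedPriorityQueue` with more than one element.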
[jira] [Assigned] (SPARK-15755) java.lang.NullPointerException when run spark 2.0 setting spark.serializer=org.apache.spark.serializer.KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-15755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15755: Assignee: Apache Spark > java.lang.NullPointerException when run spark 2.0 setting > spark.serializer=org.apache.spark.serializer.KryoSerializer > - > > Key: SPARK-15755 > URL: https://issues.apache.org/jira/browse/SPARK-15755 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: marymwu >Assignee: Apache Spark > > java.lang.NullPointerException when run spark 2.0 setting > spark.serializer=org.apache.spark.serializer.KryoSerializer > 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) > at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) > at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649) > at java.util.PriorityQueue.siftUp(PriorityQueue.java:627) > at java.util.PriorityQueue.offer(PriorityQueue.java:329) > at java.util.PriorityQueue.add(PriorityQueue.java:306) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31) > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:711) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125) > ... 
15 more > 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) > at
[jira] [Commented] (SPARK-15755) java.lang.NullPointerException when run spark 2.0 setting spark.serializer=org.apache.spark.serializer.KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-15755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319866#comment-15319866 ] Apache Spark commented on SPARK-15755: -- User 'marymwu' has created a pull request for this issue: https://github.com/apache/spark/pull/13550 > java.lang.NullPointerException when run spark 2.0 setting > spark.serializer=org.apache.spark.serializer.KryoSerializer > - > > Key: SPARK-15755 > URL: https://issues.apache.org/jira/browse/SPARK-15755 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: marymwu > > java.lang.NullPointerException when run spark 2.0 setting > spark.serializer=org.apache.spark.serializer.KryoSerializer > 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793) > at > 
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) > at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) > at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649) > at java.util.PriorityQueue.siftUp(PriorityQueue.java:627) > at java.util.PriorityQueue.offer(PriorityQueue.java:329) > at java.util.PriorityQueue.add(PriorityQueue.java:306) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31) > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:711) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125) > ... 
15 more > 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793) > at >
[jira] [Created] (SPARK-15812) Allow sorting on aggregated streaming dataframe when the output mode is Complete
Tathagata Das created SPARK-15812: - Summary: Allow sorting on aggregated streaming dataframe when the output mode is Complete Key: SPARK-15812 URL: https://issues.apache.org/jira/browse/SPARK-15812 Project: Spark Issue Type: Sub-task Reporter: Tathagata Das Assignee: Tathagata Das When the output mode is Complete, the output of a streaming aggregation contains the complete aggregates at every trigger, so it is effectively no different from a batch dataset within an incremental execution. Other non-streaming operations should therefore be supported on this dataset. This JIRA only adds support for sorting, as it is common and useful functionality; support for other operations will come later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
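A toy sketch of why sorting is safe in Complete mode (hypothetical code, not Spark's API): because each trigger emits the entire aggregate state rather than a delta, sorting the emitted table is exactly as well-defined as sorting a batch result.

```python
# Toy model of Complete output mode: every trigger emits the full aggregate
# state, so a sort over the output is deterministic per trigger.

from collections import Counter

state = Counter()  # the running aggregation state kept across triggers

def process_batch(rows):
    state.update(rows)
    # Complete mode: emit the *entire* aggregate, here sorted by count desc, key asc
    return sorted(state.items(), key=lambda kv: (-kv[1], kv[0]))

print(process_batch(["a", "b", "a"]))  # [('a', 2), ('b', 1)]
print(process_batch(["b", "b"]))       # [('b', 3), ('a', 2)]
```

In Append mode only new rows are emitted, so a global sort would be meaningless, which is why this support is scoped to Complete mode.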
[jira] [Resolved] (SPARK-15517) Add support for complete output mode
[ https://issues.apache.org/jira/browse/SPARK-15517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-15517. --- Resolution: Fixed Fix Version/s: 2.0.0 > Add support for complete output mode > - > > Key: SPARK-15517 > URL: https://issues.apache.org/jira/browse/SPARK-15517 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.0.0 > > > Currently structured streaming only supports append output mode. This task is > to do the following. > - Add support for complete output mode in the planner > - Add public API for users to specify output mode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15046) When running hive-thriftserver with yarn on a secure cluster the workers fail with java.lang.NumberFormatException
[ https://issues.apache.org/jira/browse/SPARK-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319817#comment-15319817 ] Jie Huang commented on SPARK-15046: --- OK, I see. Thanks [~tleftwich]. If so, it seems we'd better use the new config API, e.g.: {code:borderStyle=solid} sparkConf.get(TOKEN_RENEWAL_INTERVAL, (24 hours).toMillis) {code} > When running hive-thriftserver with yarn on a secure cluster the workers fail > with java.lang.NumberFormatException > -- > > Key: SPARK-15046 > URL: https://issues.apache.org/jira/browse/SPARK-15046 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Trystan Leftwich > > When running hive-thriftserver with yarn on a secure cluster > (spark.yarn.principal and spark.yarn.keytab are set) the workers fail with > the following error. > {code} > 16/04/30 22:40:50 ERROR yarn.ApplicationMaster: Uncaught exception: > java.lang.NumberFormatException: For input string: "86400079ms" > at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) > at java.lang.Long.parseLong(Long.java:441) > at java.lang.Long.parseLong(Long.java:483) > at > scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276) > at scala.collection.immutable.StringOps.toLong(StringOps.scala:29) > at > org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380) > at > org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380) > at scala.Option.map(Option.scala:146) > at org.apache.spark.SparkConf.getLong(SparkConf.scala:380) > at > org.apache.spark.deploy.SparkHadoopUtil.getTimeFromNowToRenewal(SparkHadoopUtil.scala:289) > at > org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.org$apache$spark$deploy$yarn$AMDelegationTokenRenewer$$scheduleRenewal$1(AMDelegationTokenRenewer.scala:89) > at > org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.scheduleLoginFromKeytab(AMDelegationTokenRenewer.scala:121) > at > 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:243) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:723) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66) > at > org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:721) > at > org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:748) > at > org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
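The stack trace pinpoints the bug: the renewal interval was stored as `"86400079ms"` (a unit-suffixed time string) but read back with `SparkConf.getLong`, which calls `Long.parseLong` and chokes on the suffix. A small Python sketch of the difference between the naive parse and a unit-aware one (function names here are illustrative, not Spark's actual API):

```python
import re

# Sketch of the failure and the fix direction: a bare integer parse rejects
# "86400079ms", while a time-aware getter strips/converts the unit suffix.

def parse_time_ms(s: str) -> int:
    """Parse strings like '86400079ms' or '24h' into milliseconds."""
    m = re.fullmatch(r"(\d+)\s*(ms|s|m|h|d)?", s.strip())
    if not m:
        raise ValueError(f"bad time string: {s!r}")
    value, unit = int(m.group(1)), m.group(2) or "ms"
    factor = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}
    return value * factor[unit]

print(parse_time_ms("86400079ms"))  # 86400079
try:
    int("86400079ms")               # the naive parse that produced the crash
except ValueError as e:
    print("naive parse fails:", e)
```

This matches Jie Huang's suggestion above: a time-typed config entry (with a default like 24 hours) keeps the units consistent between the writer and the reader of the setting.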
[jira] [Updated] (SPARK-15789) Allow reserved keywords in most places
[ https://issues.apache.org/jira/browse/SPARK-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-15789: Assignee: Herman van Hovell > Allow reserved keywords in most places > -- > > Key: SPARK-15789 > URL: https://issues.apache.org/jira/browse/SPARK-15789 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Herman van Hovell >Assignee: Herman van Hovell > Fix For: 2.0.0 > > > The current parser doesn't allow a number SQL keywords to be used as > identifiers (for tables and fields). We should allow this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15789) Allow reserved keywords in most places
[ https://issues.apache.org/jira/browse/SPARK-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-15789. - Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13534 [https://github.com/apache/spark/pull/13534] > Allow reserved keywords in most places > -- > > Key: SPARK-15789 > URL: https://issues.apache.org/jira/browse/SPARK-15789 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Herman van Hovell > Fix For: 2.0.0 > > > The current parser doesn't allow a number SQL keywords to be used as > identifiers (for tables and fields). We should allow this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
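The change described above amounts to shrinking the set of truly reserved words and letting every other keyword parse as an identifier. A toy sketch of that rule (illustrative only; the reserved/non-reserved split here is an assumption, not Spark's actual grammar):

```python
# Toy sketch of "allow reserved keywords in most places": keywords fall back
# to being identifiers unless they are in a small truly-reserved set.
# The word lists below are assumptions for illustration, not Spark's grammar.

RESERVED = {"select", "from", "where"}
KEYWORDS = RESERVED | {"sort", "order", "table", "count"}

def parse_identifier(token: str) -> str:
    """Accept plain identifiers and any non-reserved keyword as an identifier."""
    if token.lower() in RESERVED:
        raise SyntaxError(f"reserved keyword cannot be an identifier: {token}")
    return token  # keywords like 'sort' or 'count' are now valid column/table names

print(parse_identifier("count"))   # a keyword, but usable as a column name
print(parse_identifier("my_col"))  # ordinary identifier, unchanged
```

In an ANTLR-style grammar this is typically expressed as an `identifier : IDENT | nonReservedKeyword ;` rule, so most keywords become context-sensitive rather than globally reserved.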
[jira] [Updated] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franklyn Dsouza updated SPARK-15811: Description: I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following {code} ./dev/change-version-to-2.10.sh ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive {code} and then ran the following code in a pyspark shell {code:python} from pyspark.sql import SparkSession from pyspark.sql.types import IntegerType, StructField, StructType from pyspark.sql.functions import udf from pyspark.sql.types import Row spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate() add_one = udf(lambda x: x + 1, IntegerType()) schema = StructType([StructField('a', IntegerType(), False)]) df = spark.createDataFrame([Row(a=1),Row(a=2)], schema) df.select(add_one(df.a).alias('incremented')).collect() {code} This never returns with a result. was: I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following {code} ./dev/change-version-to-2.10.sh ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive {code} and then ran the following code in a pyspark shell {code:python} from pyspark.sql import SparkSession from pyspark.sql.types import IntegerType, StructField, StructType from pyspark.sql.functions import udf from pyspark.sql.types import Row spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate() add_one = udf(lambda x: x + 1, IntegerType()) schema = StructType([StructField('a', IntegerType(), False)]) df = spark.createDataFrame([Row(a=1),Row(a=2)], schema) df.select(add_one(df.a).alias('incremented')).collect() {code:xml} This never returns with a result. 
> UDFs do not work in Spark 2.0-preview built with scala 2.10 > --- > > Key: SPARK-15811 > URL: https://issues.apache.org/jira/browse/SPARK-15811 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Franklyn Dsouza >Priority: Blocker > Fix For: 2.0.0 > > > I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following > {code} > ./dev/change-version-to-2.10.sh > ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 > -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive > {code} > and then ran the following code in a pyspark shell > {code:python} > from pyspark.sql import SparkSession > from pyspark.sql.types import IntegerType, StructField, StructType > from pyspark.sql.functions import udf > from pyspark.sql.types import Row > spark = SparkSession.builder.master('local[4]').appName('2.0 > DF').getOrCreate() > add_one = udf(lambda x: x + 1, IntegerType()) > schema = StructType([StructField('a', IntegerType(), False)]) > df = spark.createDataFrame([Row(a=1),Row(a=2)], schema) > df.select(add_one(df.a).alias('incremented')).collect() > {code} > This never returns with a result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franklyn Dsouza updated SPARK-15811: Description: I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following {code} ./dev/change-version-to-2.10.sh ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive {code} and then ran the following code in a pyspark shell {code:python} from pyspark.sql import SparkSession from pyspark.sql.types import IntegerType, StructField, StructType from pyspark.sql.functions import udf from pyspark.sql.types import Row spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate() add_one = udf(lambda x: x + 1, IntegerType()) schema = StructType([StructField('a', IntegerType(), False)]) df = spark.createDataFrame([Row(a=1),Row(a=2)], schema) df.select(add_one(df.a).alias('incremented')).collect() {code:xml} This never returns with a result. was: I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following ./dev/change-version-to-2.10.sh ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive and then ran the following code in a pyspark shell from pyspark.sql import SparkSession from pyspark.sql.types import IntegerType, StructField, StructType from pyspark.sql.functions import udf from pyspark.sql.types import Row spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate() add_one = udf(lambda x: x + 1, IntegerType()) schema = StructType([StructField('a', IntegerType(), False)]) df = spark.createDataFrame([Row(a=1),Row(a=2)], schema) df.select(add_one(df.a).alias('incremented')).collect() This never returns with a result. 
> UDFs do not work in Spark 2.0-preview built with scala 2.10 > --- > > Key: SPARK-15811 > URL: https://issues.apache.org/jira/browse/SPARK-15811 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Franklyn Dsouza >Priority: Blocker > Fix For: 2.0.0 > > > I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following > {code} > ./dev/change-version-to-2.10.sh > ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 > -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive > {code} > and then ran the following code in a pyspark shell > {code:python} > from pyspark.sql import SparkSession > from pyspark.sql.types import IntegerType, StructField, StructType > from pyspark.sql.functions import udf > from pyspark.sql.types import Row > spark = SparkSession.builder.master('local[4]').appName('2.0 > DF').getOrCreate() > add_one = udf(lambda x: x + 1, IntegerType()) > schema = StructType([StructField('a', IntegerType(), False)]) > df = spark.createDataFrame([Row(a=1),Row(a=2)], schema) > df.select(add_one(df.a).alias('incremented')).collect() > {code:xml} > This never returns with a result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-15811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franklyn Dsouza updated SPARK-15811: Description: I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following {code} ./dev/change-version-to-2.10.sh ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive {code} and then ran the following code in a pyspark shell {code} from pyspark.sql import SparkSession from pyspark.sql.types import IntegerType, StructField, StructType from pyspark.sql.functions import udf from pyspark.sql.types import Row spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate() add_one = udf(lambda x: x + 1, IntegerType()) schema = StructType([StructField('a', IntegerType(), False)]) df = spark.createDataFrame([Row(a=1),Row(a=2)], schema) df.select(add_one(df.a).alias('incremented')).collect() {code} This never returns with a result. was: I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following {code} ./dev/change-version-to-2.10.sh ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive {code} and then ran the following code in a pyspark shell {code:python} from pyspark.sql import SparkSession from pyspark.sql.types import IntegerType, StructField, StructType from pyspark.sql.functions import udf from pyspark.sql.types import Row spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate() add_one = udf(lambda x: x + 1, IntegerType()) schema = StructType([StructField('a', IntegerType(), False)]) df = spark.createDataFrame([Row(a=1),Row(a=2)], schema) df.select(add_one(df.a).alias('incremented')).collect() {code} This never returns with a result. 
> UDFs do not work in Spark 2.0-preview built with scala 2.10 > --- > > Key: SPARK-15811 > URL: https://issues.apache.org/jira/browse/SPARK-15811 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Franklyn Dsouza >Priority: Blocker > Fix For: 2.0.0 > > > I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following > {code} > ./dev/change-version-to-2.10.sh > ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 > -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive > {code} > and then ran the following code in a pyspark shell > {code} > from pyspark.sql import SparkSession > from pyspark.sql.types import IntegerType, StructField, StructType > from pyspark.sql.functions import udf > from pyspark.sql.types import Row > spark = SparkSession.builder.master('local[4]').appName('2.0 > DF').getOrCreate() > add_one = udf(lambda x: x + 1, IntegerType()) > schema = StructType([StructField('a', IntegerType(), False)]) > df = spark.createDataFrame([Row(a=1),Row(a=2)], schema) > df.select(add_one(df.a).alias('incremented')).collect() > {code} > This never returns with a result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15811) UDFs do not work in Spark 2.0-preview built with scala 2.10
Franklyn Dsouza created SPARK-15811: --- Summary: UDFs do not work in Spark 2.0-preview built with scala 2.10 Key: SPARK-15811 URL: https://issues.apache.org/jira/browse/SPARK-15811 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Franklyn Dsouza Priority: Blocker Fix For: 2.0.0 I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following ./dev/change-version-to-2.10.sh ./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive and then ran the following code in a pyspark shell from pyspark.sql import SparkSession from pyspark.sql.types import IntegerType, StructField, StructType from pyspark.sql.functions import udf from pyspark.sql.types import Row spark = SparkSession.builder.master('local[4]').appName('2.0 DF').getOrCreate() add_one = udf(lambda x: x + 1, IntegerType()) schema = StructType([StructField('a', IntegerType(), False)]) df = spark.createDataFrame([Row(a=1),Row(a=2)], schema) df.select(add_one(df.a).alias('incremented')).collect() This never returns with a result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15804) Manually added metadata not saving with parquet
[ https://issues.apache.org/jira/browse/SPARK-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319706#comment-15319706 ] kevin yu commented on SPARK-15804: -- I will submit a PR soon. Thanks. > Manually added metadata not saving with parquet > --- > > Key: SPARK-15804 > URL: https://issues.apache.org/jira/browse/SPARK-15804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Charlie Evans > > Adding metadata with col().as(_, metadata) then saving the resultant > dataframe does not save the metadata. No error is thrown. Only see the schema > contains the metadata before saving and does not contain the metadata after > saving and loading the dataframe. Was working fine with 1.6.1. > {code} > case class TestRow(a: String, b: Int) > val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil > val df = spark.createDataFrame(rows) > import org.apache.spark.sql.types.MetadataBuilder > val md = new MetadataBuilder().putString("key", "value").build() > val dfWithMeta = df.select(col("a"), col("b").as("b", md)) > println(dfWithMeta.schema.json) > dfWithMeta.write.parquet("dfWithMeta") > val dfWithMeta2 = spark.read.parquet("dfWithMeta") > println(dfWithMeta2.schema.json) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15580) Add ContinuousQueryInfo to make ContinuousQueryListener events serializable
[ https://issues.apache.org/jira/browse/SPARK-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-15580. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13335 [https://github.com/apache/spark/pull/13335] > Add ContinuousQueryInfo to make ContinuousQueryListener events serializable > --- > > Key: SPARK-15580 > URL: https://issues.apache.org/jira/browse/SPARK-15580 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14485) Task finished cause fetch failure when its executor has already been removed by driver
[ https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-14485. Resolution: Fixed Fix Version/s: 2.0.0 > Task finished cause fetch failure when its executor has already been removed > by driver > --- > > Key: SPARK-14485 > URL: https://issues.apache.org/jira/browse/SPARK-14485 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.5.2 >Reporter: iward > Fix For: 2.0.0 > > > When an executor is removed by the driver after a heartbeat timeout, the driver > re-queues the tasks that were running on that executor and sends a kill command to > the cluster to stop it. > In some cases, however, a task on that executor finishes and returns its result > to the driver before the executor is actually killed. > The driver then accepts the task-finished event and ignores the speculative and > re-queued copies of that task. But since the executor has already been removed, > the finished task's result cannot be registered with the driver: its > *BlockManagerId* has already been removed from the *BlockManagerMaster*. > The stage's result data is therefore incomplete, which later causes a fetch failure. 
> For example, the following is the task log: > {noformat} > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing > executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor > 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after > 256015 ms > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing > tasks for 322 from TaskSet 107.0 > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task > 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): > ExecutorLostFailure (executor 322 lost) > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: > 322 (epoch 11) > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: > Trying to remove executor 322 from BlockManagerMaster. > 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed > 322 successfully in removeExecutor > {noformat} > {noformat} > 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task > 229.0 in stage 107.0 (TID 10384) in 272315 ms on > BJHC-HERA-16168.hadoop.jd.local (579/700) > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring > task-finished event for 229.1 in stage 107.0 because task 229 has already > completed successfully > {noformat} > {noformat} > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 > missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at > mapPartitions at Exchange.scala:137) > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task > set 107.1 with 3 tasks > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task > 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, > PROCESS_LOCAL, 3745 bytes) > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task > 1.0 in stage 107.1 (TID 
10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, > 3745 bytes) > 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task > 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, > PROCESS_LOCAL, 3745 bytes) > {noformat} > The driver detects that the stage's result is incomplete and submits the missing > tasks, but by this point the next stage has already started, because the previous > stage was considered finished (all of its tasks completed) even though its result > data is incomplete. > {noformat} > 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task > 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): > FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message= > 2015-12-31 04:40:13 INFO > org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output > location for shuffle 11 > 2015-12-31 04:40:13 INFO at > org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385) > 2015-12-31 04:40:13 INFO at > org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382) > 2015-12-31 04:40:13 INFO at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > 2015-12-31 04:40:13 INFO at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > 2015-12-31 04:40:13 INFO at >
[jira] [Commented] (SPARK-11106) Should ML Models contains single models or Pipelines?
[ https://issues.apache.org/jira/browse/SPARK-11106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319685#comment-15319685 ] Xusen Yin commented on SPARK-11106: --- RFormula is easy to use, but it may not always do the right thing. For example, RFormula indexes categorical features with OneHotEncoder, but in some scenarios (like RandomForest), a StringIndexer is better. > Should ML Models contains single models or Pipelines? > - > > Key: SPARK-11106 > URL: https://issues.apache.org/jira/browse/SPARK-11106 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > This JIRA is for discussing whether ML Estimators should do feature > processing. > h2. Issue > Currently, almost all ML Estimators require strict input types. E.g., > DecisionTreeClassifier requires that the label column be Double type and have > metadata indicating the number of classes. > This requires users to know how to preprocess data. > h2. Ideal workflow > A user should be able to pass any reasonable data to a Transformer or > Estimator and have it "do the right thing." > E.g.: > * If DecisionTreeClassifier is given a String column for labels, it should > know to index the Strings. > * See [SPARK-10513] for a similar issue with OneHotEncoder. > h2. Possible solutions > There are a few solutions I have thought of. Please comment with feedback or > alternative ideas! > h3. Leave as is > Pro: The current setup is good in that it forces the user to be very aware of > what they are doing. Feature transformations will not happen silently. > Con: The user has to write boilerplate code for transformations. The API is > not what some users would expect; e.g., coming from R, a user might expect > some automatic transformations. > h3. All Transformers can contain PipelineModels > We could allow all Transformers and Models to contain arbitrary > PipelineModels. 
E.g., if a DecisionTreeClassifier were given a String label > column, it might return a Model which contains a simple fitted PipelineModel > containing StringIndexer + DecisionTreeClassificationModel. > The API could present this to the user, or it could be hidden from the user. > Ideally, it would be hidden from the beginner user, but accessible for > experts. > The main problem is that we might have to break APIs. E.g., OneHotEncoder > may need to do indexing if given a String input column. This means it should > no longer be a Transformer; it should be an Estimator. > h3. All Estimators should use RFormula > The best option I have thought of is to make RFormula be the primary method > for automatic feature transformation. We could start adding an RFormula > Param to all Estimators, and it could handle most of these feature > transformation issues. > We could maintain old APIs: > * If a user sets the input column names, then those can be used in the > traditional (no automatic transformation) way. > * If a user sets the RFormula Param, then it can be used instead. (This > should probably take precedence over the old API.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
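To make the comment above concrete, the label-indexing behavior at issue can be sketched in a few lines of plain Python. This is an illustrative model only, not Spark ML's actual StringIndexer implementation:

```python
from collections import Counter

def string_indexer(labels):
    # Assign indices by descending label frequency, analogous to
    # Spark ML's StringIndexer default ordering.
    ordered = [s for s, _ in Counter(labels).most_common()]
    index = {s: i for i, s in enumerate(ordered)}
    return [index[s] for s in labels]

# "b" is the most frequent label, so it is assigned index 0
indexed = string_indexer(["a", "b", "b", "c", "b"])
```

A StringIndexer preserves one column per categorical feature, whereas one-hot encoding expands each category into its own binary column; tree-based learners like RandomForest can split directly on the single indexed column, which is the scenario the comment refers to.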
[jira] [Comment Edited] (SPARK-15780) Support mapValues on KeyValueGroupedDataset
[ https://issues.apache.org/jira/browse/SPARK-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319594#comment-15319594 ] koert kuipers edited comment on SPARK-15780 at 6/7/16 10:34 PM: also see this discussion: https://www.mail-archive.com/user@spark.apache.org/msg51915.html was (Author: koert): also see this discussion: https://mail.google.com/mail/u/0/#label/Active/1552c23b293b1ac8 > Support mapValues on KeyValueGroupedDataset > --- > > Key: SPARK-15780 > URL: https://issues.apache.org/jira/browse/SPARK-15780 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: koert kuipers >Priority: Minor > > Currently when doing groupByKey on a Dataset the key ends up in the values > which can be clumsy: > {noformat} > val ds: Dataset[(K, V)] = ... > val grouped: KeyValueGroupedDataset[(K, (K, V))] = ds.groupByKey(_._1) > {noformat} > With mapValues one can create something more similar to PairRDDFunctions[K, > V]: > {noformat} > val ds: Dataset[(K, V)] = ... > val grouped: KeyValueGroupedDataset[(K, V)] = > ds.groupByKey(_._1).mapValues(_._2) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15780) Support mapValues on KeyValueGroupedDataset
[ https://issues.apache.org/jira/browse/SPARK-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319594#comment-15319594 ] koert kuipers commented on SPARK-15780: --- also see this discussion: https://mail.google.com/mail/u/0/#label/Active/1552c23b293b1ac8 > Support mapValues on KeyValueGroupedDataset > --- > > Key: SPARK-15780 > URL: https://issues.apache.org/jira/browse/SPARK-15780 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: koert kuipers >Priority: Minor > > Currently when doing groupByKey on a Dataset the key ends up in the values > which can be clumsy: > {noformat} > val ds: Dataset[(K, V)] = ... > val grouped: KeyValueGroupedDataset[(K, (K, V))] = ds.groupByKey(_._1) > {noformat} > With mapValues one can create something more similar to PairRDDFunctions[K, > V]: > {noformat} > val ds: Dataset[(K, V)] = ... > val grouped: KeyValueGroupedDataset[(K, V)] = > ds.groupByKey(_._1).mapValues(_._2) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
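As a plain-Python analogy of the proposal (illustrative only; the actual API would live on KeyValueGroupedDataset), the difference between grouping alone and grouping plus a mapValues step is:

```python
from collections import defaultdict

def group_by_key(pairs, key_fn):
    # Group records by a derived key; the full record stays in each group,
    # mirroring Dataset.groupByKey.
    groups = defaultdict(list)
    for p in pairs:
        groups[key_fn(p)].append(p)
    return dict(groups)

def map_values(groups, f):
    # The proposed mapValues step: transform each grouped value,
    # e.g. to strip the key back out of the value tuples.
    return {k: [f(v) for v in vs] for k, vs in groups.items()}

pairs = [("a", 1), ("a", 2), ("b", 3)]
grouped = group_by_key(pairs, lambda kv: kv[0])
# without mapValues, the key is still embedded in every value tuple
cleaned = map_values(grouped, lambda kv: kv[1])
# with mapValues, the values are just the payloads
```

This mirrors why the reporter calls the current behavior clumsy: without mapValues the key appears both as the group key and inside every grouped value.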
[jira] [Assigned] (SPARK-14816) Update MLlib, GraphX, SparkR websites for 2.0
[ https://issues.apache.org/jira/browse/SPARK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-14816: --- Assignee: Yanbo Liang > Update MLlib, GraphX, SparkR websites for 2.0 > - > > Key: SPARK-14816 > URL: https://issues.apache.org/jira/browse/SPARK-14816 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib, SparkR >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Blocker > > Update the sub-projects' websites to include new features in this release. > For MLlib, make it clear that the DataFrame-based API is the primary one now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13590) Document the behavior of spark.ml logistic regression and AFT survival regression when there are constant features
[ https://issues.apache.org/jira/browse/SPARK-13590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-13590. - Resolution: Fixed Fix Version/s: 2.0.0 > Document the behavior of spark.ml logistic regression and AFT survival > regression when there are constant features > -- > > Key: SPARK-13590 > URL: https://issues.apache.org/jira/browse/SPARK-13590 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.0.0 > > > As discussed in SPARK-13029, we decided to keep the current behavior that > sets all coefficients associated with constant feature columns to zero, > regardless of intercept, regularization, and standardization settings. This > is the same behavior as in glmnet. Since this is different from LIBSVM, we > should document the behavior correctly, add tests, and generate warning > messages if there are constant columns and `addIntercept` is false. > cc [~coderxiang] [~dbtsai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14816) Update MLlib, GraphX, SparkR websites for 2.0
[ https://issues.apache.org/jira/browse/SPARK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-14816: Assignee: (was: Yanbo Liang) > Update MLlib, GraphX, SparkR websites for 2.0 > - > > Key: SPARK-14816 > URL: https://issues.apache.org/jira/browse/SPARK-14816 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib, SparkR >Reporter: Joseph K. Bradley >Priority: Blocker > > Update the sub-projects' websites to include new features in this release. > For MLlib, make it clear that the DataFrame-based API is the primary one now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15674) Deprecates "CREATE TEMPORARY TABLE USING...", use "CREATE TEMPORARY VIEW USING..." instead.
[ https://issues.apache.org/jira/browse/SPARK-15674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-15674. --- Resolution: Resolved Assignee: Sean Zhong > Deprecates "CREATE TEMPORARY TABLE USING...", use "CREATE TEMPORARY VIEW > USING..." instead. > --- > > Key: SPARK-15674 > URL: https://issues.apache.org/jira/browse/SPARK-15674 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Sean Zhong >Assignee: Sean Zhong >Priority: Minor > > The current implementation of "CREATE TEMPORARY TABLE USING..." is actually > creating a temporary VIEW behind the scene. > We probably should just use "CREATE TEMPORARY VIEW USING..." instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15810) Aggregator doesn't play nice with Option
[ https://issues.apache.org/jira/browse/SPARK-15810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-15810: -- Description: {noformat} val ds1 = List(("a", 1), ("a", 2), ("a", 3)).toDS val ds2 = ds1.map{ case (k, v) => (k, if (v > 1) Some(v) else None) } val ds3 = ds2.groupByKey(_._1).agg(new Aggregator[(String, Option[Int]), Option[Int], Option[Int]]{ def zero: Option[Int] = None def reduce(b: Option[Int], a: (String, Option[Int])): Option[Int] = b.map(bv => a._2.map(av => bv + av).getOrElse(bv)).orElse(a._2) def merge(b1: Option[Int], b2: Option[Int]): Option[Int] = b1.map(b1v => b2.map(b2v => b1v + b2v).getOrElse(b1v)).orElse(b2) def finish(reduction: Option[Int]): Option[Int] = reduction def bufferEncoder: Encoder[Option[Int]] = implicitly[Encoder[Option[Int]]] def outputEncoder: Encoder[Option[Int]] = implicitly[Encoder[Option[Int]]] }.toColumn) ds3.printSchema ds3.show {noformat} i get as output a somewhat odd looking schema, and after that the program just hangs pinning one cpu at 100%. the data never shows. 
output: {noformat} root |-- value: string (nullable = true) |-- $anon$1(scala.Tuple2): struct (nullable = true) ||-- value: integer (nullable = true) {noformat} was: {noformat} val ds1 = List(("a", 1), ("a", 2), ("a", 3)).toDS val df1 = ds1.map{ case (k, v) => (k, if (v > 1) Some(v) else None) }.toDF("k", "v") val df2 = df1.groupBy("k").agg(new Aggregator[(String, Option[Int]), Option[Int], Option[Int]]{ def zero: Option[Int] = None def reduce(b: Option[Int], a: (String, Option[Int])): Option[Int] = b.map(bv => a._2.map(av => bv + av).getOrElse(bv)).orElse(a._2) def merge(b1: Option[Int], b2: Option[Int]): Option[Int] = b1.map(b1v => b2.map(b2v => b1v + b2v).getOrElse(b1v)).orElse(b2) def finish(reduction: Option[Int]): Option[Int] = reduction def bufferEncoder: Encoder[Option[Int]] = implicitly[Encoder[Option[Int]]] def outputEncoder: Encoder[Option[Int]] = implicitly[Encoder[Option[Int]]] }.toColumn) df2.printSchema df2.show {noformat} i get as output a somewhat odd looking schema, and after that the program just hangs pinning one cpu at 100%. the data never shows. 
output: {noformat} root |-- k: string (nullable = true) |-- $anon$1(org.apache.spark.sql.Row): struct (nullable = true) ||-- value: integer (nullable = true) {noformat} > Aggregator doesn't play nice with Option > > > Key: SPARK-15810 > URL: https://issues.apache.org/jira/browse/SPARK-15810 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: spark 2.0.0-SNAPSHOT >Reporter: koert kuipers > > {noformat} > val ds1 = List(("a", 1), ("a", 2), ("a", 3)).toDS > val ds2 = ds1.map{ case (k, v) => (k, if (v > 1) Some(v) else None) } > val ds3 = ds2.groupByKey(_._1).agg(new Aggregator[(String, > Option[Int]), Option[Int], Option[Int]]{ > def zero: Option[Int] = None > def reduce(b: Option[Int], a: (String, Option[Int])): Option[Int] = > b.map(bv => a._2.map(av => bv + av).getOrElse(bv)).orElse(a._2) > def merge(b1: Option[Int], b2: Option[Int]): Option[Int] = b1.map(b1v > => b2.map(b2v => b1v + b2v).getOrElse(b1v)).orElse(b2) > def finish(reduction: Option[Int]): Option[Int] = reduction > def bufferEncoder: Encoder[Option[Int]] = > implicitly[Encoder[Option[Int]]] > def outputEncoder: Encoder[Option[Int]] = > implicitly[Encoder[Option[Int]]] > }.toColumn) > ds3.printSchema > ds3.show > {noformat} > i get as output a somewhat odd looking schema, and after that the program > just hangs pinning one cpu at 100%. the data never shows. > output: > {noformat} > root > |-- value: string (nullable = true) > |-- $anon$1(scala.Tuple2): struct (nullable = true) > ||-- value: integer (nullable = true) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15810) Aggregator doesn't play nice with Option
koert kuipers created SPARK-15810: - Summary: Aggregator doesn't play nice with Option Key: SPARK-15810 URL: https://issues.apache.org/jira/browse/SPARK-15810 Project: Spark Issue Type: Bug Components: SQL Environment: spark 2.0.0-SNAPSHOT Reporter: koert kuipers {noformat} val ds1 = List(("a", 1), ("a", 2), ("a", 3)).toDS val df1 = ds1.map{ case (k, v) => (k, if (v > 1) Some(v) else None) }.toDF("k", "v") val df2 = df1.groupBy("k").agg(new Aggregator[(String, Option[Int]), Option[Int], Option[Int]]{ def zero: Option[Int] = None def reduce(b: Option[Int], a: (String, Option[Int])): Option[Int] = b.map(bv => a._2.map(av => bv + av).getOrElse(bv)).orElse(a._2) def merge(b1: Option[Int], b2: Option[Int]): Option[Int] = b1.map(b1v => b2.map(b2v => b1v + b2v).getOrElse(b1v)).orElse(b2) def finish(reduction: Option[Int]): Option[Int] = reduction def bufferEncoder: Encoder[Option[Int]] = implicitly[Encoder[Option[Int]]] def outputEncoder: Encoder[Option[Int]] = implicitly[Encoder[Option[Int]]] }.toColumn) df2.printSchema df2.show {noformat} i get as output a somewhat odd looking schema, and after that the program just hangs pinning one cpu at 100%. the data never shows. output: {noformat} root |-- k: string (nullable = true) |-- $anon$1(org.apache.spark.sql.Row): struct (nullable = true) ||-- value: integer (nullable = true) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
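Setting the encoder problem aside, the Aggregator's reduce and merge logic translates directly to plain Python if Option is modeled as None. This is an illustrative translation of the intended semantics, not runnable Spark code:

```python
def reduce_opt(b, a):
    # b: running buffer (int or None); a: (key, int-or-None) record.
    # Mirrors b.map(bv => a._2.map(av => bv + av).getOrElse(bv)).orElse(a._2)
    if b is None:
        return a[1]
    if a[1] is None:
        return b
    return b + a[1]

def merge_opt(b1, b2):
    # Mirrors b1.map(b1v => b2.map(b2v => b1v + b2v).getOrElse(b1v)).orElse(b2)
    if b1 is None:
        return b2
    if b2 is None:
        return b1
    return b1 + b2

# The records from the example: v > 1 becomes Some(v), else None
rows = [("a", None), ("a", 2), ("a", 3)]
buf = None
for row in rows:
    buf = reduce_opt(buf, row)
# buf ends up as 5: the None contributes nothing, 2 + 3 = 5
```

So the expected output for key "a" is Some(5); the reported hang occurs before any such result is produced.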
[jira] [Commented] (SPARK-9623) RandomForestRegressor: provide variance of predictions
[ https://issues.apache.org/jira/browse/SPARK-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319450#comment-15319450 ] Manoj Kumar commented on SPARK-9623: [~yanboliang] Are you still working on this? Would you mind if I take over? > RandomForestRegressor: provide variance of predictions > -- > > Key: SPARK-9623 > URL: https://issues.apache.org/jira/browse/SPARK-9623 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Variance of predicted value, as estimated from training data. > Analogous to class probabilities for classification. > See [SPARK-3727] for discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14279) Improve the spark build to pick the version information from the pom file and add git commit information
[ https://issues.apache.org/jira/browse/SPARK-14279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-14279: --- Fix Version/s: (was: 2.1.0) 2.0.0 > Improve the spark build to pick the version information from the pom file and > add git commit information > > > Key: SPARK-14279 > URL: https://issues.apache.org/jira/browse/SPARK-14279 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Sanket Reddy >Assignee: Dhruve Ashar >Priority: Minor > Fix For: 2.0.0 > > > Right now the spark-submit --version and other parts of the code pick up > version information from a static SPARK_VERSION. We would want to pick the > version from the pom.version probably stored inside a properties file. Also, > it might be nice to have other details like branch, build information and > other specific details when having a spark-submit --version > Note, the motivation is to more easily tie this to automated continuous > integration and deployment and to easily have traceability. > Part of this is right now you have to manually change a java file to change > the version that comes out when you run spark-submit --version. With > continuous integration the build numbers could be something like 1.6.1.X > (where X increments on each change) and I want to see the exact version > easily. Having to manually change a java file makes that hard. obviously that > should make the apache spark releases easier as you don't have to manually > change this file as well. > The other important part for me is the git information. This easily lets me > trace it back to exact commits. We have a multi-tenant YARN cluster and users > can run many different versions at once. I want to be able to see exactly > which version they are running. The reason to know exact version can range > from helping debug some problem to making sure someone didn't hack something > in Spark to cause bad things (generally they should use approved version), > etc. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15809) PySpark SQL UDF default returnType
Vladimir Feinberg created SPARK-15809: - Summary: PySpark SQL UDF default returnType Key: SPARK-15809 URL: https://issues.apache.org/jira/browse/SPARK-15809 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Vladimir Feinberg Priority: Minor The current signature for the pyspark UDF creation function is: {code:python} pyspark.sql.functions.udf(f, returnType=StringType) {code} Is there a reason that there's a default parameter for {{returnType}}? Returning a string by default doesn't strike me as so much more frequent a use case than, say, returning an integer as to merit the default. In fact, it seems the only reason that the default was chosen is that if we *had to choose* a default type, it would be a {{StringType}} because that's what we can implicitly convert everything to. But this only seems to do two things to me: (1) cause unintentional, annoying conversions to strings for new users and (2) make call sites less consistent (if people drop the type specification to actually use the default). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
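The "unintentional conversions to strings" concern can be illustrated with a toy model of declared-type coercion. This is plain Python; `make_udf` is a hypothetical stand-in, not PySpark's actual implementation (which may instead yield nulls on a type mismatch, depending on version):

```python
def make_udf(f, return_type=str):
    # Toy model of udf(f, returnType=StringType): the UDF's result
    # is coerced to the declared return type.
    def wrapper(*args):
        result = f(*args)
        return None if result is None else return_type(result)
    return wrapper

add_one = make_udf(lambda x: x + 1)           # default: the result becomes a string
add_one_int = make_udf(lambda x: x + 1, int)  # explicit type: the result stays an int
```

The silent default means `add_one(1)` yields "2" rather than 2, which is exactly the kind of surprise the reporter wants the API to force callers to think about.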
[jira] [Assigned] (SPARK-15808) Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats
[ https://issues.apache.org/jira/browse/SPARK-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15808: Assignee: Apache Spark > Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to > Mismatched File Formats > --- > > Key: SPARK-15808 > URL: https://issues.apache.org/jira/browse/SPARK-15808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Example 1: PARQUET -> CSV > {noformat} > createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc") > createDF(10, > 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc") > {noformat} > Error we got: > {noformat} > Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): > java.lang.RuntimeException: > file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc > is not a Parquet file. 
expected magic number at tail [80, 65, 82, 49] but > found [79, 82, 67, 23] > {noformat} > Example 2: Json -> CSV > {noformat} > createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV") > createDF(10, > 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV") > {noformat} > No exception, but wrong results: > {noformat} > +++ > | c1| c2| > +++ > |null|null| > |null|null| > |null|null| > |null|null| > | 0|str0| > | 1|str1| > | 2|str2| > | 3|str3| > | 4|str4| > | 5|str5| > | 6|str6| > | 7|str7| > | 8|str8| > | 9|str9| > +++ > {noformat} > Example 3: Json -> Text > {noformat} > createDF(0, 9).write.format("json").saveAsTable("appendJsonToText") > createDF(10, > 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText") > {noformat} > Error we got: > {noformat} > Text data source supports only a single column, and you have 2 columns. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
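The byte arrays quoted in the error above are ASCII file-format magic numbers. Decoding them (plain Python) shows exactly why the Parquet reader rejected the file: the table directory contained an ORC file:

```python
# Decode the magic-number byte arrays from the RuntimeException message.
expected = bytes([80, 65, 82, 49])  # what a Parquet file must end with
found = bytes([79, 82, 67, 23])     # what was actually at the file's tail

print(expected.decode("ascii"))   # 'PAR1' -- the Parquet magic number
print(found[:3].decode("ascii"))  # 'ORC'  -- the ORC magic (4th byte is not text)
```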
[jira] [Commented] (SPARK-15808) Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats
[ https://issues.apache.org/jira/browse/SPARK-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319156#comment-15319156 ] Apache Spark commented on SPARK-15808: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/13546 > Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to > Mismatched File Formats > --- > > Key: SPARK-15808 > URL: https://issues.apache.org/jira/browse/SPARK-15808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Example 1: PARQUET -> CSV > {noformat} > createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc") > createDF(10, > 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc") > {noformat} > Error we got: > {noformat} > Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): > java.lang.RuntimeException: > file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc > is not a Parquet file. 
expected magic number at tail [80, 65, 82, 49] but > found [79, 82, 67, 23] > {noformat} > Example 2: Json -> CSV > {noformat} > createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV") > createDF(10, > 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV") > {noformat} > No exception, but wrong results: > {noformat} > +++ > | c1| c2| > +++ > |null|null| > |null|null| > |null|null| > |null|null| > | 0|str0| > | 1|str1| > | 2|str2| > | 3|str3| > | 4|str4| > | 5|str5| > | 6|str6| > | 7|str7| > | 8|str8| > | 9|str9| > +++ > {noformat} > Example 3: Json -> Text > {noformat} > createDF(0, 9).write.format("json").saveAsTable("appendJsonToText") > createDF(10, > 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText") > {noformat} > Error we got: > {noformat} > Text data source supports only a single column, and you have 2 columns. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15808) Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats
[ https://issues.apache.org/jira/browse/SPARK-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15808: Assignee: (was: Apache Spark) > Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to > Mismatched File Formats > --- > > Key: SPARK-15808 > URL: https://issues.apache.org/jira/browse/SPARK-15808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Example 1: PARQUET -> CSV > {noformat} > createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc") > createDF(10, > 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc") > {noformat} > Error we got: > {noformat} > Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): > java.lang.RuntimeException: > file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc > is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but > found [79, 82, 67, 23] > {noformat} > Example 2: Json -> CSV > {noformat} > createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV") > createDF(10, > 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV") > {noformat} > No exception, but wrong results: > {noformat} > +++ > | c1| c2| > +++ > |null|null| > |null|null| > |null|null| > |null|null| > | 0|str0| > | 1|str1| > | 2|str2| > | 3|str3| > | 4|str4| > | 5|str5| > | 6|str6| > | 7|str7| > | 8|str8| > | 9|str9| > +++ > {noformat} > Example 3: Json -> Text > {noformat} > createDF(0, 9).write.format("json").saveAsTable("appendJsonToText") > createDF(10, > 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText") > {noformat} > Error we got: > {noformat} > Text data source supports only a single column, and you have 2 columns. 
> {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15808) Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats
Xiao Li created SPARK-15808: --- Summary: Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats Key: SPARK-15808 URL: https://issues.apache.org/jira/browse/SPARK-15808 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li Example 1: PARQUET -> CSV {noformat} createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc") createDF(10, 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc") {noformat} Error we got: {noformat} Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.RuntimeException: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [79, 82, 67, 23] {noformat} Example 2: Json -> CSV createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV") createDF(10, 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV") No exception, but wrong results: {noformat} +++ | c1| c2| +++ |null|null| |null|null| |null|null| |null|null| | 0|str0| | 1|str1| | 2|str2| | 3|str3| | 4|str4| | 5|str5| | 6|str6| | 7|str7| | 8|str8| | 9|str9| +++ {noformat} Example 3: Json -> Text createDF(0, 9).write.format("json").saveAsTable("appendJsonToText") createDF(10, 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText") Error we got: {noformat} Text data source supports only a single column, and you have 2 columns. {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15808) Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats
[ https://issues.apache.org/jira/browse/SPARK-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-15808: Description: Example 1: PARQUET -> CSV {noformat} createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc") createDF(10, 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc") {noformat} Error we got: {noformat} Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.RuntimeException: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [79, 82, 67, 23] {noformat} Example 2: Json -> CSV createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV") createDF(10, 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV") No exception, but wrong results: {noformat} +++ | c1| c2| +++ |null|null| |null|null| |null|null| |null|null| | 0|str0| | 1|str1| | 2|str2| | 3|str3| | 4|str4| | 5|str5| | 6|str6| | 7|str7| | 8|str8| | 9|str9| +++ {noformat} Example 3: Json -> Text {noformat} createDF(0, 9).write.format("json").saveAsTable("appendJsonToText") createDF(10, 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText") {noformat} Error we got: {noformat} Text data source supports only a single column, and you have 2 columns. 
{noformat} was: Example 1: PARQUET -> CSV {noformat} createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc") createDF(10, 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc") {noformat} Error we got: {noformat} Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.RuntimeException: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [79, 82, 67, 23] {noformat} Example 2: Json -> CSV createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV") createDF(10, 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV") No exception, but wrong results: {noformat} +++ | c1| c2| +++ |null|null| |null|null| |null|null| |null|null| | 0|str0| | 1|str1| | 2|str2| | 3|str3| | 4|str4| | 5|str5| | 6|str6| | 7|str7| | 8|str8| | 9|str9| +++ {noformat} Example 3: Json -> Text createDF(0, 9).write.format("json").saveAsTable("appendJsonToText") createDF(10, 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText") Error we got: {noformat} Text data source supports only a single column, and you have 2 columns. 
{noformat} > Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to > Mismatched File Formats > --- > > Key: SPARK-15808 > URL: https://issues.apache.org/jira/browse/SPARK-15808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Example 1: PARQUET -> CSV > {noformat} > createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc") > createDF(10, > 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc") > {noformat} > Error we got: > {noformat} > Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): > java.lang.RuntimeException: > file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc > is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but > found [79, 82, 67, 23] > {noformat} > Example 2: Json -> CSV > createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV") > createDF(10, > 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV") > No exception, but wrong results: > {noformat} > +++ > | c1| c2| > +++ > |null|null| > |null|null| > |null|null| > |null|null| > | 0|str0| > | 1|str1| > | 2|str2| > | 3|str3| > | 4|str4| > | 5|str5| > | 6|str6| > | 7|str7| > | 8|str8| > | 9|str9| > +++ > {noformat} > Example 3: Json -> Text > {noformat} > createDF(0, 9).write.format("json").saveAsTable("appendJsonToText") > createDF(10,
[jira] [Updated] (SPARK-15808) Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to Mismatched File Formats
[ https://issues.apache.org/jira/browse/SPARK-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-15808: Description: Example 1: PARQUET -> CSV {noformat} createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc") createDF(10, 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc") {noformat} Error we got: {noformat} Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.RuntimeException: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [79, 82, 67, 23] {noformat} Example 2: Json -> CSV {noformat} createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV") createDF(10, 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV") {noformat} No exception, but wrong results: {noformat} +++ | c1| c2| +++ |null|null| |null|null| |null|null| |null|null| | 0|str0| | 1|str1| | 2|str2| | 3|str3| | 4|str4| | 5|str5| | 6|str6| | 7|str7| | 8|str8| | 9|str9| +++ {noformat} Example 3: Json -> Text {noformat} createDF(0, 9).write.format("json").saveAsTable("appendJsonToText") createDF(10, 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText") {noformat} Error we got: {noformat} Text data source supports only a single column, and you have 2 columns. 
{noformat} was: Example 1: PARQUET -> CSV {noformat} createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc") createDF(10, 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc") {noformat} Error we got: {noformat} Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.RuntimeException: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [79, 82, 67, 23] {noformat} Example 2: Json -> CSV createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV") createDF(10, 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV") No exception, but wrong results: {noformat} +++ | c1| c2| +++ |null|null| |null|null| |null|null| |null|null| | 0|str0| | 1|str1| | 2|str2| | 3|str3| | 4|str4| | 5|str5| | 6|str6| | 7|str7| | 8|str8| | 9|str9| +++ {noformat} Example 3: Json -> Text {noformat} createDF(0, 9).write.format("json").saveAsTable("appendJsonToText") createDF(10, 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText") {noformat} Error we got: {noformat} Text data source supports only a single column, and you have 2 columns. 
{noformat} > Wrong Results or Strange Errors In Append-mode DataFrame Writing Due to > Mismatched File Formats > --- > > Key: SPARK-15808 > URL: https://issues.apache.org/jira/browse/SPARK-15808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Example 1: PARQUET -> CSV > {noformat} > createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc") > createDF(10, > 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc") > {noformat} > Error we got: > {noformat} > Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): > java.lang.RuntimeException: > file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-0-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc > is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but > found [79, 82, 67, 23] > {noformat} > Example 2: Json -> CSV > {noformat} > createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV") > createDF(10, > 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV") > {noformat} > No exception, but wrong results: > {noformat} > +++ > | c1| c2| > +++ > |null|null| > |null|null| > |null|null| > |null|null| > | 0|str0| > | 1|str1| > | 2|str2| > | 3|str3| > | 4|str4| > | 5|str5| > | 6|str6| > | 7|str7| > | 8|str8| > | 9|str9| > +++ > {noformat} > Example 3: Json -> Text > {noformat} > createDF(0,
[jira] [Updated] (SPARK-15804) Manually added metadata not saving with parquet
[ https://issues.apache.org/jira/browse/SPARK-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charlie Evans updated SPARK-15804: -- Description: Adding metadata with col().as(_, metadata) then saving the resultant dataframe does not save the metadata. No error is thrown. Only see the schema contains the metadata before saving and does not contain the metadata after saving and loading the dataframe. Was working fine with 1.6.1. {code} case class TestRow(a: String, b: Int) val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil val df = spark.createDataFrame(rows) import org.apache.spark.sql.types.MetadataBuilder val md = new MetadataBuilder().putString("key", "value").build() val dfWithMeta = df.select(col("a"), col("b").as("b", md)) println(dfWithMeta.schema.json) dfWithMeta.write.parquet("dfWithMeta") val dfWithMeta2 = spark.read.parquet("dfWithMeta") println(dfWithMeta2.schema.json) {code} was: Adding metadata with col().as(_, metadata) then saving the resultant dataframe does not save the metadata. No error is thrown. Only see the schema contains the metadata before saving and does not contain the metadata after saving and loading the dataframe. 
{code} case class TestRow(a: String, b: Int) val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil val df = spark.createDataFrame(rows) import org.apache.spark.sql.types.MetadataBuilder val md = new MetadataBuilder().putString("key", "value").build() val dfWithMeta = df.select(col("a"), col("b").as("b", md)) println(dfWithMeta.schema.json) dfWithMeta.write.parquet("dfWithMeta") val dfWithMeta2 = spark.read.parquet("dfWithMeta") println(dfWithMeta2.schema.json) {code} > Manually added metadata not saving with parquet > --- > > Key: SPARK-15804 > URL: https://issues.apache.org/jira/browse/SPARK-15804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Charlie Evans > > Adding metadata with col().as(_, metadata) then saving the resultant > dataframe does not save the metadata. No error is thrown. Only see the schema > contains the metadata before saving and does not contain the metadata after > saving and loading the dataframe. Was working fine with 1.6.1. > {code} > case class TestRow(a: String, b: Int) > val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil > val df = spark.createDataFrame(rows) > import org.apache.spark.sql.types.MetadataBuilder > val md = new MetadataBuilder().putString("key", "value").build() > val dfWithMeta = df.select(col("a"), col("b").as("b", md)) > println(dfWithMeta.schema.json) > dfWithMeta.write.parquet("dfWithMeta") > val dfWithMeta2 = spark.read.parquet("dfWithMeta") > println(dfWithMeta2.schema.json) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
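The round-trip check in the report boils down to: field metadata present before the write should still be present after reload. The following is a minimal Spark-free illustration using plain dicts as stand-in schemas; `save_schema` deliberately drops metadata to mimic the reported 2.0.0 behavior, and none of these names come from Spark's API:

```python
# Illustration only (no Spark): a writer that serializes field names and
# types but drops per-field metadata, reproducing the shape of the bug.
def save_schema(schema):
    """Simulate a save/load round trip that loses the metadata key."""
    return [{"name": f["name"], "type": f["type"]} for f in schema]

schema = [
    {"name": "a", "type": "string"},
    {"name": "b", "type": "int", "metadata": {"key": "value"}},
]
reloaded = save_schema(schema)
print("metadata" in reloaded[1])  # False -> metadata lost, as reported
```

A regression test for the real issue would make the analogous assertion on `dfWithMeta2.schema` after the parquet round trip shown above.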
[jira] [Commented] (SPARK-15785) Add initialModel param to Gaussian Mixture Model (GMM) in spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319100#comment-15319100 ] Gayathri Murali commented on SPARK-15785: - I will work on this. Thanks! > Add initialModel param to Gaussian Mixture Model (GMM) in spark.ml > -- > > Key: SPARK-15785 > URL: https://issues.apache.org/jira/browse/SPARK-15785 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Xinh Huynh > > Adding this param is needed for SPARK-4591: algorithm/model parity for > spark.ml. The review JIRA for clustering is SPARK-14380. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15807) Support varargs for distinct/dropDuplicates in Dataset/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-15807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15807: Assignee: Apache Spark > Support varargs for distinct/dropDuplicates in Dataset/DataFrame > > > Key: SPARK-15807 > URL: https://issues.apache.org/jira/browse/SPARK-15807 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > This issue adds `varargs`-types `distinct/dropDuplicates` functions in > `Dataset/DataFrame`. Currently, `distinct` does not get arguments, and > `dropDuplicates` supports only `Seq` or `Array`. > {code} > scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2))) > ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int] > scala> ds.dropDuplicates(Seq("_1", "_2")) > res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, > _2: int] > scala> ds.dropDuplicates("_1", "_2") > :26: error: overloaded method value dropDuplicates with alternatives: > (colNames: > Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] > (colNames: > Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] > ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] > cannot be applied to (String, String) >ds.dropDuplicates("_1", "_2") > ^ > scala> ds.distinct("_1", "_2") > :26: error: too many arguments for method distinct: > ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] >ds.distinct("_1", "_2") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15807) Support varargs for distinct/dropDuplicates in Dataset/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-15807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15807: Assignee: (was: Apache Spark) > Support varargs for distinct/dropDuplicates in Dataset/DataFrame > > > Key: SPARK-15807 > URL: https://issues.apache.org/jira/browse/SPARK-15807 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun > > This issue adds `varargs`-types `distinct/dropDuplicates` functions in > `Dataset/DataFrame`. Currently, `distinct` does not get arguments, and > `dropDuplicates` supports only `Seq` or `Array`. > {code} > scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2))) > ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int] > scala> ds.dropDuplicates(Seq("_1", "_2")) > res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, > _2: int] > scala> ds.dropDuplicates("_1", "_2") > :26: error: overloaded method value dropDuplicates with alternatives: > (colNames: > Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] > (colNames: > Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] > ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] > cannot be applied to (String, String) >ds.dropDuplicates("_1", "_2") > ^ > scala> ds.distinct("_1", "_2") > :26: error: too many arguments for method distinct: > ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] >ds.distinct("_1", "_2") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15807) Support varargs for distinct/dropDuplicates in Dataset/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-15807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319090#comment-15319090 ] Apache Spark commented on SPARK-15807: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/13545 > Support varargs for distinct/dropDuplicates in Dataset/DataFrame > > > Key: SPARK-15807 > URL: https://issues.apache.org/jira/browse/SPARK-15807 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun > > This issue adds `varargs`-types `distinct/dropDuplicates` functions in > `Dataset/DataFrame`. Currently, `distinct` does not get arguments, and > `dropDuplicates` supports only `Seq` or `Array`. > {code} > scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2))) > ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int] > scala> ds.dropDuplicates(Seq("_1", "_2")) > res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, > _2: int] > scala> ds.dropDuplicates("_1", "_2") > :26: error: overloaded method value dropDuplicates with alternatives: > (colNames: > Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] > (colNames: > Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] > ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] > cannot be applied to (String, String) >ds.dropDuplicates("_1", "_2") > ^ > scala> ds.distinct("_1", "_2") > :26: error: too many arguments for method distinct: > ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] >ds.distinct("_1", "_2") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15807) Support varargs for distinct/dropDuplicates in Dataset/DataFrame
Dongjoon Hyun created SPARK-15807: - Summary: Support varargs for distinct/dropDuplicates in Dataset/DataFrame Key: SPARK-15807 URL: https://issues.apache.org/jira/browse/SPARK-15807 Project: Spark Issue Type: Improvement Components: SQL Reporter: Dongjoon Hyun This issue adds `varargs`-types `distinct/dropDuplicates` functions in `Dataset/DataFrame`. Currently, `distinct` does not get arguments, and `dropDuplicates` supports only `Seq` or `Array`. {code} scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2))) ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int] scala> ds.dropDuplicates(Seq("_1", "_2")) res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, _2: int] scala> ds.dropDuplicates("_1", "_2") :26: error: overloaded method value dropDuplicates with alternatives: (colNames: Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] (colNames: Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] cannot be applied to (String, String) ds.dropDuplicates("_1", "_2") ^ scala> ds.distinct("_1", "_2") :26: error: too many arguments for method distinct: ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] ds.distinct("_1", "_2") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
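The varargs API shape the issue proposes can be illustrated without Spark. The `drop_duplicates` below is a plain-Python sketch over lists of dicts, not Spark code; it accepts bare column names the way `ds.dropDuplicates("_1", "_2")` would after the change:

```python
# Plain-Python sketch of a varargs dedup API: column names are passed
# directly (*cols) instead of wrapped in a sequence.
def drop_duplicates(rows, *cols):
    """Keep the first row per distinct combination of cols;
    with no cols given, deduplicate on whole rows."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[c] for c in cols) if cols else tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [{"_1": "a", "_2": 1}, {"_1": "b", "_2": 2}, {"_1": "a", "_2": 2}]
print(drop_duplicates(rows, "_1"))  # keeps the first "a" row and the "b" row
```

In Scala this is typically done with a `(col: String, cols: String*)` overload so the zero-argument and sequence overloads remain unambiguous.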
[jira] [Commented] (SPARK-15804) Manually added metadata not saving with parquet
[ https://issues.apache.org/jira/browse/SPARK-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318897#comment-15318897 ] Takeshi Yamamuro commented on SPARK-15804: -- `MetadataBuilder` is one of developer apis, so is this functionality useful for developers? Any useful scenario to use this? Anyway, this is related to not only `parquet but also other formats such as orc, csv, json... > Manually added metadata not saving with parquet > --- > > Key: SPARK-15804 > URL: https://issues.apache.org/jira/browse/SPARK-15804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Charlie Evans > > Adding metadata with col().as(_, metadata) then saving the resultant > dataframe does not save the metadata. No error is thrown. Only see the schema > contains the metadata before saving and does not contain the metadata after > saving and loading the dataframe. > {code} > case class TestRow(a: String, b: Int) > val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil > val df = spark.createDataFrame(rows) > import org.apache.spark.sql.types.MetadataBuilder > val md = new MetadataBuilder().putString("key", "value").build() > val dfWithMeta = df.select(col("a"), col("b").as("b", md)) > println(dfWithMeta.schema.json) > dfWithMeta.write.parquet("dfWithMeta") > val dfWithMeta2 = spark.read.parquet("dfWithMeta") > println(dfWithMeta2.schema.json) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15760) Documentation missing for package-related config options
[ https://issues.apache.org/jira/browse/SPARK-15760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-15760. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.0.0 > Documentation missing for package-related config options > > > Key: SPARK-15760 > URL: https://issues.apache.org/jira/browse/SPARK-15760 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 1.6.1, 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.0.0 > > > There's no documentation about the config options that correlate to the > "--packages" (and friends) arguments of spark-submit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15684) Not mask startsWith and endsWith in R
[ https://issues.apache.org/jira/browse/SPARK-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-15684. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13476 [https://github.com/apache/spark/pull/13476] > Not mask startsWith and endsWith in R > - > > Key: SPARK-15684 > URL: https://issues.apache.org/jira/browse/SPARK-15684 > Project: Spark > Issue Type: Improvement >Reporter: Miao Wang > Fix For: 2.0.0 > > > R 3.3.0 provides startsWith and endsWith. We should not mask these two > methods in Spark. Actually, SparkR has startsWith and endsWith working for > columns. But making them work for both columns and strings is not easy. I created > this JIRA for discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15684) Not mask startsWith and endsWith in R
[ https://issues.apache.org/jira/browse/SPARK-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-15684: -- Assignee: Miao Wang > Not mask startsWith and endsWith in R > - > > Key: SPARK-15684 > URL: https://issues.apache.org/jira/browse/SPARK-15684 > Project: Spark > Issue Type: Improvement >Reporter: Miao Wang >Assignee: Miao Wang > Fix For: 2.0.0 > > > R 3.3.0 introduces startsWith and endsWith. We should not mask these two > methods in Spark. Actually, SparkR has startsWith and endsWith working for > columns, but making them work for both columns and strings is not easy. I created > this JIRA for discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15799) Release SparkR on CRAN
[ https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318769#comment-15318769 ] Shivaram Venkataraman commented on SPARK-15799: --- I don't think there are any license issues, and at least before we merged SparkR into Apache the package passed all the CRAN checks. The only problem is that we might need to ship the entire Spark assembly JAR (or all the jars that we have with the new release structure) to make the package work without additional downloads. Some other minor things that might make it challenging to use SparkR directly from CRAN: 1. Matching client and cluster versions of Spark. This is still a requirement today, but the main difference is that people might upgrade CRAN packages separately from their Spark clusters. 2. Figuring out where to put scripts like spark-submit that can be used to submit batch jobs. This isn't something normal R packages offer, so I'm not sure there are existing practices we can follow here. > Release SparkR on CRAN > -- > > Key: SPARK-15799 > URL: https://issues.apache.org/jira/browse/SPARK-15799 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Xiangrui Meng > > Story: "As an R user, I would like to see SparkR released on CRAN, so I can > use SparkR easily in an existing R environment and have other packages built > on top of SparkR." > I made this JIRA with the following questions in mind: > * Are there known issues that prevent us releasing SparkR on CRAN? > * Do we want to package Spark jars in the SparkR release? > * Are there license issues? > * How does it fit into Spark's release process? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15805) update the whole sql programming guide
[ https://issues.apache.org/jira/browse/SPARK-15805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15805: Assignee: (was: Apache Spark) > update the whole sql programming guide > -- > > Key: SPARK-15805 > URL: https://issues.apache.org/jira/browse/SPARK-15805 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Weichen Xu > Original Estimate: 48h > Remaining Estimate: 48h > > The sql programming guide of spark is out-of-date in many places, including: > should using `SparkSession` instead of `SQLContext` > should using `SparkSession.builder.enableHiveSupport` instead of `HiveContext` > should using `dataFrame.write.saveAsTable` instead of `dataFrame.saveAsTable` > should using `sparkSession.catalog.cacheTable/uncacheTable` instead of > `SQLContext.cacheTable/uncacheTable` > and so on... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15805) update the whole sql programming guide
[ https://issues.apache.org/jira/browse/SPARK-15805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318766#comment-15318766 ] Apache Spark commented on SPARK-15805: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/13544 > update the whole sql programming guide > -- > > Key: SPARK-15805 > URL: https://issues.apache.org/jira/browse/SPARK-15805 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Weichen Xu > Original Estimate: 48h > Remaining Estimate: 48h > > The sql programming guide of spark is out-of-date in many places, including: > should using `SparkSession` instead of `SQLContext` > should using `SparkSession.builder.enableHiveSupport` instead of `HiveContext` > should using `dataFrame.write.saveAsTable` instead of `dataFrame.saveAsTable` > should using `sparkSession.catalog.cacheTable/uncacheTable` instead of > `SQLContext.cacheTable/uncacheTable` > and so on... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15805) update the whole sql programming guide
[ https://issues.apache.org/jira/browse/SPARK-15805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15805: Assignee: Apache Spark > update the whole sql programming guide > -- > > Key: SPARK-15805 > URL: https://issues.apache.org/jira/browse/SPARK-15805 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Weichen Xu >Assignee: Apache Spark > Original Estimate: 48h > Remaining Estimate: 48h > > The sql programming guide of spark is out-of-date in many places, including: > should using `SparkSession` instead of `SQLContext` > should using `SparkSession.builder.enableHiveSupport` instead of `HiveContext` > should using `dataFrame.write.saveAsTable` instead of `dataFrame.saveAsTable` > should using `sparkSession.catalog.cacheTable/uncacheTable` instead of > `SQLContext.cacheTable/uncacheTable` > and so on... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15806) Update doc for SPARK_MASTER_IP
[ https://issues.apache.org/jira/browse/SPARK-15806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15806: Assignee: (was: Apache Spark) > Update doc for SPARK_MASTER_IP > -- > > Key: SPARK-15806 > URL: https://issues.apache.org/jira/browse/SPARK-15806 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Bo Meng >Priority: Minor > > SPARK_MASTER_IP is a deprecated environment variable. It is replaced by > SPARK_MASTER_HOST according to MasterArguments.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15806) Update doc for SPARK_MASTER_IP
[ https://issues.apache.org/jira/browse/SPARK-15806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318761#comment-15318761 ] Apache Spark commented on SPARK-15806: -- User 'bomeng' has created a pull request for this issue: https://github.com/apache/spark/pull/13543 > Update doc for SPARK_MASTER_IP > -- > > Key: SPARK-15806 > URL: https://issues.apache.org/jira/browse/SPARK-15806 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Bo Meng >Priority: Minor > > SPARK_MASTER_IP is a deprecated environment variable. It is replaced by > SPARK_MASTER_HOST according to MasterArguments.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15806) Update doc for SPARK_MASTER_IP
[ https://issues.apache.org/jira/browse/SPARK-15806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15806: Assignee: Apache Spark > Update doc for SPARK_MASTER_IP > -- > > Key: SPARK-15806 > URL: https://issues.apache.org/jira/browse/SPARK-15806 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Bo Meng >Assignee: Apache Spark >Priority: Minor > > SPARK_MASTER_IP is a deprecated environment variable. It is replaced by > SPARK_MASTER_HOST according to MasterArguments.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15806) Update doc for SPARK_MASTER_IP
Bo Meng created SPARK-15806: --- Summary: Update doc for SPARK_MASTER_IP Key: SPARK-15806 URL: https://issues.apache.org/jira/browse/SPARK-15806 Project: Spark Issue Type: Bug Components: Documentation Reporter: Bo Meng Priority: Minor SPARK_MASTER_IP is a deprecated environment variable. It is replaced by SPARK_MASTER_HOST according to MasterArguments.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
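The change being documented is a one-line swap in conf/spark-env.sh; a minimal sketch, with master.example.com as a placeholder hostname:

```shell
# spark-env.sh: SPARK_MASTER_IP is deprecated (see MasterArguments.scala);
# use SPARK_MASTER_HOST instead. The hostname is a placeholder.
export SPARK_MASTER_HOST=master.example.com
```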
[jira] [Created] (SPARK-15805) update the whole sql programming guide
Weichen Xu created SPARK-15805: -- Summary: update the whole sql programming guide Key: SPARK-15805 URL: https://issues.apache.org/jira/browse/SPARK-15805 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 2.0.0 Reporter: Weichen Xu The SQL programming guide of Spark is out of date in many places. For example, it should use `SparkSession` instead of `SQLContext`, `SparkSession.builder.enableHiveSupport` instead of `HiveContext`, `dataFrame.write.saveAsTable` instead of `dataFrame.saveAsTable`, `sparkSession.catalog.cacheTable/uncacheTable` instead of `SQLContext.cacheTable/uncacheTable`, and so on... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
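The substitutions listed above can be sketched together in 2.0-style Scala; this is an illustrative sketch assuming a Spark 2.0 build on the classpath (the table name `t` is made up for the example):

```scala
import org.apache.spark.sql.SparkSession

// Spark 2.0: SparkSession replaces both SQLContext and HiveContext.
val spark = SparkSession.builder()
  .appName("sql-guide-example")
  .enableHiveSupport()            // was: new HiveContext(sc)
  .getOrCreate()

val df = spark.range(10).toDF("id")
df.write.saveAsTable("t")         // was: df.saveAsTable("t")
spark.catalog.cacheTable("t")     // was: sqlContext.cacheTable("t")
spark.catalog.uncacheTable("t")   // was: sqlContext.uncacheTable("t")
```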
[jira] [Commented] (SPARK-15801) spark-submit --num-executors switch also works without YARN
[ https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318721#comment-15318721 ] Marcelo Vanzin commented on SPARK-15801: I'm not really sure of how standalone works these days after all the changes for dynamic allocation. [~andrewor14] might be a better person to ask. > spark-submit --num-executors switch also works without YARN > --- > > Key: SPARK-15801 > URL: https://issues.apache.org/jira/browse/SPARK-15801 > Project: Spark > Issue Type: Documentation > Components: Spark Submit >Affects Versions: 1.6.1 >Reporter: Jonathan Taws >Priority: Minor > > Based on this [issue|https://issues.apache.org/jira/browse/SPARK-15781] > regarding the SPARK_WORKER_INSTANCES property, I also found that the > {{--num-executors}} switch documented in the spark-submit help is partially > incorrect. > Here's one part of the output (produced by {{spark-submit --help}}): > {code} > YARN-only: > --driver-cores NUM Number of cores used by the driver, only in > cluster mode > (Default: 1). > --queue QUEUE_NAME The YARN queue to submit to (Default: > "default"). > --num-executors NUM Number of executors to launch (Default: 2). > {code} > Correct me if I am wrong, but the num-executors switch also works in Spark > standalone mode *without YARN*. > I tried by only launching a master and a worker with 4 executors specified, > and they were all successfully spawned. The master switch pointed to the > master's url, and not to the yarn value. > Here's the exact command : {{spark-submit --master spark://[local > machine]:7077 --num-executors 4 --executor-cores 2}} > By default it is *1* executor per worker in Spark standalone mode without > YARN, but this option enables to specify the number of executors (per worker > ?) if, and only if, the {{--executor-cores}} switch is also set. I do believe > it defaults to 2 in YARN mode. 
> I would propose to move the option from the *YARN-only* section to the *Spark > standalone and YARN only* section. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15652) Missing org.apache.spark.launcher.SparkAppHandle.Listener notification if SparkSubmit JVM shutsdown
[ https://issues.apache.org/jira/browse/SPARK-15652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318711#comment-15318711 ] Marcelo Vanzin commented on SPARK-15652: I'm a little worried about that because it touches a public API, even though it's just adding something that shouldn't cause issues. I also haven't seen much activity towards a new 1.6 point release... let me think about it. > Missing org.apache.spark.launcher.SparkAppHandle.Listener notification if > SparkSubmit JVM shutsdown > --- > > Key: SPARK-15652 > URL: https://issues.apache.org/jira/browse/SPARK-15652 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Subroto Sanyal >Assignee: Subroto Sanyal >Priority: Critical > Fix For: 2.0.0 > > Attachments: SPARK-15652-1.patch, spark-launcher-client-hang.jar > > > h6. Problem > In case SparkSubmit JVM goes down even before sending the job complete > notification; the _org.apache.spark.launcher.SparkAppHandle.Listener_ will > not receive any notification which may lead to the client using SparkLauncher > hang indefinitely. > h6. Root Cause > No proper exception handling at > org.apache.spark.launcher.LauncherConnection#run when an EOFException is > encountered while reading over Socket Stream. Mostly EOFException will be > thrown at the suggested > point(_org.apache.spark.launcher.LauncherConnection.run(LauncherConnection.java:58)_) > if the SparkSubmit JVM is shutdown. > Probably, it was assumed that SparkSubmit JVM can shut down only with normal > healthy completion but, there could be scenarios where this is not the case: > # OS kill the SparkSubmit process using OOM Killer. > # Exception while SparkSubmit submits the job, even before it starts > monitoring the application. This can happen if SparkLauncher is not > configured properly. 
There might be no exception handling in > org.apache.spark.deploy.yarn.Client#submitApplication(), which may lead to > any exception/throwable at this point lead to shutting down of JVM without > proper finalisation > h6. Possible Solutions > # In case of EOFException or any other exception notify the listeners that > job has failed > # Better exception handling on the SparkSubmit JVM side (though this may not > resolve the problem completely) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
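Possible solution (1) above amounts to treating an unexpected EOF on the launcher socket as a terminal state rather than dying silently. A minimal sketch of the idea — not the actual LauncherConnection code, and `onMessage`/`onLost` are made-up names for illustration:

```scala
import java.io.{EOFException, ObjectInputStream}

// Sketch: read loop over the launcher socket that reports a lost
// application when the peer JVM goes away before sending a final state.
def readLoop(in: ObjectInputStream,
             onMessage: AnyRef => Unit,
             onLost: () => Unit): Unit = {
  try {
    while (true) onMessage(in.readObject())
  } catch {
    // EOF here means the SparkSubmit JVM closed (or was killed) without
    // a clean handshake; surface that to listeners instead of hanging.
    case _: EOFException => onLost()
  }
}
```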
[jira] [Commented] (SPARK-15755) java.lang.NullPointerException when run spark 2.0 setting spark.serializer=org.apache.spark.serializer.KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-15755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318692#comment-15318692 ] Bo Meng commented on SPARK-15755: - Could you provide a test case to reproduce the issue? > java.lang.NullPointerException when run spark 2.0 setting > spark.serializer=org.apache.spark.serializer.KryoSerializer > - > > Key: SPARK-15755 > URL: https://issues.apache.org/jira/browse/SPARK-15755 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: marymwu > > java.lang.NullPointerException when run spark 2.0 setting > spark.serializer=org.apache.spark.serializer.KryoSerializer > 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) 
> at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) > at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) > at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649) > at java.util.PriorityQueue.siftUp(PriorityQueue.java:627) > at java.util.PriorityQueue.offer(PriorityQueue.java:329) > at java.util.PriorityQueue.add(PriorityQueue.java:306) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31) > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:711) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125) > ... 
15 more > 16/05/27 15:15:28 ERROR TaskResultGetter: Exception while getting task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:793) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793) > at >
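Going by the BoundedPriorityQueue and LazilyGeneratedOrdering frames in the trace, the failing object is a top-k queue shipped back in a task result. A speculative reproduction sketch (untested, assuming a 2.0 build; the query itself is made up) that exercises that deserialization path:

```scala
import org.apache.spark.sql.SparkSession

// Speculative repro: takeOrdered ships a BoundedPriorityQueue (with a
// generated row ordering) inside the task result, which is the object the
// Kryo trace above fails to deserialize on the driver.
val spark = SparkSession.builder()
  .appName("kryo-npe-repro")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

spark.range(1000000L).toDF("id").rdd.takeOrdered(10)(Ordering.by(_.getLong(0)))
```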
[jira] [Commented] (SPARK-15652) Missing org.apache.spark.launcher.SparkAppHandle.Listener notification if SparkSubmit JVM shutsdown
[ https://issues.apache.org/jira/browse/SPARK-15652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318683#comment-15318683 ] Subroto Sanyal commented on SPARK-15652: hi [~vanzin] Can this be merged to 1.6 branch? > Missing org.apache.spark.launcher.SparkAppHandle.Listener notification if > SparkSubmit JVM shutsdown > --- > > Key: SPARK-15652 > URL: https://issues.apache.org/jira/browse/SPARK-15652 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Subroto Sanyal >Assignee: Subroto Sanyal >Priority: Critical > Fix For: 2.0.0 > > Attachments: SPARK-15652-1.patch, spark-launcher-client-hang.jar > > > h6. Problem > In case SparkSubmit JVM goes down even before sending the job complete > notification; the _org.apache.spark.launcher.SparkAppHandle.Listener_ will > not receive any notification which may lead to the client using SparkLauncher > hang indefinitely. > h6. Root Cause > No proper exception handling at > org.apache.spark.launcher.LauncherConnection#run when an EOFException is > encountered while reading over Socket Stream. Mostly EOFException will be > thrown at the suggested > point(_org.apache.spark.launcher.LauncherConnection.run(LauncherConnection.java:58)_) > if the SparkSubmit JVM is shutdown. > Probably, it was assumed that SparkSubmit JVM can shut down only with normal > healthy completion but, there could be scenarios where this is not the case: > # OS kill the SparkSubmit process using OOM Killer. > # Exception while SparkSubmit submits the job, even before it starts > monitoring the application. This can happen if SparkLauncher is not > configured properly. There might be no exception handling in > org.apache.spark.deploy.yarn.Client#submitApplication(), which may lead to > any exception/throwable at this point lead to shutting down of JVM without > proper finalisation > h6. 
Possible Solutions > # In case of EOFException or any other exception notify the listeners that > job has failed > # Better exception handling on the SparkSubmit JVM side (though this may not > resolve the problem completely) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15730) [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take effect in spark-sql session
[ https://issues.apache.org/jira/browse/SPARK-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15730: Assignee: (was: Apache Spark) > [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take > effect in spark-sql session > - > > Key: SPARK-15730 > URL: https://issues.apache.org/jira/browse/SPARK-15730 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yi Zhou >Priority: Critical > > /usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g > --executor-cores 5 --num-executors 31 --master yarn-client --conf > spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01 > spark-sql> use test; > 16/06/02 21:36:15 INFO execution.SparkSqlParser: Parsing command: use test > 16/06/02 21:36:15 INFO spark.SparkContext: Starting job: processCmd at > CliDriver.java:376 > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Got job 2 (processCmd at > CliDriver.java:376) with 1 output partitions > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 > (processCmd at CliDriver.java:376) > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Parents of final stage: List() > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Missing parents: List() > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting ResultStage 2 > (MapPartitionsRDD[8] at processCmd at CliDriver.java:376), which has no > missing parents > 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2 stored as values > in memory (estimated size 3.2 KB, free 2.4 GB) > 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as > bytes in memory (estimated size 1964.0 B, free 2.4 GB) > 16/06/02 21:36:15 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in > memory on 192.168.3.11:36189 (size: 1964.0 B, free: 2.4 GB) > 16/06/02 21:36:15 INFO spark.SparkContext: Created broadcast 2 from broadcast > at DAGScheduler.scala:1012 > 16/06/02 21:36:15 
INFO scheduler.DAGScheduler: Submitting 1 missing tasks > from ResultStage 2 (MapPartitionsRDD[8] at processCmd at CliDriver.java:376) > 16/06/02 21:36:15 INFO cluster.YarnScheduler: Adding task set 2.0 with 1 tasks > 16/06/02 21:36:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage > 2.0 (TID 2, 192.168.3.13, partition 0, PROCESS_LOCAL, 5362 bytes) > 16/06/02 21:36:15 INFO cluster.YarnClientSchedulerBackend: Launching task 2 > on executor id: 10 hostname: 192.168.3.13. > 16/06/02 21:36:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in > memory on hw-node3:45924 (size: 1964.0 B, free: 4.4 GB) > 16/06/02 21:36:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage > 2.0 (TID 2) in 1934 ms on 192.168.3.13 (1/1) > 16/06/02 21:36:17 INFO cluster.YarnScheduler: Removed TaskSet 2.0, whose > tasks have all completed, from pool > 16/06/02 21:36:17 INFO scheduler.DAGScheduler: ResultStage 2 (processCmd at > CliDriver.java:376) finished in 1.937 s > 16/06/02 21:36:17 INFO scheduler.DAGScheduler: Job 2 finished: processCmd at > CliDriver.java:376, took 1.962631 s > Time taken: 2.027 seconds > 16/06/02 21:36:17 INFO CliDriver: Time taken: 2.027 seconds > spark-sql> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE}; > 16/06/02 21:36:36 INFO execution.SparkSqlParser: Parsing command: DROP TABLE > IF EXISTS ${hiveconf:RESULT_TABLE} > Error in query: > mismatched input '$' expecting {'ADD', 'AS', 'ALL', 'GROUP', 'BY', > 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'LIMIT', 'AT', 'IN', 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'OUTER', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', > 'RANGE', 'ROWS', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', > 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', > 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'TO', > 'TABLESAMPLE', 'ALTER', 
'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', > 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'IF', > 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', > 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', > 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', > 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'EXTENDED', > 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, > 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', > 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', > 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE',
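For reference, the usage being exercised above is Hive-style variable substitution: the variable is set at launch with --hiveconf and referenced as ${hiveconf:NAME} in the session. A sketch of the expected flow (whether Spark 2.0 actually performs the expansion is exactly what this issue disputes):

```shell
# Set the variable when launching the CLI...
spark-sql --hiveconf RESULT_TABLE=test_result01
# ...then reference it inside the session. The CLI should expand
# ${hiveconf:RESULT_TABLE} to test_result01 before parsing; in the log
# above the literal '$' instead reaches SparkSqlParser and fails with
# "mismatched input '$'".
#   spark-sql> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
```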
[jira] [Commented] (SPARK-15730) [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take effect in spark-sql session
[ https://issues.apache.org/jira/browse/SPARK-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318654#comment-15318654 ] Cheng Hao commented on SPARK-15730: --- [~jameszhouyi], can you please verify this fixing? > [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take > effect in spark-sql session > - > > Key: SPARK-15730 > URL: https://issues.apache.org/jira/browse/SPARK-15730 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yi Zhou >Priority: Critical > > /usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g > --executor-cores 5 --num-executors 31 --master yarn-client --conf > spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01 > spark-sql> use test; > 16/06/02 21:36:15 INFO execution.SparkSqlParser: Parsing command: use test > 16/06/02 21:36:15 INFO spark.SparkContext: Starting job: processCmd at > CliDriver.java:376 > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Got job 2 (processCmd at > CliDriver.java:376) with 1 output partitions > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 > (processCmd at CliDriver.java:376) > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Parents of final stage: List() > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Missing parents: List() > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting ResultStage 2 > (MapPartitionsRDD[8] at processCmd at CliDriver.java:376), which has no > missing parents > 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2 stored as values > in memory (estimated size 3.2 KB, free 2.4 GB) > 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as > bytes in memory (estimated size 1964.0 B, free 2.4 GB) > 16/06/02 21:36:15 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in > memory on 192.168.3.11:36189 (size: 1964.0 B, free: 2.4 GB) > 16/06/02 21:36:15 INFO spark.SparkContext: Created broadcast 2 from 
broadcast > at DAGScheduler.scala:1012 > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting 1 missing tasks > from ResultStage 2 (MapPartitionsRDD[8] at processCmd at CliDriver.java:376) > 16/06/02 21:36:15 INFO cluster.YarnScheduler: Adding task set 2.0 with 1 tasks > 16/06/02 21:36:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage > 2.0 (TID 2, 192.168.3.13, partition 0, PROCESS_LOCAL, 5362 bytes) > 16/06/02 21:36:15 INFO cluster.YarnClientSchedulerBackend: Launching task 2 > on executor id: 10 hostname: 192.168.3.13. > 16/06/02 21:36:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in > memory on hw-node3:45924 (size: 1964.0 B, free: 4.4 GB) > 16/06/02 21:36:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage > 2.0 (TID 2) in 1934 ms on 192.168.3.13 (1/1) > 16/06/02 21:36:17 INFO cluster.YarnScheduler: Removed TaskSet 2.0, whose > tasks have all completed, from pool > 16/06/02 21:36:17 INFO scheduler.DAGScheduler: ResultStage 2 (processCmd at > CliDriver.java:376) finished in 1.937 s > 16/06/02 21:36:17 INFO scheduler.DAGScheduler: Job 2 finished: processCmd at > CliDriver.java:376, took 1.962631 s > Time taken: 2.027 seconds > 16/06/02 21:36:17 INFO CliDriver: Time taken: 2.027 seconds > spark-sql> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE}; > 16/06/02 21:36:36 INFO execution.SparkSqlParser: Parsing command: DROP TABLE > IF EXISTS ${hiveconf:RESULT_TABLE} > Error in query: > mismatched input '$' expecting {'ADD', 'AS', 'ALL', 'GROUP', 'BY', > 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'LIMIT', 'AT', 'IN', 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'OUTER', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', > 'RANGE', 'ROWS', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', > 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', > 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 
'PARTITIONS', 'FUNCTIONS', 'DROP', 'TO', > 'TABLESAMPLE', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', > 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'IF', > 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', > 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', > 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', > 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'EXTENDED', > 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, > 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', > 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', > 'FILEFORMAT', 'TOUCH',
[jira] [Commented] (SPARK-15730) [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take effect in spark-sql session
[ https://issues.apache.org/jira/browse/SPARK-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318648#comment-15318648 ] Apache Spark commented on SPARK-15730: -- User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/13542 > [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take > effect in spark-sql session > - > > Key: SPARK-15730 > URL: https://issues.apache.org/jira/browse/SPARK-15730 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yi Zhou >Priority: Critical > > /usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g > --executor-cores 5 --num-executors 31 --master yarn-client --conf > spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01 > spark-sql> use test; > 16/06/02 21:36:15 INFO execution.SparkSqlParser: Parsing command: use test > 16/06/02 21:36:15 INFO spark.SparkContext: Starting job: processCmd at > CliDriver.java:376 > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Got job 2 (processCmd at > CliDriver.java:376) with 1 output partitions > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 > (processCmd at CliDriver.java:376) > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Parents of final stage: List() > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Missing parents: List() > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting ResultStage 2 > (MapPartitionsRDD[8] at processCmd at CliDriver.java:376), which has no > missing parents > 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2 stored as values > in memory (estimated size 3.2 KB, free 2.4 GB) > 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as > bytes in memory (estimated size 1964.0 B, free 2.4 GB) > 16/06/02 21:36:15 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in > memory on 192.168.3.11:36189 (size: 1964.0 B, free: 2.4 GB) > 16/06/02 
21:36:15 INFO spark.SparkContext: Created broadcast 2 from broadcast > at DAGScheduler.scala:1012 > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting 1 missing tasks > from ResultStage 2 (MapPartitionsRDD[8] at processCmd at CliDriver.java:376) > 16/06/02 21:36:15 INFO cluster.YarnScheduler: Adding task set 2.0 with 1 tasks > 16/06/02 21:36:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage > 2.0 (TID 2, 192.168.3.13, partition 0, PROCESS_LOCAL, 5362 bytes) > 16/06/02 21:36:15 INFO cluster.YarnClientSchedulerBackend: Launching task 2 > on executor id: 10 hostname: 192.168.3.13. > 16/06/02 21:36:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in > memory on hw-node3:45924 (size: 1964.0 B, free: 4.4 GB) > 16/06/02 21:36:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage > 2.0 (TID 2) in 1934 ms on 192.168.3.13 (1/1) > 16/06/02 21:36:17 INFO cluster.YarnScheduler: Removed TaskSet 2.0, whose > tasks have all completed, from pool > 16/06/02 21:36:17 INFO scheduler.DAGScheduler: ResultStage 2 (processCmd at > CliDriver.java:376) finished in 1.937 s > 16/06/02 21:36:17 INFO scheduler.DAGScheduler: Job 2 finished: processCmd at > CliDriver.java:376, took 1.962631 s > Time taken: 2.027 seconds > 16/06/02 21:36:17 INFO CliDriver: Time taken: 2.027 seconds > spark-sql> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE}; > 16/06/02 21:36:36 INFO execution.SparkSqlParser: Parsing command: DROP TABLE > IF EXISTS ${hiveconf:RESULT_TABLE} > Error in query: > mismatched input '$' expecting {'ADD', 'AS', 'ALL', 'GROUP', 'BY', > 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'LIMIT', 'AT', 'IN', 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'OUTER', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', > 'RANGE', 'ROWS', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', > 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', > 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 
'CODEGEN', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'TO', > 'TABLESAMPLE', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', > 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'IF', > 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', > 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', > 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', > 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'EXTENDED', > 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, > 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', > 'STORED', 'DIRECTORIES', 'LOCATION',
[jira] [Assigned] (SPARK-15730) [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take effect in spark-sql session
[ https://issues.apache.org/jira/browse/SPARK-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15730: Assignee: Apache Spark > [Spark SQL] the value of 'hiveconf' parameter in Spark-sql CLI don't take > effect in spark-sql session > - > > Key: SPARK-15730 > URL: https://issues.apache.org/jira/browse/SPARK-15730 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yi Zhou >Assignee: Apache Spark >Priority: Critical > > /usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g > --executor-cores 5 --num-executors 31 --master yarn-client --conf > spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01 > spark-sql> use test; > 16/06/02 21:36:15 INFO execution.SparkSqlParser: Parsing command: use test > 16/06/02 21:36:15 INFO spark.SparkContext: Starting job: processCmd at > CliDriver.java:376 > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Got job 2 (processCmd at > CliDriver.java:376) with 1 output partitions > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 > (processCmd at CliDriver.java:376) > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Parents of final stage: List() > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Missing parents: List() > 16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting ResultStage 2 > (MapPartitionsRDD[8] at processCmd at CliDriver.java:376), which has no > missing parents > 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2 stored as values > in memory (estimated size 3.2 KB, free 2.4 GB) > 16/06/02 21:36:15 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as > bytes in memory (estimated size 1964.0 B, free 2.4 GB) > 16/06/02 21:36:15 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in > memory on 192.168.3.11:36189 (size: 1964.0 B, free: 2.4 GB) > 16/06/02 21:36:15 INFO spark.SparkContext: Created broadcast 2 from broadcast > at DAGScheduler.scala:1012 > 
16/06/02 21:36:15 INFO scheduler.DAGScheduler: Submitting 1 missing tasks > from ResultStage 2 (MapPartitionsRDD[8] at processCmd at CliDriver.java:376) > 16/06/02 21:36:15 INFO cluster.YarnScheduler: Adding task set 2.0 with 1 tasks > 16/06/02 21:36:15 INFO scheduler.TaskSetManager: Starting task 0.0 in stage > 2.0 (TID 2, 192.168.3.13, partition 0, PROCESS_LOCAL, 5362 bytes) > 16/06/02 21:36:15 INFO cluster.YarnClientSchedulerBackend: Launching task 2 > on executor id: 10 hostname: 192.168.3.13. > 16/06/02 21:36:16 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in > memory on hw-node3:45924 (size: 1964.0 B, free: 4.4 GB) > 16/06/02 21:36:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage > 2.0 (TID 2) in 1934 ms on 192.168.3.13 (1/1) > 16/06/02 21:36:17 INFO cluster.YarnScheduler: Removed TaskSet 2.0, whose > tasks have all completed, from pool > 16/06/02 21:36:17 INFO scheduler.DAGScheduler: ResultStage 2 (processCmd at > CliDriver.java:376) finished in 1.937 s > 16/06/02 21:36:17 INFO scheduler.DAGScheduler: Job 2 finished: processCmd at > CliDriver.java:376, took 1.962631 s > Time taken: 2.027 seconds > 16/06/02 21:36:17 INFO CliDriver: Time taken: 2.027 seconds > spark-sql> DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE}; > 16/06/02 21:36:36 INFO execution.SparkSqlParser: Parsing command: DROP TABLE > IF EXISTS ${hiveconf:RESULT_TABLE} > Error in query: > mismatched input '$' expecting {'ADD', 'AS', 'ALL', 'GROUP', 'BY', > 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'LIMIT', 'AT', 'IN', 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'OUTER', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', > 'RANGE', 'ROWS', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', > 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', > 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'SHOW', 'TABLES', > 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'TO', > 
'TABLESAMPLE', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', > 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'IF', > 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', > 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', > 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', > 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'EXTENDED', > 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, > 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', > 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', > 'FILEFORMAT', 'TOUCH', 'COMPACT',
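The parser error above is a variable-substitution gap: the CLI accepted --hiveconf RESULT_TABLE=test_result01, but the literal text ${hiveconf:RESULT_TABLE} was never expanded before the statement reached the SQL parser, so the raw '$' triggered the "mismatched input" error. A rough sketch of the expansion step the Hive CLI normally performs (plain Python; substitute_hiveconf is a hypothetical helper, not Spark's actual implementation):

```python
import re

# Hypothetical model of hiveconf variable substitution: replace every
# ${hiveconf:KEY} occurrence with its configured value before parsing.
# Unknown keys are left untouched, mirroring Hive CLI behavior.
def substitute_hiveconf(command: str, hiveconf: dict) -> str:
    pattern = re.compile(r"\$\{hiveconf:([^}]+)\}")
    return pattern.sub(lambda m: hiveconf.get(m.group(1), m.group(0)), command)

cmd = "DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE}"
print(substitute_hiveconf(cmd, {"RESULT_TABLE": "test_result01"}))
# -> DROP TABLE IF EXISTS test_result01
```

With this step in place, the parser would only ever see the substituted table name, never the `${...}` syntax.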
[jira] [Resolved] (SPARK-13570) pyspark save with partitionBy is very slow
[ https://issues.apache.org/jira/browse/SPARK-13570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13570. --- Resolution: Incomplete > pyspark save with partitionBy is very slow > -- > > Key: SPARK-13570 > URL: https://issues.apache.org/jira/browse/SPARK-13570 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Shubhanshu Mishra > Labels: dataframe, partitioning, pyspark, save > > Running the following code to store data from each year and pos in a separate > folder for a very large dataframe is taking a huge amount of time. (>37 hours > for 60% of the work) > {code} > ## IPYTHON was started using the following command: > # IPYTHON=1 "$SPARK_HOME/bin/pyspark" --driver-memory 50g > from pyspark import SparkContext, SparkConf > from pyspark.sql import SQLContext, Row > from pyspark.sql.types import * > conf = SparkConf() > conf.setMaster("local[30]") > conf.setAppName("analysis") > conf.set("spark.local.dir", "./tmp") > conf.set("spark.executor.memory", "50g") > conf.set("spark.driver.maxResultSize", "5g") > sc = SparkContext(conf=conf) > sqlContext = SQLContext(sc) > df = sqlContext.read.format("csv").options(header=False, inferschema=True, > delimiter="\t").load("out/new_features") > df = df.selectExpr(*("%s as %s" % (df.columns[i], k) for i,k in > enumerate(columns))) > # year can take values from [1902,2015] > # pos takes integer values from [-1,0,1,2] > # df is a dataframe with 20 columns and 1 billion rows > # Running on Machine with 32 cores and 500 GB RAM > df.write.save("out/model_input_partitioned", format="csv", > partitionBy=["year", "pos"], delimiter="\t") > {code} > Currently, the code is at: > [Stage 12:==>(1367 + 30) / > 2290] > And it has already been more than 37 hours. A single sweep on this data for > filter by value takes less than 6.5 minutes. 
> The spark web interface shows the following lines for the 2 stages of the job:
> Stage | Description | Submitted | Duration | Tasks: succeeded/total | Input
> 11 | load at NativeMethodAccessorImpl.java:-2 | 2016/02/27 23:07:04 | 6.5 min | 2290/2290 | 66.8 GB
> 12 | save at NativeMethodAccessorImpl.java:-2 | 2016/02/27 23:15:59 | 37.1 h | 1370/2290 | 40.9 GB
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
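A back-of-envelope count (plain Python, no Spark required; the task count and partition-value ranges are taken from the report above) shows why an un-shuffled partitionBy write can be this slow: each of the 2290 write tasks may encounter rows from any (year, pos) combination and must open an output file for each one it sees. A commonly suggested mitigation, not verified here, is to call df.repartition("year", "pos") before the write so each combination's rows are co-located in one task:

```python
# Worst-case count of per-task output files in an un-shuffled partitionBy
# write, using the figures from the report above.
years = 2015 - 1902 + 1      # year in [1902, 2015] -> 114 values
pos_values = 4               # pos in [-1, 0, 1, 2] -> 4 values
tasks = 2290                 # write tasks shown in the stage progress bar

combinations = years * pos_values      # distinct output directories
worst_case_files = tasks * combinations

print(combinations)          # 456 (year, pos) combinations
print(worst_case_files)      # 1044240 potential per-task output files

# After repartitioning on the partition columns, each combination lands in
# a single task, so the file count collapses to roughly `combinations`.
```

Over a million small-file open/flush operations against 456 directories is a plausible source of the 37-hour write the reporter observed.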
[jira] [Commented] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching
[ https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318558#comment-15318558 ] Sean Owen commented on SPARK-15796: --- I'm not sure what you mean about storing RDDs that don't fit in memory, but that's perfectly fine. I am suggesting that it's not surprising that you need to do some tuning to use nearly all the heap, since GC time will increase a lot as you get close to this limit and needs some extra help to work efficiently. This is what this boils down to: the settings are causing Spark to mis-use the new generation, really, and it's expensive to keep GCing the long-lived objects there that never die. But this isn't an exotic use case and really ought not happen out of the box. I agree that I don't think it makes sense to allow Spark to cache (inherently, long lived objects) more memory than is available in the old gen (the place for long-lived objects that don't need much GC attention). I think the resolution is to change the defaults accordingly. > Spark 1.6 default memory settings can cause heavy GC when caching > - > > Key: SPARK-15796 > URL: https://issues.apache.org/jira/browse/SPARK-15796 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gabor Feher >Priority: Minor > > While debugging performance issues in a Spark program, I've found a simple > way to slow down Spark 1.6 significantly by filling the RDD memory cache. > This seems to be a regression, because setting > "spark.memory.useLegacyMode=true" fixes the problem. 
Here is a repro that is > just a simple program that fills the memory cache of Spark using a > MEMORY_ONLY cached RDD (but of course this comes up in more complex > situations, too): > {code} > import org.apache.spark.SparkContext > import org.apache.spark.SparkConf > import org.apache.spark.storage.StorageLevel > object CacheDemoApp { > def main(args: Array[String]) { > val conf = new SparkConf().setAppName("Cache Demo Application") > > val sc = new SparkContext(conf) > val startTime = System.currentTimeMillis() > > > val cacheFiller = sc.parallelize(1 to 5, 1000) > > .mapPartitionsWithIndex { > case (ix, it) => > println(s"CREATE DATA PARTITION ${ix}") > > val r = new scala.util.Random(ix) > it.map(x => (r.nextLong, r.nextLong)) > } > cacheFiller.persist(StorageLevel.MEMORY_ONLY) > cacheFiller.foreach(identity) > val finishTime = System.currentTimeMillis() > val elapsedTime = (finishTime - startTime) / 1000 > println(s"TIME= $elapsedTime s") > } > } > {code} > If I call it the following way, it completes in around 5 minutes on my > Laptop, while often stopping for slow Full GC cycles. I can also see with > jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled. > {code} > sbt package > ~/spark-1.6.0/bin/spark-submit \ > --class "CacheDemoApp" \ > --master "local[2]" \ > --driver-memory 3g \ > --driver-java-options "-XX:+PrintGCDetails" \ > target/scala-2.10/simple-project_2.10-1.0.jar > {code} > If I add any one of the below flags, then the run-time drops to around 40-50 > seconds and the difference is coming from the drop in GC times: > --conf "spark.memory.fraction=0.6" > OR > --conf "spark.memory.useLegacyMode=true" > OR > --driver-java-options "-XX:NewRatio=3" > All the other cache types except for DISK_ONLY produce similar symptoms. It > looks like that the problem is that the amount of data Spark wants to store > long-term ends up being larger than the old generation size in the JVM and > this triggers Full GC repeatedly. 
> I did some research: > * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It > defaults to 0.75. > * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache > size. It defaults to 0.6 and... > * http://spark.apache.org/docs/1.5.2/configuration.html even says that it > shouldn't be bigger than the size of the old generation. > * On the other hand, OpenJDK's default NewRatio is 2, which means an old > generation size of 66%. Hence the default value in Spark 1.6 contradicts this > advice. > http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old > generation is running close to full, then setting > spark.memory.storageFraction to a lower value should help. I have tried with > spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is > not a surprise:
[jira] [Comment Edited] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching
[ https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318530#comment-15318530 ] Gabor Feher edited comment on SPARK-15796 at 6/7/16 2:15 PM: - MEMORY_ONLY caching works in a way that when a partition doesn't fit into the memory, then it won't save it in the memory cache region. It prints stuff like this: {code} 16/06/07 06:35:27 INFO MemoryStore: Will not store rdd_1_464 as it would require dropping another block from the same RDD 16/06/07 06:35:27 WARN MemoryStore: Not enough space to cache rdd_1_464 in memory! (computed 5.5 MB so far) {code} MEMORY_AND_DISK caching works in a way that if a partition doesn't fit into the memory, then it saves it to the disk. It prints stuff like this: {code} 16/06/07 06:46:39 WARN CacheManager: Persisting partition rdd_1_99 to disk instead. {code} In the MEMORY_ONLY case, if I shouldn't expect it to work with too much data as you suggest, then why Spark even bothers dropping the blocks from memory? If it's a non-goal to store oversized RDDs, then it would be much simpler to just throw an OOM. In the MEMORY_AND_DISK case, I can see the exact same GC issue with MEMORY_ONLY. But there the whole point should be that we are caching RDDs that don't fit into the memory, no? So, these two behaviors made me assume that Spark will work even if I try to cache too big stuff. I understand if you say that this is a JVM-implementation dependent issue, I have no idea how many people are using other JVMs than OpenJDK. But this raises the question: are there any situations when it makes sense to raise "spark.memory.fraction" above the old generation size? With caching I can say it doesn't make sense, but maybe execution could use it meaningfully? 
Maybe it is worth mentioning that my use case is not that exotic: we are developing a program based on Spark that works with user-provided data: so there is no way to say at implementation time whether a particular RDD will fit into memory or not. Speaking of storageFraction, I was not trying to say that there is a problem with it. But the following sentence in http://spark.apache.org/docs/1.6.1/tuning.html is not correct, if I understand correctly: {quote} In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching by lowering spark.memory.storageFraction; it is better to cache fewer objects than to slow down task execution! {quote} Because storageFraction will not actually reduce the amount of cache unless execution needs more memory. Thanks for looking into the issue! To sum up, this is at least a bug in the documentation: * tuning.html should have better advice for when OldGen is close to being full * I'd prefer a mention of these GC issues somewhere near the cache docs, given that many people are using OpenJDK with default settings I believe. was (Author: gfeher): MEMORY_ONLY caching works in a way that when a partition doesn't fit into the memory, then it won't save it in the memory cache region. It prints stuff like this: {code} 16/06/07 06:35:27 INFO MemoryStore: Will not store rdd_1_464 as it would require dropping another block from the same RDD 16/06/07 06:35:27 WARN MemoryStore: Not enough space to cache rdd_1_464 in memory! (computed 5.5 MB so far) {code} MEMORY_AND_DISK caching works in a way that if a partition doesn't fit into the memory, then it saves it to the disk. It prints stuff like this: {code} 16/06/07 06:46:39 WARN CacheManager: Persisting partition rdd_1_99 to disk instead. {code} In the MEMORY_ONLY case, if I shouldn't expect it to work with too much data as you suggest, then why Spark even bothers dropping the blocks from memory? 
If it's a non-goal to store oversized RDDs, then it would be much simpler to just throw an OOM. In the MEMORY_AND_DISK case, I can see the exact same GC issue with MEMORY_ONLY. But there the whole point should be that we are caching RDDs that don't fit into the memory, no? So, these two behaviors made me assume that Spark will work even if I try to cache too big stuff. I understand if you say that this is a JVM-implementation dependent issue, I have no idea how many people are using other JVMs than OpenJDK. But this raises the question: are there any situations when it makes sense to raise "spark.memory.fraction" above the old generation size? With caching I can say it doesn't make sense, but maybe execution could use it meaningfully? Maybe it is worth mentioning that my use case is not that exotic: we are developing a program based on Spark that works with user-provided data: so there is no way to say at implementation time whether a particular RDD will fit into memory or not. Speaking of storageFraction, I was not trying to say that there is a problem with it. But the following sentence in
[jira] [Commented] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching
[ https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318530#comment-15318530 ] Gabor Feher commented on SPARK-15796: - MEMORY_ONLY caching works in a way that when a partition doesn't fit into the memory, then it won't save it in the memory cache region. It prints stuff like this: {code} 16/06/07 06:35:27 INFO MemoryStore: Will not store rdd_1_464 as it would require dropping another block from the same RDD 16/06/07 06:35:27 WARN MemoryStore: Not enough space to cache rdd_1_464 in memory! (computed 5.5 MB so far) {code} MEMORY_AND_DISK caching works in a way that if a partition doesn't fit into the memory, then it saves it to the disk. It prints stuff like this: {code} 16/06/07 06:46:39 WARN CacheManager: Persisting partition rdd_1_99 to disk instead. {code} In the MEMORY_ONLY case, if I shouldn't expect it to work with too much data as you suggest, then why Spark even bothers dropping the blocks from memory? If it's a non-goal to store oversized RDDs, then it would be much simpler to just throw an OOM. In the MEMORY_AND_DISK case, I can see the exact same GC issue with MEMORY_ONLY. But there the whole point should be that we are caching RDDs that don't fit into the memory, no? So, these two behaviors made me assume that Spark will work even if I try to cache too big stuff. I understand if you say that this is a JVM-implementation dependent issue, I have no idea how many people are using other JVMs than OpenJDK. But this raises the question: are there any situations when it makes sense to raise "spark.memory.fraction" above the old generation size? With caching I can say it doesn't make sense, but maybe execution could use it meaningfully? Maybe it is worth mentioning that my use case is not that exotic: we are developing a program based on Spark that works with user-provided data: so there is no way to say at implementation time whether a particular RDD will fit into memory or not. 
Speaking of storageFraction, I was not trying to say that there is a problem with it. But the following sentence in http://spark.apache.org/docs/1.6.1/tuning.html is not correct, if I understand correctly: {quote} In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching by lowering spark.memory.storageFraction; it is better to cache fewer objects than to slow down task execution! {quote} Because storageFraction will not actually reduce the amount of cache unless execution needs more memory. Thanks for looking into the issue! To sum up, this is at least a bug in the documentation: * tuning.html should have better advice for when OldGen is close to being full * I'd prefer a mention of these GC issues somewhere near the cache docs, given that many people are using OpenJDK with default settings I believe. > Spark 1.6 default memory settings can cause heavy GC when caching > - > > Key: SPARK-15796 > URL: https://issues.apache.org/jira/browse/SPARK-15796 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gabor Feher >Priority: Minor > > While debugging performance issues in a Spark program, I've found a simple > way to slow down Spark 1.6 significantly by filling the RDD memory cache. > This seems to be a regression, because setting > "spark.memory.useLegacyMode=true" fixes the problem. 
Here is a repro that is > just a simple program that fills the memory cache of Spark using a > MEMORY_ONLY cached RDD (but of course this comes up in more complex > situations, too): > {code} > import org.apache.spark.SparkContext > import org.apache.spark.SparkConf > import org.apache.spark.storage.StorageLevel > object CacheDemoApp { > def main(args: Array[String]) { > val conf = new SparkConf().setAppName("Cache Demo Application") > > val sc = new SparkContext(conf) > val startTime = System.currentTimeMillis() > > > val cacheFiller = sc.parallelize(1 to 5, 1000) > > .mapPartitionsWithIndex { > case (ix, it) => > println(s"CREATE DATA PARTITION ${ix}") > > val r = new scala.util.Random(ix) > it.map(x => (r.nextLong, r.nextLong)) > } > cacheFiller.persist(StorageLevel.MEMORY_ONLY) > cacheFiller.foreach(identity) > val finishTime = System.currentTimeMillis() > val elapsedTime = (finishTime - startTime) / 1000 > println(s"TIME= $elapsedTime s") > } > } > {code} > If I call it the following way, it
[jira] [Commented] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching
[ https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318526#comment-15318526 ] Sean Owen commented on SPARK-15796: --- To leave a little extra room and to match the old behavior -- yeah reasonable to me. CC [~andrewor14]? > Spark 1.6 default memory settings can cause heavy GC when caching > - > > Key: SPARK-15796 > URL: https://issues.apache.org/jira/browse/SPARK-15796 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gabor Feher >Priority: Minor > > While debugging performance issues in a Spark program, I've found a simple > way to slow down Spark 1.6 significantly by filling the RDD memory cache. > This seems to be a regression, because setting > "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is > just a simple program that fills the memory cache of Spark using a > MEMORY_ONLY cached RDD (but of course this comes up in more complex > situations, too): > {code} > import org.apache.spark.SparkContext > import org.apache.spark.SparkConf > import org.apache.spark.storage.StorageLevel > object CacheDemoApp { > def main(args: Array[String]) { > val conf = new SparkConf().setAppName("Cache Demo Application") > > val sc = new SparkContext(conf) > val startTime = System.currentTimeMillis() > > > val cacheFiller = sc.parallelize(1 to 5, 1000) > > .mapPartitionsWithIndex { > case (ix, it) => > println(s"CREATE DATA PARTITION ${ix}") > > val r = new scala.util.Random(ix) > it.map(x => (r.nextLong, r.nextLong)) > } > cacheFiller.persist(StorageLevel.MEMORY_ONLY) > cacheFiller.foreach(identity) > val finishTime = System.currentTimeMillis() > val elapsedTime = (finishTime - startTime) / 1000 > println(s"TIME= $elapsedTime s") > } > } > {code} > If I call it the following way, it completes in around 5 minutes on my > Laptop, while often stopping for slow Full GC cycles. 
I can also see with > jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled. > {code} > sbt package > ~/spark-1.6.0/bin/spark-submit \ > --class "CacheDemoApp" \ > --master "local[2]" \ > --driver-memory 3g \ > --driver-java-options "-XX:+PrintGCDetails" \ > target/scala-2.10/simple-project_2.10-1.0.jar > {code} > If I add any one of the below flags, then the run-time drops to around 40-50 > seconds and the difference is coming from the drop in GC times: > --conf "spark.memory.fraction=0.6" > OR > --conf "spark.memory.useLegacyMode=true" > OR > --driver-java-options "-XX:NewRatio=3" > All the other cache types except for DISK_ONLY produce similar symptoms. It > looks like that the problem is that the amount of data Spark wants to store > long-term ends up being larger than the old generation size in the JVM and > this triggers Full GC repeatedly. > I did some research: > * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It > defaults to 0.75. > * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache > size. It defaults to 0.6 and... > * http://spark.apache.org/docs/1.5.2/configuration.html even says that it > shouldn't be bigger than the size of the old generation. > * On the other hand, OpenJDK's default NewRatio is 2, which means an old > generation size of 66%. Hence the default value in Spark 1.6 contradicts this > advice. > http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old > generation is running close to full, then setting > spark.memory.storageFraction to a lower value should help. I have tried with > spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is > not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html > explains that storageFraction is not an upper-limit but a lower limit-like > thing on the size of Spark's cache. The real upper limit is > spark.memory.fraction. 
> To sum up my questions/issues: > * At least http://spark.apache.org/docs/1.6.1/tuning.html should be fixed. > Maybe the old generation size should also be mentioned in configuration.html > near spark.memory.fraction. > * Is it a goal for Spark to support heavy caching with default parameters and > without GC breakdown? If so, then better default values are needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
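The generation-sizing arithmetic running through this thread can be checked directly (plain Python; the fractions and flag values are the ones quoted in the discussion above):

```python
# With the default -XX:NewRatio=2, old gen : new gen = 2 : 1, so the old
# generation holds 2/3 of the heap.
new_ratio = 2
old_gen_fraction = new_ratio / (new_ratio + 1)
print(round(old_gen_fraction, 3))                 # 0.667

# Spark 1.6's default spark.memory.fraction (0.75) exceeds that, so cached
# long-lived blocks can outgrow the old generation and force repeated full GCs.
spark_memory_fraction = 0.75
print(spark_memory_fraction > old_gen_fraction)   # True

# Each workaround from the report keeps cacheable memory within old gen:
print(0.6 <= old_gen_fraction)                    # True: spark.memory.fraction=0.6
old_gen_with_ratio_3 = 3 / (3 + 1)                # -XX:NewRatio=3 -> old gen = 0.75
print(old_gen_with_ratio_3 >= spark_memory_fraction)  # True
```

This is the core of the contradiction the reporter identified: the 1.5 default (0.6) sat safely inside the 66% old generation, while the 1.6 default (0.75) does not.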
[jira] [Commented] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching
[ https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318523#comment-15318523 ] Daniel Darabos commented on SPARK-15796: > The only argument against it was that it's specific to the OpenJDK default. I think Gabor has only tested with OpenJDK, but the default for {{NewRatio}} is the same in Oracle Java 8 Server JVM according to https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/sizing.html. > I think this issue still exists even with the fraction set to 0.66, because > of course if you are using any memory at all for other stuff, some of that > can't fit in the old generation. There will always be some need to tune GC > params when that becomes the bottleneck. Good point. Maybe 0.6 would be the best default? If everything fit in old-gen in 1.5, it would probably still fit in the old-gen that way. > Spark 1.6 default memory settings can cause heavy GC when caching > - > > Key: SPARK-15796 > URL: https://issues.apache.org/jira/browse/SPARK-15796 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gabor Feher >Priority: Minor > > While debugging performance issues in a Spark program, I've found a simple > way to slow down Spark 1.6 significantly by filling the RDD memory cache. > This seems to be a regression, because setting > "spark.memory.useLegacyMode=true" fixes the problem. 
Here is a repro that is > just a simple program that fills the memory cache of Spark using a > MEMORY_ONLY cached RDD (but of course this comes up in more complex > situations, too): > {code} > import org.apache.spark.SparkContext > import org.apache.spark.SparkConf > import org.apache.spark.storage.StorageLevel > object CacheDemoApp { > def main(args: Array[String]) { > val conf = new SparkConf().setAppName("Cache Demo Application") > > val sc = new SparkContext(conf) > val startTime = System.currentTimeMillis() > > > val cacheFiller = sc.parallelize(1 to 5, 1000) > > .mapPartitionsWithIndex { > case (ix, it) => > println(s"CREATE DATA PARTITION ${ix}") > > val r = new scala.util.Random(ix) > it.map(x => (r.nextLong, r.nextLong)) > } > cacheFiller.persist(StorageLevel.MEMORY_ONLY) > cacheFiller.foreach(identity) > val finishTime = System.currentTimeMillis() > val elapsedTime = (finishTime - startTime) / 1000 > println(s"TIME= $elapsedTime s") > } > } > {code} > If I call it the following way, it completes in around 5 minutes on my > Laptop, while often stopping for slow Full GC cycles. I can also see with > jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled. > {code} > sbt package > ~/spark-1.6.0/bin/spark-submit \ > --class "CacheDemoApp" \ > --master "local[2]" \ > --driver-memory 3g \ > --driver-java-options "-XX:+PrintGCDetails" \ > target/scala-2.10/simple-project_2.10-1.0.jar > {code} > If I add any one of the below flags, then the run-time drops to around 40-50 > seconds and the difference is coming from the drop in GC times: > --conf "spark.memory.fraction=0.6" > OR > --conf "spark.memory.useLegacyMode=true" > OR > --driver-java-options "-XX:NewRatio=3" > All the other cache types except for DISK_ONLY produce similar symptoms. It > looks like that the problem is that the amount of data Spark wants to store > long-term ends up being larger than the old generation size in the JVM and > this triggers Full GC repeatedly. 
> I did some research: > * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It > defaults to 0.75. > * In Spark 1.5, spark.storage.memoryFraction is the upper limit on cache > size. It defaults to 0.6 and... > * http://spark.apache.org/docs/1.5.2/configuration.html even says that it > shouldn't be bigger than the size of the old generation. > * On the other hand, OpenJDK's default NewRatio is 2, which means an old > generation size of 66%. Hence the default value in Spark 1.6 contradicts this > advice. > http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old > generation is running close to full, then setting > spark.memory.storageFraction to a lower value should help. I have tried with > spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is > not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html > explains that storageFraction is not an upper limit but more like a lower > limit on the size of Spark's cache. The real upper limit is > spark.memory.fraction.
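The arithmetic behind the contradiction described above can be sketched as follows. This is a hedged illustration: the helper names are ours, not part of any Spark or JVM API, and the model ignores Spark's reserved-memory carve-out and other overheads.

```python
# Relate the JVM's -XX:NewRatio flag to the fraction of the heap occupied by
# the old generation, and compare it with Spark's cache target
# (spark.memory.fraction). Helper names are illustrative only.

def old_gen_fraction(new_ratio):
    """With -XX:NewRatio=N, old gen : young gen = N : 1,
    so the old generation occupies N / (N + 1) of the heap."""
    return new_ratio / (new_ratio + 1.0)

def cache_fits_in_old_gen(memory_fraction, new_ratio):
    """True if Spark's maximum storage+execution memory (roughly
    spark.memory.fraction of the heap) fits inside the old generation."""
    return memory_fraction <= old_gen_fraction(new_ratio)

# Spark 1.6 default (0.75) vs OpenJDK default NewRatio=2 (old gen = 2/3):
print(cache_fits_in_old_gen(0.75, 2))  # False -> long-lived cache spills into young gen
# Spark 1.5's default of 0.6 fits under the same JVM defaults:
print(cache_fits_in_old_gen(0.60, 2))  # True
# Raising NewRatio to 3 makes the old gen 75% of the heap, so 0.75 fits:
print(cache_fits_in_old_gen(0.75, 3))  # True
```

This is why any one of `spark.memory.fraction=0.6` or `-XX:NewRatio=3` makes the Full GC cycles disappear in the repro: both move the long-lived cache back under the old-generation capacity.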
[jira] [Commented] (SPARK-15564) App name is the main class name in Spark streaming jobs
[ https://issues.apache.org/jira/browse/SPARK-15564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318520#comment-15318520 ] Sean Owen commented on SPARK-15564: --- On further review, I don't see how there's a null appName here. There isn't a call to createNewSparkContext with a null app name. The constructor you invoke in both cases preserves the provided conf object, which should have its spark.app.name already set. Are you sure there isn't something else at work in the code that's omitted here? I don't yet see how this could be a difference. > App name is the main class name in Spark streaming jobs > --- > > Key: SPARK-15564 > URL: https://issues.apache.org/jira/browse/SPARK-15564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Steven Lowenthal >Priority: Minor > > I've tried everything to set the app name to something other than the class > name of the job, but spark reports the application name as the class. This > adversely affects the ability to monitor jobs, we can't have dots in the > reported app name. > {code:title=job.scala} > val defaultAppName = "NDS Transform" >conf.setAppName(defaultAppName) >println (s"App Name: ${conf.get("spark.app.name")}") > ... > val ssc = new StreamingContext(conf, streamingBatchWindow) > {code} > {code:title=output} > App Name: NDS Transform > {code} > Application IDName > app-20160526161230-0017 (kill) com.gracenote.ongo.spark.NDSStreamAvro -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching
[ https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318496#comment-15318496 ] Sean Owen commented on SPARK-15796: --- Yeah, sounds like we should change the default min cache size so that it fits in, at least, OpenJDK's default old gen. The only argument against it was that it's specific to the OpenJDK default. I don't know if that's a Spark problem; it just raises the JVM-tuning issue that was always there. But not surprising people out of the box has value too. I think this issue still exists even with the fraction set to 0.66, because if you are using any memory at all for other stuff, some of that can't fit in the old generation. There will always be some need to tune GC params when that becomes the bottleneck. > Spark 1.6 default memory settings can cause heavy GC when caching > ----------------- > Key: SPARK-15796 > URL: https://issues.apache.org/jira/browse/SPARK-15796 > Project: Spark > Issue Type: Improvement > Affects Versions: 1.6.0, 1.6.1 > Reporter: Gabor Feher > Priority: Minor
[jira] [Commented] (SPARK-15065) HiveSparkSubmitSuite's "set spark.sql.warehouse.dir" is flaky
[ https://issues.apache.org/jira/browse/SPARK-15065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318492#comment-15318492 ] Pete Robbins commented on SPARK-15065: -- I think this may be related to https://issues.apache.org/jira/browse/SPARK-15606 where there is a deadlock in executor shutdown. This test was consistently failing on our machine with only 2 cores but since my fix to SPARK-15606 it has passed all the time. > HiveSparkSubmitSuite's "set spark.sql.warehouse.dir" is flaky > - > > Key: SPARK-15065 > URL: https://issues.apache.org/jira/browse/SPARK-15065 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Reporter: Yin Huai >Priority: Critical > Attachments: log.txt > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/861/testReport/junit/org.apache.spark.sql.hive/HiveSparkSubmitSuite/dir/ > There are several WARN messages like {{16/05/02 00:51:06 WARN Master: Got > status update for unknown executor app-20160502005054-/3}}, which are > suspicious. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching
[ https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318486#comment-15318486 ] Daniel Darabos commented on SPARK-15796: The example program takes less than a minute on Spark 1.5 and 5 minutes on Spark 1.6, using the default configuration in both cases. In neither case do we run out of memory. The old generation size defaults to 66% and Spark caching in Spark 1.5 defaults to 60%, so with default settings the cache fits in the old generation in 1.5. But in 1.6 the default cache size is increased to 75% so it no longer fits in the old generation. This kills performance. (And the regression is very hard to debug. Kudos to Gabor Feher!) The default settings have been changed in Spark 1.6 to give a 5x slowdown, and the documentation for the current settings does not make a note of this. Only the documentation for the deprecated {{spark.storage.memoryFraction}} mentions the issue, but its default value had been chosen so that the issue was not triggered by default. This also has to be documented for the new settings. Unless someone never uses cache, they are going to hit this issue if they run with the default settings. I think this is bad enough to warrant changing the defaults. I propose defaulting {{spark.memory.fraction}} to 0.6. If someone wants to set {{spark.memory.fraction}} to 0.75 they need to also set {{-XX:NewRatio=3}} to avoid GC thrashing. (Another option is to set {{-XX:NewRatio=3}} by default, but I think it's a vendor-specific flag.) What is the argument against defaulting {{spark.memory.fraction}} to 0.6? 
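The proposal above can be checked with rough numbers for the 3g driver heap used in the repro. This is an approximate back-of-the-envelope model (it ignores spark.memory.fraction's reserved-memory carve-out, so the figures are not what Spark itself would report):

```python
# Approximate sizing for the 3 GB driver heap from the repro program,
# comparing Spark's long-lived memory target with old-generation capacity.

HEAP_GB = 3.0

def sizes(memory_fraction, new_ratio):
    cache_target = HEAP_GB * memory_fraction           # max storage+execution memory
    old_gen = HEAP_GB * new_ratio / (new_ratio + 1.0)  # old-generation capacity
    return cache_target, old_gen

# Spark 1.6 / OpenJDK defaults: the data Spark wants to keep long-term
# (~2.25 GB) exceeds the old generation (~2.0 GB), so Full GCs thrash.
target, old = sizes(0.75, 2)
print(round(target, 2), round(old, 2))  # 2.25 2.0

# A default of 0.6 brings the target back under the old-gen capacity.
target, old = sizes(0.60, 2)
print(round(target, 2), round(old, 2))  # 1.8 2.0
```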
> Spark 1.6 default memory settings can cause heavy GC when caching > ----------------- > Key: SPARK-15796 > URL: https://issues.apache.org/jira/browse/SPARK-15796 > Project: Spark > Issue Type: Improvement > Affects Versions: 1.6.0, 1.6.1 > Reporter: Gabor Feher > Priority: Minor
[jira] [Resolved] (SPARK-15787) Display more helpful error messages for several invalid operations
[ https://issues.apache.org/jira/browse/SPARK-15787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15787. --- Resolution: Duplicate Fix Version/s: (was: 1.2.1) Please comment on the other JIRA with details, and if it's the same issue we can reopen it. > Display more helpful error messages for several invalid operations > -- > > Key: SPARK-15787 > URL: https://issues.apache.org/jira/browse/SPARK-15787 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: nalin garg > > Referencing SPARK-5063. The issue has reappeared. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15801) spark-submit --num-executors switch also works without YARN
[ https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318447#comment-15318447 ] Jonathan Taws commented on SPARK-15801: --- Indeed, I am getting the same behavior. After quickly sifting through the code, it looks like the num-executors option isn't taken into account in standalone mode, based on the {{[allocateWorkerResourceToExecutors|https://github.com/apache/spark/blob/d5911d1173fe0872f21cae6c47abf8ff479345a4/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L673]}} method. > spark-submit --num-executors switch also works without YARN > --- > > Key: SPARK-15801 > URL: https://issues.apache.org/jira/browse/SPARK-15801 > Project: Spark > Issue Type: Documentation > Components: Spark Submit > Affects Versions: 1.6.1 > Reporter: Jonathan Taws > Priority: Minor > > Based on this [issue|https://issues.apache.org/jira/browse/SPARK-15781] > regarding the SPARK_WORKER_INSTANCES property, I also found that the > {{--num-executors}} switch documented in the spark-submit help is partially > incorrect. > Here's one part of the output (produced by {{spark-submit --help}}): > {code} > YARN-only: > --driver-cores NUM Number of cores used by the driver, only in > cluster mode > (Default: 1). > --queue QUEUE_NAME The YARN queue to submit to (Default: > "default"). > --num-executors NUM Number of executors to launch (Default: 2). > {code} > Correct me if I am wrong, but the num-executors switch also works in Spark > standalone mode *without YARN*. > I tried launching only a master and a worker with 4 executors specified, > and they were all successfully spawned. The master switch pointed to the > master's url, and not to the yarn value. 
> Here's the exact command : {{spark-submit --master spark://[local > machine]:7077 --num-executors 4 --executor-cores 2}} > By default it is *1* executor per worker in Spark standalone mode without > YARN, but this option makes it possible to specify the number of executors (per worker > ?) if, and only if, the {{--executor-cores}} switch is also set. I do believe > it defaults to 2 in YARN mode. > I would propose to move the option from the *YARN-only* section to the *Spark > standalone and YARN only* section. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
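The standalone allocation behavior discussed in this thread can be modeled roughly as follows. This is a simplified sketch of what `allocateWorkerResourceToExecutors` appears to do, not the actual `Master.scala` code, and the function name below is ours:

```python
# Simplified model of standalone-mode executor allocation on one worker:
# if --executor-cores is set, the worker's cores are carved into as many
# executors of that size as fit; otherwise a single executor takes all
# cores. --num-executors plays no role here, matching the observation
# that it "doesn't seem to do anything" in standalone mode.

def executors_on_worker(cores_available, executor_cores=None):
    if executor_cores is None:
        return 1  # one executor grabs all the worker's cores
    return cores_available // executor_cores

# An 8-core worker with --executor-cores 2 yields 4 executors,
# regardless of any --num-executors value:
print(executors_on_worker(8, 2))  # 4
# Without --executor-cores, one executor uses all cores:
print(executors_on_worker(8))     # 1
```

Under this model, the "4 executors" both commenters observed would simply be cores-per-worker divided by `--executor-cores`, which is consistent with `--num-executors` being a YARN-only setting.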
[jira] [Commented] (SPARK-15779) SQL context fails when Hive uses Tez as its default execution engine
[ https://issues.apache.org/jira/browse/SPARK-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318443#comment-15318443 ] Alexandre Linte commented on SPARK-15779: - Thank you for your reply Zhang, You're right, I'm using the same hive-site.xml for Hive and Spark (this is a symbolic link). I will try with a copy of the hive-site.xml for spark. > SQL context fails when Hive uses Tez as its default execution engine > > > Key: SPARK-15779 > URL: https://issues.apache.org/jira/browse/SPARK-15779 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit, SQL >Affects Versions: 1.6.1 > Environment: Hadoop 2.7.2, Spark 1.6.1, Hive 2.0.1, Tez 0.8.3 >Reporter: Alexandre Linte > > By default, Hive uses MapReduce as its default execution engine. Since Hive > 2.0.0, MapReduce is deprecated. > To avoid this deprecation, I decided to use Tez instead of MapReduce as the > default execution engine. Unfortunately, this choice had an impact on Spark. > Now when I start Spark the SQL context fails with the following error: > {noformat} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.6.1 > /_/ > Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_85) > Type in expressions to have them evaluated. > Type :help for more information. > Spark context available as sc. 
> java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning > at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:529) > at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:204) > at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238) > at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218) > at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:440) > at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028) > at $iwC$$iwC.<init>(<console>:15) > at $iwC.<init>(<console>:24) > at <init>(<console>:26) > at .<init>(<console>:30) > at .<clinit>(<console>) > at .<init>(<console>:7) > at .<clinit>(<console>) > at $print(<console>) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:132) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:124) > at > org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324) > at >
[jira] [Commented] (SPARK-15781) Misleading deprecated property in standalone cluster configuration documentation
[ https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318439#comment-15318439 ] Sean Owen commented on SPARK-15781: --- Yeah, I'd love for someone who really knows standalone to confirm that. If it's true, OK. Empirically that does look right. > Misleading deprecated property in standalone cluster configuration > documentation > > > Key: SPARK-15781 > URL: https://issues.apache.org/jira/browse/SPARK-15781 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.1 >Reporter: Jonathan Taws >Priority: Minor > > I am unsure if this is regarded as an issue or not, but in the > [latest|http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts] > documentation for the configuration to launch Spark in stand-alone cluster > mode, the following property is documented : > |SPARK_WORKER_INSTANCES| Number of worker instances to run on each > machine (default: 1). You can make this more than 1 if you have have very > large machines and would like multiple Spark worker processes. If you do set > this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores > per worker, or else each worker will try to use all the cores.| > However, once I launch Spark with the spark-submit utility and the property > {{SPARK_WORKER_INSTANCES}} set in my spark-env.sh file, I get the following > deprecated warning : > {code} > 16/06/06 16:38:28 WARN SparkConf: > SPARK_WORKER_INSTANCES was detected (set to '4'). > This is deprecated in Spark 1.0+. > Please instead use: > - ./spark-submit with --num-executors to specify the number of executors > - Or set SPARK_EXECUTOR_INSTANCES > - spark.executor.instances to configure the number of instances in the spark > config. > {code} > Is this regarded as normal practice to have deprecated fields documented in > the documentation ? 
> I would have preferred to directly know about the --num-executors property > than to have to submit my application and find a deprecated warning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15792) [SQL] Allows operator to change the verbosity in explain output.
[ https://issues.apache.org/jira/browse/SPARK-15792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15792: -- Assignee: Sean Zhong > [SQL] Allows operator to change the verbosity in explain output. > ---------------- > > Key: SPARK-15792 > URL: https://issues.apache.org/jira/browse/SPARK-15792 > Project: Spark > Issue Type: Improvement > Reporter: Sean Zhong > Assignee: Sean Zhong > Priority: Minor > Fix For: 2.0.0 > > > We should allow an operator (physical plan or logical plan) to change the > verbosity in explain output. > For example, we may not want to display {{output=[count(a)#48L]}} in > less-verbose mode. > {code} > scala> spark.sql("select count(a) from df").explain() > == Physical Plan == > *HashAggregate(key=[], functions=[count(1)], output=[count(a)#48L]) > +- Exchange SinglePartition > +- *HashAggregate(key=[], functions=[partial_count(1)], output=[count#50L]) > +- LocalTableScan > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15781) Misleading deprecated property in standalone cluster configuration documentation
[ https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318426#comment-15318426 ] Jonathan Taws edited comment on SPARK-15781 at 6/7/16 1:00 PM: --- Then a little sentence like this one could do the trick after the end of [this section|http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications] : If you are looking to run multiple executors on the same worker, you can pass the option --executor-cores , which will create as many workers with cores as there are cores available for this worker. was (Author: jonathantaws): Then a little sentence like this one could do the trick : If you are looking to run multiple executors on the same worker, you can pass the option --executor-cores , which will create as many workers with cores as there are cores available for this worker. > Misleading deprecated property in standalone cluster configuration > documentation > ---------------- > > Key: SPARK-15781 > URL: https://issues.apache.org/jira/browse/SPARK-15781 > Project: Spark > Issue Type: Documentation > Components: Documentation > Affects Versions: 1.6.1 > Reporter: Jonathan Taws > Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15781) Misleading deprecated property in standalone cluster configuration documentation
[ https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318426#comment-15318426 ] Jonathan Taws commented on SPARK-15781: --- Then a little sentence like this one could do the trick : If you are looking to run multiple executors on the same worker, you can pass the option --executor-cores , which will create as many workers with cores as there are cores available for this worker. > Misleading deprecated property in standalone cluster configuration > documentation > ---------------- > > Key: SPARK-15781 > URL: https://issues.apache.org/jira/browse/SPARK-15781 > Project: Spark > Issue Type: Documentation > Components: Documentation > Affects Versions: 1.6.1 > Reporter: Jonathan Taws > Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15804) Manually added metadata not saving with parquet
Charlie Evans created SPARK-15804: - Summary: Manually added metadata not saving with parquet Key: SPARK-15804 URL: https://issues.apache.org/jira/browse/SPARK-15804 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Charlie Evans Adding metadata with col().as(_, metadata) and then saving the resulting dataframe does not save the metadata. No error is thrown. The schema contains the metadata before saving, but no longer contains it after saving and loading the dataframe. {code} case class TestRow(a: String, b: Int) val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil val df = spark.createDataFrame(rows) import org.apache.spark.sql.types.MetadataBuilder val md = new MetadataBuilder().putString("key", "value").build() val dfWithMeta = df.select(col("a"), col("b").as("b", md)) println(dfWithMeta.schema.json) dfWithMeta.write.parquet("dfWithMeta") val dfWithMeta2 = spark.read.parquet("dfWithMeta") println(dfWithMeta2.schema.json) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15801) spark-submit --num-executors switch also works without YARN
[ https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318392#comment-15318392 ] Sean Owen commented on SPARK-15801: --- I get the result you get _without_ {{--num-executors}}. I've kind of forgotten how standalone mode is supposed to work, so hopefully that is still expected behavior. But {{--num-executors}} doesn't seem to do anything. I get 4 regardless of the value I set. CC [~vanzin] to see if that's supposed to generate a warning or whatever. > spark-submit --num-executors switch also works without YARN > --- > > Key: SPARK-15801 > URL: https://issues.apache.org/jira/browse/SPARK-15801 > Project: Spark > Issue Type: Documentation > Components: Spark Submit > Affects Versions: 1.6.1 > Reporter: Jonathan Taws > Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-15802) SparkSQL connection fail using shell command "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default""
[ https://issues.apache.org/jira/browse/SPARK-15802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-15802: --- oops, didn't yet mean to resolve > SparkSQL connection fail using shell command "bin/beeline -u > "jdbc:hive2://*.*.*.*:10000/default"" > -- > > Key: SPARK-15802 > URL: https://issues.apache.org/jira/browse/SPARK-15802 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: marymwu > > Steps to reproduce: > 1. execute shell "sbin/start-thriftserver.sh --master yarn"; > 2. execute shell "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default""; > Actual result: > SparkSQL connection failed and the log shows as follows: > 16/06/07 14:49:18 WARN HttpParser: Illegal character 0x1 in state=START for > buffer > HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type: > application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00} > 16/06/07 14:49:18 WARN HttpParser: badMessage: 400 Illegal character 0x1 for > HttpChannelOverHttp@718db102{r=0,c=false,a=IDLE,uri=} > 16/06/07 14:49:19 WARN HttpParser: Illegal character 0x1 in state=START for > buffer > HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type: > application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00} > 16/06/07 14:49:19 WARN HttpParser: badMessage: 400 Illegal character 0x1 for > HttpChannelOverHttp@195db217{r=0,c=false,a=IDLE,uri=} > note: > The SparkSQL connection succeeds if using the shell command "bin/beeline -u > "jdbc:hive2://*.*.*.*:10000/default;transportMode=http;httpPath=cliservice"" > Two parameters (transportMode and httpPath) have been added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15802) SparkSQL connection fail using shell command "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default""
[ https://issues.apache.org/jira/browse/SPARK-15802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15802. --- Resolution: Fixed Doesn't that just mean you used the wrong protocol, and when you specified the right protocol, it worked? I don't see a Spark problem there. > SparkSQL connection fail using shell command "bin/beeline -u > "jdbc:hive2://*.*.*.*:10000/default"" > -- > > Key: SPARK-15802 > URL: https://issues.apache.org/jira/browse/SPARK-15802 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: marymwu > > Steps to reproduce: > 1. execute shell "sbin/start-thriftserver.sh --master yarn"; > 2. execute shell "bin/beeline -u "jdbc:hive2://*.*.*.*:10000/default""; > Actual result: > SparkSQL connection failed and the log shows as follows: > 16/06/07 14:49:18 WARN HttpParser: Illegal character 0x1 in state=START for > buffer > HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type: > application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00} > 16/06/07 14:49:18 WARN HttpParser: badMessage: 400 Illegal character 0x1 for > HttpChannelOverHttp@718db102{r=0,c=false,a=IDLE,uri=} > 16/06/07 14:49:19 WARN HttpParser: Illegal character 0x1 in state=START for > buffer > HeapByteBuffer@485a5ad9[p=1,l=35,c=16384,r=34]={\x01<<<\x00\x00\x00\x05PLAIN\x05\x00\x00\x00\x14\x00an...ymous\x00anonymous>>>Type: > application...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00} > 16/06/07 14:49:19 WARN HttpParser: badMessage: 400 Illegal character 0x1 for > HttpChannelOverHttp@195db217{r=0,c=false,a=IDLE,uri=} > note: > The SparkSQL connection succeeds if using the shell command "bin/beeline -u > "jdbc:hive2://*.*.*.*:10000/default;transportMode=http;httpPath=cliservice"" > Two parameters (transportMode and httpPath) have been added. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15781) Misleading deprecated property in standalone cluster configuration documentation
[ https://issues.apache.org/jira/browse/SPARK-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318379#comment-15318379 ] Sean Owen commented on SPARK-15781: --- These are reasonable ideas, though I think the idea is to move away from env variables entirely eventually. Hence I'd be fine just removing this deprecated one. > Misleading deprecated property in standalone cluster configuration > documentation > > > Key: SPARK-15781 > URL: https://issues.apache.org/jira/browse/SPARK-15781 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.1 >Reporter: Jonathan Taws >Priority: Minor > > I am unsure whether this is regarded as an issue or not, but in the > [latest|http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts] > documentation for the configuration to launch Spark in standalone cluster > mode, the following property is documented: > |SPARK_WORKER_INSTANCES| Number of worker instances to run on each > machine (default: 1). You can make this more than 1 if you have very > large machines and would like multiple Spark worker processes. If you do set > this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores > per worker, or else each worker will try to use all the cores.| > However, once I launch Spark with the spark-submit utility and the property > {{SPARK_WORKER_INSTANCES}} set in my spark-env.sh file, I get the following > deprecation warning: > {code} > 16/06/06 16:38:28 WARN SparkConf: > SPARK_WORKER_INSTANCES was detected (set to '4'). > This is deprecated in Spark 1.0+. > Please instead use: > - ./spark-submit with --num-executors to specify the number of executors > - Or set SPARK_EXECUTOR_INSTANCES > - spark.executor.instances to configure the number of instances in the spark > config. > {code} > Is it normal practice to have deprecated fields documented in > the documentation? 
> I would have preferred to learn about the --num-executors option directly > rather than having to submit my application and find a deprecation warning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
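The deprecation warning quoted above names {{spark.executor.instances}} as the config-based replacement for SPARK_WORKER_INSTANCES. A rough sketch of that migration, where {{FakeConf}} is a hypothetical stand-in for pyspark's SparkConf (same chaining {{set}}/{{get}} shape) so the example runs without a Spark installation:

```python
# FakeConf is a hypothetical stand-in for pyspark.SparkConf; only the
# set/get chaining behavior is mimicked here.
class FakeConf:
    def __init__(self):
        self._settings = {}

    def set(self, key, value):
        self._settings[key] = value
        return self  # chainable, like SparkConf.set

    def get(self, key, default=None):
        return self._settings.get(key, default)

# Instead of exporting SPARK_WORKER_INSTANCES=4 in spark-env.sh, the
# warning suggests configuring the executor count in the Spark config
# (or passing --num-executors to spark-submit):
conf = FakeConf().set("spark.executor.instances", "4")
print(conf.get("spark.executor.instances"))
```

The same key can equally be passed on the command line as {{--conf spark.executor.instances=4}}; the point is that the setting lives in the Spark config rather than in a deprecated env variable.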
[jira] [Assigned] (SPARK-15803) Support with statement syntax for SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15803: Assignee: (was: Apache Spark) > Support with statement syntax for SparkSession > -- > > Key: SPARK-15803 > URL: https://issues.apache.org/jira/browse/SPARK-15803 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Priority: Minor > > It would be nice to support with statement syntax for SparkSession like > following > {code} > with SparkSession.builder.(...).getOrCreate() as session: > session.sql("show tables").show() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15803) Support with statement syntax for SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318375#comment-15318375 ] Apache Spark commented on SPARK-15803: -- User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/13541 > Support with statement syntax for SparkSession > -- > > Key: SPARK-15803 > URL: https://issues.apache.org/jira/browse/SPARK-15803 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Priority: Minor > > It would be nice to support with statement syntax for SparkSession like > following > {code} > with SparkSession.builder.(...).getOrCreate() as session: > session.sql("show tables").show() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15803) Support with statement syntax for SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15803: Assignee: Apache Spark > Support with statement syntax for SparkSession > -- > > Key: SPARK-15803 > URL: https://issues.apache.org/jira/browse/SPARK-15803 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Assignee: Apache Spark >Priority: Minor > > It would be nice to support with statement syntax for SparkSession like > following > {code} > with SparkSession.builder.(...).getOrCreate() as session: > session.sql("show tables").show() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15803) Support with statement syntax for SparkSession
Jeff Zhang created SPARK-15803: -- Summary: Support with statement syntax for SparkSession Key: SPARK-15803 URL: https://issues.apache.org/jira/browse/SPARK-15803 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.0.0 Reporter: Jeff Zhang Priority: Minor It would be nice to support the with statement syntax for SparkSession, like the following: {code} with SparkSession.builder.(...).getOrCreate() as session: session.sql("show tables").show() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
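The proposal above amounts to giving SparkSession Python's context-manager protocol: define __enter__/__exit__ so the session is stopped when the with-block exits, even on error. A minimal sketch of that idea, using a hypothetical FakeSession stand-in (not the real SparkSession) so it runs without Spark:

```python
class FakeSession:
    """Hypothetical stand-in for SparkSession, illustrating the proposal."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

    def __enter__(self):
        # `with ... as session:` binds the session object itself
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Stop the session when the block exits, including on exceptions
        self.stop()
        return False  # do not swallow exceptions raised in the block

with FakeSession() as session:
    assert not session.stopped  # the session is usable inside the block
# After the block, __exit__ has stopped the session automatically
```

Returning False from __exit__ lets exceptions propagate normally while still guaranteeing cleanup, which matches the usual contract of resource-owning context managers.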
[jira] [Comment Edited] (SPARK-15801) spark-submit --num-executors switch also works without YARN
[ https://issues.apache.org/jira/browse/SPARK-15801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318185#comment-15318185 ] Jonathan Taws edited comment on SPARK-15801 at 6/7/16 10:13 AM: It is mandatory to add the --executor-cores option for it to work; I added the exact command in the description. was (Author: jonathantaws): It is mandatory to add the --executor-cores option for it to work, will add the exact command in the description. > spark-submit --num-executors switch also works without YARN > --- > > Key: SPARK-15801 > URL: https://issues.apache.org/jira/browse/SPARK-15801 > Project: Spark > Issue Type: Documentation > Components: Spark Submit >Affects Versions: 1.6.1 >Reporter: Jonathan Taws >Priority: Minor > > Based on this [issue|https://issues.apache.org/jira/browse/SPARK-15781] > regarding the SPARK_WORKER_INSTANCES property, I also found that the > {{--num-executors}} switch documented in the spark-submit help is partially > incorrect. > Here's one part of the output (produced by {{spark-submit --help}}): > {code} > YARN-only: > --driver-cores NUM Number of cores used by the driver, only in > cluster mode > (Default: 1). > --queue QUEUE_NAME The YARN queue to submit to (Default: > "default"). > --num-executors NUM Number of executors to launch (Default: 2). > {code} > Correct me if I am wrong, but the num-executors switch also works in Spark > standalone mode *without YARN*. > I tried launching only a master and a worker, with 4 executors specified, > and they were all successfully spawned. The master switch pointed to the > master's URL, not to the yarn value. > Here's the exact command: {{spark-submit --master spark://[local > machine]:7077 --num-executors 4 --executor-cores 2}} > By default it is *1* executor per worker in Spark standalone mode without > YARN, but this option makes it possible to specify the number of executors (per worker > ?) if, and only if, the {{--executor-cores}} switch is also set. 
I do believe > it defaults to 2 in YARN mode. > I would propose to move the option from the *YARN-only* section to the *Spark > standalone and YARN only* section. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org