[jira] [Updated] (SPARK-6942) Umbrella: UI Visualizations for Core and Dataframes
[ https://issues.apache.org/jira/browse/SPARK-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6942: --- Component/s: Web UI Umbrella: UI Visualizations for Core and Dataframes Key: SPARK-6942 URL: https://issues.apache.org/jira/browse/SPARK-6942 Project: Spark Issue Type: Umbrella Components: Spark Core, SQL, Web UI Reporter: Patrick Wendell Assignee: Patrick Wendell This is an umbrella issue for the assorted visualization proposals for Spark's UI. The scope will likely cover Spark 1.4 and 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6942) Umbrella: UI Visualizations for Core and Dataframes
Patrick Wendell created SPARK-6942: -- Summary: Umbrella: UI Visualizations for Core and Dataframes Key: SPARK-6942 URL: https://issues.apache.org/jira/browse/SPARK-6942 Project: Spark Issue Type: Umbrella Components: Spark Core, SQL Reporter: Patrick Wendell Assignee: Patrick Wendell This is an umbrella issue for the assorted visualization proposals for Spark's UI. The scope will likely cover Spark 1.4 and 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature
[ https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3468: --- Issue Type: Sub-task (was: New Feature) Parent: SPARK-6942 WebUI Timeline-View feature --- Key: SPARK-3468 URL: https://issues.apache.org/jira/browse/SPARK-3468 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta Attachments: ApplicationTimeliView.png, JobTimelineView.png, TaskAssignmentTimelineView.png I sometimes troubleshoot and analyse the causes of long-running jobs. First I find the stages that take a long time or fail, then the tasks that take a long time or fail, and then I analyse the proportion of time spent in each phase of a task. In other cases, I find executors that take a long time to run a task and analyse the details of that task. In these situations, it would be helpful to visualize a timeline view of stages / tasks / executors and the breakdown of activity within each task. I'm developing prototypes like the captures I attached, and I'll integrate these viewers into the WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6943) Graphically show RDD's included in a stage
Patrick Wendell created SPARK-6943: -- Summary: Graphically show RDD's included in a stage Key: SPARK-6943 URL: https://issues.apache.org/jira/browse/SPARK-6943 Project: Spark Issue Type: Sub-task Reporter: Patrick Wendell Assignee: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3468) Provide timeline view in Job and Stage pages
[ https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3468: --- Summary: Provide timeline view in Job and Stage pages (was: WebUI Timeline-View feature) Provide timeline view in Job and Stage pages Key: SPARK-3468 URL: https://issues.apache.org/jira/browse/SPARK-3468 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta Attachments: ApplicationTimeliView.png, JobTimelineView.png, TaskAssignmentTimelineView.png I sometimes troubleshoot and analyse the causes of long-running jobs. First I find the stages that take a long time or fail, then the tasks that take a long time or fail, and then I analyse the proportion of time spent in each phase of a task. In other cases, I find executors that take a long time to run a task and analyse the details of that task. In these situations, it would be helpful to visualize a timeline view of stages / tasks / executors and the breakdown of activity within each task. I'm developing prototypes like the captures I attached, and I'll integrate these viewers into the WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3468) Provide timeline view in Job and Stage UI pages
[ https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3468: --- Summary: Provide timeline view in Job and Stage UI pages (was: Provide timeline view in Job and Stage pages) Provide timeline view in Job and Stage UI pages --- Key: SPARK-3468 URL: https://issues.apache.org/jira/browse/SPARK-3468 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta Attachments: ApplicationTimeliView.png, JobTimelineView.png, TaskAssignmentTimelineView.png I sometimes troubleshoot and analyse the causes of long-running jobs. First I find the stages that take a long time or fail, then the tasks that take a long time or fail, and then I analyse the proportion of time spent in each phase of a task. In other cases, I find executors that take a long time to run a task and analyse the details of that task. In these situations, it would be helpful to visualize a timeline view of stages / tasks / executors and the breakdown of activity within each task. I'm developing prototypes like the captures I attached, and I'll integrate these viewers into the WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed
[ https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6950: --- Component/s: Web UI Spark master UI believes some applications are in progress when they are actually completed --- Key: SPARK-6950 URL: https://issues.apache.org/jira/browse/SPARK-6950 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Matt Cheah In Spark 1.2.x, I was able to set my spark event log directory to be a different location from the default, and after the job finishes, I can replay the UI by clicking on the appropriate link under Completed Applications. Now, on a non-deterministic basis (but seems to happen most of the time), when I click on the link under Completed Applications, I instead get a webpage that says: Application history not found (app-20150415052927-0014) Application myApp is still in progress. I am able to view the application's UI using the Spark history server, so something regressed in the Spark master code between 1.2 and 1.3, but that regression does not apply in the history server use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
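For reference, the event-log setup the reporter describes can be expressed directly on a SparkConf. This is a minimal sketch only: the application name and HDFS path are hypothetical, and the two keys shown (spark.eventLog.enabled, spark.eventLog.dir) are the settings that control where the replayable logs are written.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: write event logs to a non-default directory so a finished
// application can be replayed from the master UI or the history server.
// "hdfs:///user/spark/applicationHistory" is a placeholder path.
val conf = new SparkConf()
  .setAppName("myApp")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///user/spark/applicationHistory")

val sc = new SparkContext(conf)
// ... run the job ...
sc.stop() // a clean stop finalizes the event log; otherwise it can stay flagged as in progress
{code}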
Re: [VOTE] Release Apache Spark 1.2.2
I'd like to close this vote to coincide with the 1.3.1 release; however, it would be great to have more people test this release first. I'll leave it open for a bit longer and see if others can give a +1. On Tue, Apr 14, 2015 at 9:55 PM, Patrick Wendell pwend...@gmail.com wrote: +1 from me as well. On Tue, Apr 7, 2015 at 4:36 AM, Sean Owen so...@cloudera.com wrote: I think that's close enough for a +1: Signatures and hashes are good. LICENSE, NOTICE still check out. Compiles for a Hadoop 2.6 + YARN + Hive profile. JIRAs with target version = 1.2.x look legitimate; no blockers. I still observe several Hive test failures with: mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0 -DskipTests clean package; mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0 test .. though again I think these are not regressions but known issues in older branches. FYI there are 16 Critical issues still open for 1.2.x: SPARK-6209,ExecutorClassLoader can leak connections after failing to load classes from the REPL class server,Josh Rosen,In Progress,4/5/15 SPARK-5098,Number of running tasks become negative after tasks lost,,Open,1/14/15 SPARK-4888,Spark EC2 doesn't mount local disks for i2.8xlarge instances,,Open,1/27/15 SPARK-4879,Missing output partitions after job completes with speculative execution,Josh Rosen,Open,3/5/15 SPARK-4568,Publish release candidates under $VERSION-RCX instead of $VERSION,Patrick Wendell,Open,11/24/14 SPARK-4520,SparkSQL exception when reading certain columns from a parquet file,sadhan sood,Open,1/21/15 SPARK-4514,SparkContext localProperties does not inherit property updates across thread reuse,Josh Rosen,Open,3/31/15 SPARK-4454,Race condition in DAGScheduler,Josh Rosen,Reopened,2/18/15 SPARK-4452,Shuffle data structures can starve others on the same thread for memory,Tianshuo Deng,Open,1/24/15 SPARK-4356,Test Scala 2.11 on Jenkins,Patrick Wendell,Open,11/12/14 SPARK-4258,NPE with new Parquet Filters,Cheng Lian,Reopened,4/3/15 SPARK-4194,Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state,,In Progress,4/2/15 SPARK-4159,Maven build doesn't run JUnit test suites,Sean Owen,Open,1/11/15 SPARK-4106,Shuffle write and spill to disk metrics are incorrect,,Open,10/28/14 SPARK-3492,Clean up Yarn integration code,Andrew Or,Open,9/12/14 SPARK-3461,Support external groupByKey using repartitionAndSortWithinPartitions,Sandy Ryza,Open,11/10/14 SPARK-2984,FileNotFoundException on _temporary directory,,Open,12/11/14 SPARK-2532,Fix issues with consolidated shuffle,,Open,3/26/15 SPARK-1312,Batch should read based on the batch interval provided in the StreamingContext,Tathagata Das,Open,12/24/14 On Sun, Apr 5, 2015 at 7:24 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.2.2! The tag to be voted on is v1.2.2-rc1 (commit 7531b50): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7531b50e406ee2e3301b009ceea7c684272b2e27 The list of fixes present in this release can be found at: http://bit.ly/1DCNddt The release files, including signatures, digests, etc. 
can be found at: http://people.apache.org/~pwendell/spark-1.2.2-rc1/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1082/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.2.2-rc1-docs/ Please vote on releasing this package as Apache Spark 1.2.2! The vote is open until Thursday, April 08, at 00:30 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.2.2 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.3.1 (RC3)
+1 from myself as well On Mon, Apr 13, 2015 at 8:35 PM, GuoQiang Li wi...@qq.com wrote: +1 (non-binding) -- Original -- From: Patrick Wendell;pwend...@gmail.com; Date: Sat, Apr 11, 2015 02:05 PM To: dev@spark.apache.orgdev@spark.apache.org; Subject: [VOTE] Release Apache Spark 1.3.1 (RC3) Please vote on releasing the following candidate as Apache Spark version 1.3.1! The tag to be voted on is v1.3.1-rc2 (commit 3e83913): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e8391327ba586eaf54447043bd526d919043a44 The list of fixes present in this release can be found at: http://bit.ly/1C2nVPY The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc3/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1088/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc3-docs/ The patches on top of RC2 are: [SPARK-6851] [SQL] Create new instance for each converted parquet relation [SPARK-5969] [PySpark] Fix descending pyspark.rdd.sortByKey. [SPARK-6343] Doc driver-worker network reqs [SPARK-6767] [SQL] Fixed Query DSL error in spark sql Readme [SPARK-6781] [SQL] use sqlContext in python shell [SPARK-6753] Clone SparkConf in ShuffleSuite tests [SPARK-6506] [PySpark] Do not try to retrieve SPARK_HOME when not needed... Please vote on releasing this package as Apache Spark 1.3.1! The vote is open until Tuesday, April 14, at 07:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.3.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[RESULT] [VOTE] Release Apache Spark 1.3.1 (RC3)
This vote passes with 10 +1 votes (5 binding) and no 0 or -1 votes. +1: Sean Owen* Reynold Xin* Krishna Sankar Denny Lee Mark Hamstra* Sean McNamara* Sree V Marcelo Vanzin GuoQiang Li Patrick Wendell* 0: -1: I will work on packaging this release in the next 48 hours. - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.2.2
+1 from me as well. On Tue, Apr 7, 2015 at 4:36 AM, Sean Owen so...@cloudera.com wrote: I think that's close enough for a +1: Signatures and hashes are good. LICENSE, NOTICE still check out. Compiles for a Hadoop 2.6 + YARN + Hive profile. JIRAs with target version = 1.2.x look legitimate; no blockers. I still observe several Hive test failures with: mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0 -DskipTests clean package; mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0 test .. though again I think these are not regressions but known issues in older branches. FYI there are 16 Critical issues still open for 1.2.x: SPARK-6209,ExecutorClassLoader can leak connections after failing to load classes from the REPL class server,Josh Rosen,In Progress,4/5/15 SPARK-5098,Number of running tasks become negative after tasks lost,,Open,1/14/15 SPARK-4888,Spark EC2 doesn't mount local disks for i2.8xlarge instances,,Open,1/27/15 SPARK-4879,Missing output partitions after job completes with speculative execution,Josh Rosen,Open,3/5/15 SPARK-4568,Publish release candidates under $VERSION-RCX instead of $VERSION,Patrick Wendell,Open,11/24/14 SPARK-4520,SparkSQL exception when reading certain columns from a parquet file,sadhan sood,Open,1/21/15 SPARK-4514,SparkContext localProperties does not inherit property updates across thread reuse,Josh Rosen,Open,3/31/15 SPARK-4454,Race condition in DAGScheduler,Josh Rosen,Reopened,2/18/15 SPARK-4452,Shuffle data structures can starve others on the same thread for memory,Tianshuo Deng,Open,1/24/15 SPARK-4356,Test Scala 2.11 on Jenkins,Patrick Wendell,Open,11/12/14 SPARK-4258,NPE with new Parquet Filters,Cheng Lian,Reopened,4/3/15 SPARK-4194,Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state,,In Progress,4/2/15 SPARK-4159,Maven build doesn't run JUnit test suites,Sean Owen,Open,1/11/15 SPARK-4106,Shuffle write and spill to disk metrics are incorrect,,Open,10/28/14 SPARK-3492,Clean up Yarn integration code,Andrew Or,Open,9/12/14 SPARK-3461,Support external groupByKey using repartitionAndSortWithinPartitions,Sandy Ryza,Open,11/10/14 SPARK-2984,FileNotFoundException on _temporary directory,,Open,12/11/14 SPARK-2532,Fix issues with consolidated shuffle,,Open,3/26/15 SPARK-1312,Batch should read based on the batch interval provided in the StreamingContext,Tathagata Das,Open,12/24/14 On Sun, Apr 5, 2015 at 7:24 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.2.2! The tag to be voted on is v1.2.2-rc1 (commit 7531b50): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7531b50e406ee2e3301b009ceea7c684272b2e27 The list of fixes present in this release can be found at: http://bit.ly/1DCNddt The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.2.2-rc1/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1082/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.2.2-rc1-docs/ Please vote on releasing this package as Apache Spark 1.2.2! The vote is open until Thursday, April 08, at 00:30 UTC and passes if a majority of at least 3 +1 PMC votes are cast. 
[ ] +1 Release this package as Apache Spark 1.2.2 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492888#comment-14492888 ] Patrick Wendell commented on SPARK-6703: Hey [~ilganeli] - sure thing. I've pinged a couple of people to provide feedback on the design. Overall I think it won't be a complicated feature to implement. I've added you as the assignee. One note: if it gets very close to the 1.4 code freeze I may need to help take it across the finish line. But for now why don't you go ahead; I think we'll be fine. Provide a way to discover existing SparkContext's - Key: SPARK-6703 URL: https://issues.apache.org/jira/browse/SPARK-6703 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.3.0 Reporter: Patrick Wendell Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc., where there is a shared SparkContext. It would be nice to provide a rendezvous point so that applications can learn whether a SparkContext already exists before creating one. The simplest/most surgical way I see to do this is to have an optional static SparkContext singleton that people can retrieve as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} You could also have a setter so that some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyway, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
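To make the proposal concrete, here is a minimal sketch of what such a rendezvous point could look like. SharedSparkContext is a hypothetical helper written for illustration only; it is not Spark's eventual implementation of this feature.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper illustrating the proposed singleton.
object SharedSparkContext {
  @volatile private var active: Option[SparkContext] = None

  // Return the registered context if one exists, otherwise create one from the given conf.
  def getOrCreate(conf: SparkConf): SparkContext = synchronized {
    active.getOrElse {
      val sc = new SparkContext(conf)
      active = Some(sc)
      sc
    }
  }

  // Setter so an outer framework (job server, notebook server) can share its context
  // with downstream applications.
  def setActive(sc: SparkContext): Unit = synchronized { active = Some(sc) }
}

// Usage, mirroring the snippet in the issue description:
// val sc = SharedSparkContext.getOrCreate(new SparkConf().setAppName("composed-app"))
{code}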
[jira] [Updated] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6703: --- Assignee: Ilya Ganelin Provide a way to discover existing SparkContext's - Key: SPARK-6703 URL: https://issues.apache.org/jira/browse/SPARK-6703 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Ilya Ganelin Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc., where there is a shared SparkContext. It would be nice to provide a rendezvous point so that applications can learn whether a SparkContext already exists before creating one. The simplest/most surgical way I see to do this is to have an optional static SparkContext singleton that people can retrieve as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} You could also have a setter so that some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyway, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6703: --- Priority: Critical (was: Major) Provide a way to discover existing SparkContext's - Key: SPARK-6703 URL: https://issues.apache.org/jira/browse/SPARK-6703 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Ilya Ganelin Priority: Critical Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc., where there is a shared SparkContext. It would be nice to provide a rendezvous point so that applications can learn whether a SparkContext already exists before creating one. The simplest/most surgical way I see to do this is to have an optional static SparkContext singleton that people can retrieve as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} You could also have a setter so that some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyway, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: Configuring amount of disk space available to spark executors in mesos?
Hey Jonathan, Are you referring to disk space used for storing persisted RDD's? For that, Spark does not bound the amount of data persisted to disk. It's a similar story to how Spark's shuffle disk output works (and also Hadoop and other frameworks make this assumption as well for their shuffle data, AFAIK). We could (in theory) add a storage level that bounds the amount of data persisted to disk and forces re-computation if the partition did not fit. I'd be interested to hear more about a workload where that's relevant though, before going that route. Maybe if people are using SSD's that would make sense. - Patrick On Mon, Apr 13, 2015 at 8:19 AM, Jonathan Coveney jcove...@gmail.com wrote: I'm surprised that I haven't been able to find this via google, but I haven't... What is the setting that requests some amount of disk space for the executors? Maybe I'm misunderstanding how this is configured... Thanks for any help! - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
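As a point of reference, the disk-backed persistence being discussed looks like the sketch below. It assumes a spark-shell-style session where sc is already defined, and the input path is hypothetical; DISK_ONLY storage is exactly the case where Spark currently places no bound on the space used.
{code}
import org.apache.spark.storage.StorageLevel

// Minimal sketch: persist an RDD to local disk. Spark does not cap how much
// space the persisted blocks may take, which is the behavior described above.
val lines = sc.textFile("hdfs:///data/events")   // hypothetical input path
val parsed = lines.map(_.split("\t"))
parsed.persist(StorageLevel.DISK_ONLY)           // spill every computed partition to disk
parsed.count()                                   // materializes the persisted blocks
{code}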
[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492898#comment-14492898 ] Patrick Wendell commented on SPARK-6703: /cc [~velvia] Provide a way to discover existing SparkContext's - Key: SPARK-6703 URL: https://issues.apache.org/jira/browse/SPARK-6703 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Ilya Ganelin Priority: Critical Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc., where there is a shared SparkContext. It would be nice to provide a rendezvous point so that applications can learn whether a SparkContext already exists before creating one. The simplest/most surgical way I see to do this is to have an optional static SparkContext singleton that people can retrieve as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} You could also have a setter so that some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyway, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6511) Publish hadoop provided build with instructions for different distros
[ https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493183#comment-14493183 ] Patrick Wendell edited comment on SPARK-6511 at 4/13/15 10:11 PM: -- Just as an example I tried to wire Spark to work with stock Hadoop 2.6. Here is how I got it running after doing a hadoop-provided build. This is pretty clunky, so I wonder if we should just support setting HADOOP_HOME or something and we can automatically find and add the jar files present within that folder. {code} export SPARK_DIST_CLASSPATH=$(find /tmp/hadoop-2.6.0/ -name "*.jar" | tr "\n" ":") ./bin/spark-shell {code} [~vanzin] for your CDH packages, what do you end up setting SPARK_DIST_CLASSPATH to? /cc [~srowen] was (Author: pwendell): Just as an example I tried to wire Spark to work with stock Hadoop 2.6. Here is how I got it running after doing a hadoop-provided build. This is pretty clunky, so I wonder if we should just support setting HADOOP_HOME or something and we can automatically find and add the jar files present within that folder. {code} export SPARK_DIST_CLASSPATH=$(find /tmp/hadoop-2.6.0/ -name "*.jar" | tr "\n" ":") ./bin/spark-shell {code} [~vanzin] for your CDH packages, what do you end up setting SPARK_DIST_CLASSPATH to? Publish hadoop provided build with instructions for different distros --- Key: SPARK-6511 URL: https://issues.apache.org/jira/browse/SPARK-6511 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Currently we publish a series of binaries with different Hadoop client jars. This mostly works, but some users have reported compatibility issues with different distributions. One improvement moving forward might be to publish a binary build that simply asks you to set HADOOP_HOME to pick up the Hadoop client location. That way it would work across multiple distributions, even if they have subtle incompatibilities with upstream Hadoop. I think a first step for this would be to produce such a build for the community and see how well it works. One potential issue is that our fancy excludes and dependency re-writing won't work with the simpler append Hadoop's classpath to Spark. Also, how we deal with the Hive dependency is unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes for dependency conflicts) or do we allow for linking against vanilla Hive at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493179#comment-14493179 ] Patrick Wendell commented on SPARK-6703: Yes, ideally we get it into 1.4 - though I think the ultimate solution here could be a very small patch. Provide a way to discover existing SparkContext's - Key: SPARK-6703 URL: https://issues.apache.org/jira/browse/SPARK-6703 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Ilya Ganelin Priority: Critical Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc., where there is a shared SparkContext. It would be nice to provide a rendezvous point so that applications can learn whether a SparkContext already exists before creating one. The simplest/most surgical way I see to do this is to have an optional static SparkContext singleton that people can retrieve as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} You could also have a setter so that some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyway, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros
[ https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493183#comment-14493183 ] Patrick Wendell commented on SPARK-6511: Just as an example I tried to wire Spark to work with stock Hadoop 2.6. Here is how I got it running after doing a hadoop-provided build. This is pretty clunky, so I wonder if we should just support setting HADOOP_HOME or something and we can automatically find and add the jar files present within that folder. {code} export SPARK_DIST_CLASSPATH=$(find /tmp/hadoop-2.6.0/ -name "*.jar" | tr "\n" ":") ./bin/spark-shell {code} [~vanzin] for your CDH packages, what do you end up setting SPARK_DIST_CLASSPATH to? Publish hadoop provided build with instructions for different distros --- Key: SPARK-6511 URL: https://issues.apache.org/jira/browse/SPARK-6511 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Currently we publish a series of binaries with different Hadoop client jars. This mostly works, but some users have reported compatibility issues with different distributions. One improvement moving forward might be to publish a binary build that simply asks you to set HADOOP_HOME to pick up the Hadoop client location. That way it would work across multiple distributions, even if they have subtle incompatibilities with upstream Hadoop. I think a first step for this would be to produce such a build for the community and see how well it works. One potential issue is that our fancy excludes and dependency re-writing won't work with the simpler append Hadoop's classpath to Spark. Also, how we deal with the Hive dependency is unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes for dependency conflicts) or do we allow for linking against vanilla Hive at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros
[ https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493274#comment-14493274 ] Patrick Wendell commented on SPARK-6511: Can we just run HADOOP_HOME/bin/hadoop classpath and then capture the result? I'm wondering if there is a standard interface here we can expect most Hadoop distributions to have. Publish hadoop provided build with instructions for different distros --- Key: SPARK-6511 URL: https://issues.apache.org/jira/browse/SPARK-6511 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Currently we publish a series of binaries with different Hadoop client jars. This mostly works, but some users have reported compatibility issues with different distributions. One improvement moving forward might be to publish a binary build that simply asks you to set HADOOP_HOME to pick up the Hadoop client location. That way it would work across multiple distributions, even if they have subtle incompatibilities with upstream Hadoop. I think a first step for this would be to produce such a build for the community and see how well it works. One potential issue is that our fancy excludes and dependency re-writing won't work with the simpler append Hadoop's classpath to Spark. Also, how we deal with the Hive dependency is unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes for dependency conflicts) or do we allow for linking against vanilla Hive at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
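A rough sketch of that idea follows. It assumes HADOOP_HOME points at a distribution whose bin/hadoop supports the standard classpath subcommand; the launcher-side wiring is simplified for illustration and only prints the export line a hadoop-provided build would need.
{code}
import scala.sys.process._

// Ask the Hadoop launcher script for its classpath and print the export line
// that would be set in the environment before starting bin/spark-shell.
val hadoopHome = sys.env.getOrElse("HADOOP_HOME", sys.error("HADOOP_HOME is not set"))
val hadoopClasspath = Seq(s"$hadoopHome/bin/hadoop", "classpath").!!.trim
println(s"export SPARK_DIST_CLASSPATH=$hadoopClasspath")
{code}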
[jira] [Commented] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules
[ https://issues.apache.org/jira/browse/SPARK-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493254#comment-14493254 ] Patrick Wendell commented on SPARK-6889: Thanks for posting this Sean. Overall, I think this is a big improvement. Some comments on the proposed JIRA workflow changes: 1. I think logically Affects Version/s is required only for bugs, right? Is there a well defined meaning for Affects Version/s for a new feature that is distinct from Target Version/s? 2. I am not sure you can restrict certain priority levels to certain roles, but if so that would be really nice. Streamline contribution process with update to Contribution wiki, JIRA rules Key: SPARK-6889 URL: https://issues.apache.org/jira/browse/SPARK-6889 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Sean Owen Assignee: Sean Owen Attachments: ContributingtoSpark.pdf, SparkProjectMechanicsChallenges.pdf From about 6 months of intimate experience with the Spark JIRA and the reality of the JIRA / PR flow, I've observed some challenges, problems and growing pains that have begun to encumber the project mechanics. In the attached SparkProjectMechanicsChallenges.pdf document, I've collected these observations and a few statistics that summarize much of what I've seen. From side conversations with several of you, I think some of these will resonate. (Read it first for this to make sense.) I'd like to improve just one aspect to start: the contribution process. A lot of inbound contribution effort gets misdirected, and can burn a lot of cycles for everyone, and that's a barrier to scaling up further and to general happiness. I'd like to propose for discussion a change to the wiki pages, and a change to some JIRA settings. *Wiki* - Replace https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark with proposed text (NewContributingToSpark.pdf) - Delete https://cwiki.apache.org/confluence/display/SPARK/Reviewing+and+Merging+Patches as it is subsumed by the new text - Move the IDE Setup section to https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools - Delete https://cwiki.apache.org/confluence/display/SPARK/Jira+Permissions+Scheme as it's a bit out of date and not all that useful *JIRA* Now: Start by removing everyone from the 'Developer' role and add them to 'Contributor'. Right now Developer has no permission that Contributor doesn't. We may reuse Developer later for some level between Committer and Contributor. Later, with Apache admin assistance: - Make Component and Affects Version required for new JIRAs - Set default priority to Minor and type to Question for new JIRAs. If defaults aren't changed, by default it can't be that important - Only let Committers set Target Version and Fix Version - Only let Committers set Blocker Priority -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6199) Support CTE
[ https://issues.apache.org/jira/browse/SPARK-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6199: --- Assignee: (was: Cheng Hao) Support CTE --- Key: SPARK-6199 URL: https://issues.apache.org/jira/browse/SPARK-6199 Project: Spark Issue Type: Improvement Components: SQL Reporter: haiyang Fix For: 1.4.0 Support CTE in SQLContext and HiveContext -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6858) Register Java HashMap for SparkSqlSerializer
[ https://issues.apache.org/jira/browse/SPARK-6858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6858: --- Assignee: Liang-Chi Hsieh Register Java HashMap for SparkSqlSerializer Key: SPARK-6858 URL: https://issues.apache.org/jira/browse/SPARK-6858 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Trivial Fix For: 1.4.0 Since now kyro serializer is used for {{GeneralHashedRelation}} whether kyro is enabled or not, it is better to register Java {{HashMap}} in {{SparkSqlSerializer}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4760) ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files
[ https://issues.apache.org/jira/browse/SPARK-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4760. Resolution: Not A Problem ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files -- Key: SPARK-4760 URL: https://issues.apache.org/jira/browse/SPARK-4760 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Priority: Critical Fix For: 1.3.0 In an older Spark version built around Oct. 12, I was able to use ANALYZE TABLE table COMPUTE STATISTICS noscan to get an estimated table size, which is important for optimizing joins. (I'm joining 15 small dimension tables, and this is crucial to me). In the more recent Spark builds, it fails to estimate the table size unless I remove noscan. Here are the statistics I got using DESC EXTENDED: old: parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166} new: parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1} And I've tried turning off spark.sql.hive.convertMetastoreParquet in my spark-defaults.conf and the result is unaffected (in both versions). Looks like the Parquet support in new Hive (0.13.1) is broken? Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
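For readers who want to reproduce the report, the statements in question can be issued through a HiveContext; this is a minimal sketch where my_table is a hypothetical table name and sc is an existing SparkContext.
{code}
import org.apache.spark.sql.hive.HiveContext

// Minimal sketch: compute table statistics without scanning the data,
// then inspect what the metastore recorded (as the reporter did).
val hiveContext = new HiveContext(sc)
hiveContext.sql("ANALYZE TABLE my_table COMPUTE STATISTICS noscan")
hiveContext.sql("DESC EXTENDED my_table").collect().foreach(println)
{code}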
[jira] [Updated] (SPARK-6611) Add support for INTEGER as synonym of INT to DDLParser
[ https://issues.apache.org/jira/browse/SPARK-6611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6611: --- Assignee: Santiago M. Mola Add support for INTEGER as synonym of INT to DDLParser -- Key: SPARK-6611 URL: https://issues.apache.org/jira/browse/SPARK-6611 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Assignee: Santiago M. Mola Priority: Minor Fix For: 1.4.0 Add support for INTEGER as synonym of INT to DDLParser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491863#comment-14491863 ] Patrick Wendell commented on SPARK-1529: Hey Kannan, We originally considered doing something like you are proposing, where we would change our filesystem interactions to all use a Hadoop FileSystem class and then we'd use Hadoop's LocalFileSystem. However, there were two issues: 1. We used POSIX APIs that are not present in Hadoop. For instance, we use memory mapping in various places, FileChannel in the BlockObjectWriter, etc. 2. Using LocalFileSystem has substantial performance overhead compared with our current code, so we'd have to write our own implementation of a local filesystem. For this reason, we decided that our current shuffle machinery was fundamentally not usable for non-POSIX environments. So we decided that instead, we'd let people customize shuffle behavior at a higher level, and we implemented the pluggable shuffle components. So you can create a shuffle manager that is specifically optimized for a particular environment (e.g. MapR). Did you consider implementing a MapR shuffle using that mechanism instead? You'd have to operate at a higher level, where you reason about shuffle records, etc., but you'd have a lot of flexibility to optimize within that. Support setting spark.local.dirs to a hadoop FileSystem Key: SPARK-1529 URL: https://issues.apache.org/jira/browse/SPARK-1529 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Kannan Rajah Attachments: Spark Shuffle using HDFS.pdf In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
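For context, the pluggable shuffle mechanism referred to above is selected through configuration; a minimal sketch follows, where org.example.MapRShuffleManager is a hypothetical class name standing in for a vendor-specific ShuffleManager implementation.
{code}
import org.apache.spark.SparkConf

// Sketch only: spark.shuffle.manager accepts either a built-in short name
// ("hash", "sort") or a fully qualified ShuffleManager class. The class named
// below is hypothetical.
val conf = new SparkConf()
  .setAppName("custom-shuffle-example")
  .set("spark.shuffle.manager", "org.example.MapRShuffleManager")
{code}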
[jira] [Reopened] (SPARK-4760) ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files
[ https://issues.apache.org/jira/browse/SPARK-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-4760: ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files -- Key: SPARK-4760 URL: https://issues.apache.org/jira/browse/SPARK-4760 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Priority: Critical Fix For: 1.3.0 In an older Spark version built around Oct. 12, I was able to use ANALYZE TABLE table COMPUTE STATISTICS noscan to get an estimated table size, which is important for optimizing joins. (I'm joining 15 small dimension tables, and this is crucial to me). In the more recent Spark builds, it fails to estimate the table size unless I remove noscan. Here are the statistics I got using DESC EXTENDED: old: parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166} new: parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1} And I've tried turning off spark.sql.hive.convertMetastoreParquet in my spark-defaults.conf and the result is unaffected (in both versions). Looks like the Parquet support in new Hive (0.13.1) is broken? Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6179) Support SHOW PRINCIPALS role_name;
[ https://issues.apache.org/jira/browse/SPARK-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6179: --- Assignee: Zhongshuai Pei Support SHOW PRINCIPALS role_name; Key: SPARK-6179 URL: https://issues.apache.org/jira/browse/SPARK-6179 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1 Reporter: Zhongshuai Pei Assignee: Zhongshuai Pei Fix For: 1.4.0 SHOW PRINCIPALS role_name; Lists all roles and users who belong to this role. Only the admin role has privilege for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6199) Support CTE
[ https://issues.apache.org/jira/browse/SPARK-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6199: --- Assignee: Cheng Hao Support CTE --- Key: SPARK-6199 URL: https://issues.apache.org/jira/browse/SPARK-6199 Project: Spark Issue Type: Improvement Components: SQL Reporter: haiyang Assignee: Cheng Hao Fix For: 1.4.0 Support CTE in SQLContext and HiveContext -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6863) Formatted list broken on Hive compatibility section of SQL programming guide
[ https://issues.apache.org/jira/browse/SPARK-6863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6863: --- Assignee: Santiago M. Mola Formatted list broken on Hive compatibility section of SQL programming guide Key: SPARK-6863 URL: https://issues.apache.org/jira/browse/SPARK-6863 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.3.0 Reporter: Santiago M. Mola Assignee: Santiago M. Mola Priority: Trivial Fix For: 1.3.1, 1.4.0 Formatted list broken on Hive compatibility section of SQL programming guide. It does not appear as a list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
Hey Denny, I believe the 2.4 bits are there. The 2.6 bits I had done specially (we haven't merged that into our upstream build script). I'll do it again now for RC2. - Patrick On Wed, Apr 8, 2015 at 1:53 PM, Timothy Chen tnac...@gmail.com wrote: +1 Tested on 4 nodes Mesos cluster with fine-grain and coarse-grain mode. Tim On Wed, Apr 8, 2015 at 9:32 AM, Denny Lee denny.g@gmail.com wrote: The RC2 bits are lacking Hadoop 2.4 and Hadoop 2.6 - was that intended (they were included in RC1)? On Wed, Apr 8, 2015 at 9:01 AM Tom Graves tgraves...@yahoo.com.invalid wrote: +1. Tested spark on yarn against hadoop 2.6. Tom On Wednesday, April 8, 2015 6:15 AM, Sean Owen so...@cloudera.com wrote: Still a +1 from me; same result (except that now of course the UISeleniumSuite test does not fail) On Wed, Apr 8, 2015 at 1:46 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.1! The tag to be voted on is v1.3.1-rc2 (commit 7c4473a): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7c4473aa5a7f5de0323394aaedeefbf9738e8eb5 The list of fixes present in this release can be found at: http://bit.ly/1C2nVPY The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1083/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/ The patches on top of RC1 are: [SPARK-6737] Fix memory leak in OutputCommitCoordinator https://github.com/apache/spark/pull/5397 [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py https://github.com/apache/spark/pull/5302 [SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError https://github.com/apache/spark/pull/4933 Please vote on releasing this package as Apache Spark 1.3.1! The vote is open until Saturday, April 11, at 07:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.3.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
Oh I see - ah okay I'm guessing it was a transient build error and I'll get it posted ASAP. On Wed, Apr 8, 2015 at 3:41 PM, Denny Lee denny.g@gmail.com wrote: Oh, it appears the 2.4 bits without hive are there but not the 2.4 bits with hive. Cool stuff on the 2.6. On Wed, Apr 8, 2015 at 12:30 Patrick Wendell pwend...@gmail.com wrote: Hey Denny, I believe the 2.4 bits are there. The 2.6 bits I had done specially (we haven't merged that into our upstream build script). I'll do it again now for RC2. - Patrick On Wed, Apr 8, 2015 at 1:53 PM, Timothy Chen tnac...@gmail.com wrote: +1 Tested on 4 nodes Mesos cluster with fine-grain and coarse-grain mode. Tim On Wed, Apr 8, 2015 at 9:32 AM, Denny Lee denny.g@gmail.com wrote: The RC2 bits are lacking Hadoop 2.4 and Hadoop 2.6 - was that intended (they were included in RC1)? On Wed, Apr 8, 2015 at 9:01 AM Tom Graves tgraves...@yahoo.com.invalid wrote: +1. Tested spark on yarn against hadoop 2.6. Tom On Wednesday, April 8, 2015 6:15 AM, Sean Owen so...@cloudera.com wrote: Still a +1 from me; same result (except that now of course the UISeleniumSuite test does not fail) On Wed, Apr 8, 2015 at 1:46 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.1! The tag to be voted on is v1.3.1-rc2 (commit 7c4473a): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7c4473aa5a7f5de0323394aaedeefbf9738e8eb5 The list of fixes present in this release can be found at: http://bit.ly/1C2nVPY The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1083/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/ The patches on top of RC1 are: [SPARK-6737] Fix memory leak in OutputCommitCoordinator https://github.com/apache/spark/pull/5397 [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py https://github.com/apache/spark/pull/5302 [SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError https://github.com/apache/spark/pull/4933 Please vote on releasing this package as Apache Spark 1.3.1! The vote is open until Saturday, April 11, at 07:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.3.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[jira] [Resolved] (SPARK-6792) pySpark groupByKey returns rows with the same key
[ https://issues.apache.org/jira/browse/SPARK-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-6792. Resolution: Not A Problem Resolving per Josh's comment. pySpark groupByKey returns rows with the same key - Key: SPARK-6792 URL: https://issues.apache.org/jira/browse/SPARK-6792 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0 Reporter: Charles Hayden Under some circumstances, pySpark groupByKey returns two or more rows with the same groupby key. It is not reproducible by a short example, but it can be seen in the following program. The preservesPartitioning argument is required to see the failure. I ran this with cluster_url=local[4], but I think it will also show up with cluster_url=local. = {noformat} # The RDD.groupByKey sometimes gives two results with the same key value. This is incorrect: all results with a single key need to be grouped together. # Report the spark version from pyspark import SparkContext import StringIO import csv sc = SparkContext() print sc.version def loadRecord(line): input = StringIO.StringIO(line) reader = csv.reader(input, delimiter='\t') return reader.next() # Read data from movielens dataset # This can be obtained from http://files.grouplens.org/datasets/movielens/ml-100k.zip inputFile = 'u.data' input = sc.textFile(inputFile) data = input.map(loadRecord) # Trim off unneeded fields data = data.map(lambda row: row[0:2]) print 'Data Sample' print data.take(10) # Use join to filter the data # # map bulds left key # map builds right key # join # map throws away the key and gets result # pick a couple of users j = sc.parallelize([789, 939]) # key left # conversion to str is required to show the error keyed_j = j.map(lambda row: (str(row), None)) # key right keyed_rdd = data.map(lambda row: (str(row[0]), row)) # join joined = keyed_rdd.join(keyed_j) # throw away key # preservesPartitioning is required to show the error res = joined.map(lambda row: row[1][0], preservesPartitioning=True) #res = joined.map(lambda row: row[1][0]) # no error print 'Filtered Sample' print res.take(10) #print res.count() # Do the groupby # There should be fewer rows keyed_rdd = res.map(lambda row: (row[1], row), preservesPartitioning=True) print 'Input Count', keyed_rdd.count() grouped_rdd = keyed_rdd.groupByKey() print 'Grouped Count', grouped_rdd.count() # There are two rows with the same key ! print 'Group Output Sample' print grouped_rdd.filter(lambda row: row[0] == '508').take(10) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6785) DateUtils cannot handle dates before 1970/01/01 correctly
[ https://issues.apache.org/jira/browse/SPARK-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6785: --- Component/s: SQL DateUtils cannot handle dates before 1970/01/01 correctly - Key: SPARK-6785 URL: https://issues.apache.org/jira/browse/SPARK-6785 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu {code}
scala> val d = new Date(100)
d: java.sql.Date = 1969-12-31

scala> DateUtils.toJavaDate(DateUtils.fromJavaDate(d))
res1: java.sql.Date = 1970-01-01
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
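[Editor's note] The round trip above mixes two issues: the local time zone offset applied by java.sql.Date and the day arithmetic inside DateUtils. The sketch below shows only the rounding half of the problem, assuming dates are stored as days since the epoch; the helper names are illustrative and are not the actual DateUtils code.
{code}
// Illustration only, not the DateUtils implementation.
// Truncating division rounds toward zero, so any instant before
// 1970-01-01 lands on the wrong day; flooring division does not.
val millisPerDay = 86400000L

def naiveMillisToDays(millis: Long): Int =
  (millis / millisPerDay).toInt                    // wrong for millis < 0

def flooredMillisToDays(millis: Long): Int = {
  val days =
    if (millis >= 0) millis / millisPerDay
    else (millis - millisPerDay + 1) / millisPerDay
  days.toInt
}

naiveMillisToDays(-1)    // 0  -> reads back as 1970-01-01 (the bug above)
flooredMillisToDays(-1)  // -1 -> reads back as 1969-12-31
{code}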
[jira] [Resolved] (SPARK-6778) SQL contexts in spark-shell and pyspark should both be called sqlContext
[ https://issues.apache.org/jira/browse/SPARK-6778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-6778. Resolution: Duplicate SQL contexts in spark-shell and pyspark should both be called sqlContext Key: SPARK-6778 URL: https://issues.apache.org/jira/browse/SPARK-6778 Project: Spark Issue Type: Bug Components: PySpark, Spark Shell Reporter: Matei Zaharia For some reason the Python one is only called sqlCtx. This is pretty confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6399) Code compiled against 1.3.0 may not run against older Spark versions
[ https://issues.apache.org/jira/browse/SPARK-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486595#comment-14486595 ] Patrick Wendell commented on SPARK-6399: It would be good to document more clearly what compatibility we intend to provide. I am not so sure that forward compatibility is a stated or necessary goal for binary interfaces. I think we should just provide backwards compatibility for those interfaces (though in practice these will almost always be the same except for some issues like this with implicits). The main area we've had really strict enforcement of forward compatibility has been around the serialization format of JSON logs, since we want it to be easy for people to use the Spark history server with newer versions of Spark in a multi-tenant cluster. Code compiled against 1.3.0 may not run against older Spark versions Key: SPARK-6399 URL: https://issues.apache.org/jira/browse/SPARK-6399 Project: Spark Issue Type: Bug Components: Documentation, Spark Core Affects Versions: 1.3.0 Reporter: Marcelo Vanzin Commit 65b987c3 re-organized the implicit conversions of RDDs so that they're easier to use. The problem is that scalac now generates code that will not run on older Spark versions if those conversions are used. Basically, even if you explicitly import {{SparkContext._}}, scalac will generate references to the new methods in the {{RDD}} object instead. So the compiled code will reference code that doesn't exist in older versions of Spark. You can work around this by explicitly calling the methods in the {{SparkContext}} object, although that's a little ugly. We should at least document this limitation (if there's no way to fix it), since I believe forwards compatibility in the API was also a goal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
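[Editor's note] To make the workaround mentioned in the description concrete, here is a sketch that assumes a SparkContext named sc is already in scope. Naming the conversion on the SparkContext object avoids the compiled reference to the RDD companion object that older releases lack; whether the extra verbosity is acceptable is the "a little ugly" trade-off noted above.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // what a pre-1.3 codebase would already import
import org.apache.spark.rdd.RDD

val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2)))

// Compiled against 1.3.0, this implicit enrichment resolves to the new
// conversions on the RDD companion object, which older releases do not have.
val viaImplicit = pairs.reduceByKey(_ + _)

// Workaround: call the conversion on the SparkContext object explicitly.
val viaExplicit = SparkContext.rddToPairRDDFunctions(pairs).reduceByKey(_ + _)
{code}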
[jira] [Updated] (SPARK-6784) Clean up all the inbound/outbound conversions for DateType
[ https://issues.apache.org/jira/browse/SPARK-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6784: --- Component/s: SQL Clean up all the inbound/outbound conversions for DateType -- Key: SPARK-6784 URL: https://issues.apache.org/jira/browse/SPARK-6784 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Priority: Blocker We had changed the JvmType of DateType to Int, but there are still some places putting java.sql.Date into Row. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6783) Add timing and test output for PR tests
[ https://issues.apache.org/jira/browse/SPARK-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6783: --- Component/s: Project Infra Add timing and test output for PR tests --- Key: SPARK-6783 URL: https://issues.apache.org/jira/browse/SPARK-6783 Project: Spark Issue Type: Improvement Components: Build, Project Infra Affects Versions: 1.3.0 Reporter: Brennon York Currently the PR tests that run under {{dev/tests/*}} do not provide any output within the actual Jenkins run. It would be nice to not only have error output, but also timing results from each test and have those surfaced within the Jenkins output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.3.1
Hey All, Today SPARK-6737 came to my attention. This is a bug that causes a memory leak for any long running program that repeatedly saves data out to a Hadoop FileSystem. For that reason, it is problematic for Spark Streaming. My sense is that this is severe enough to cut another RC once the fix is merged (which is imminent): https://issues.apache.org/jira/browse/SPARK-6737 I'll leave a bit of time for others to comment, in particular if people feel we should not wait for this fix. - Patrick On Tue, Apr 7, 2015 at 2:34 PM, Marcelo Vanzin van...@cloudera.com wrote: +1 (non-binding) Ran standalone and yarn tests on the hadoop-2.6 tarball, with and without the external shuffle service in yarn mode. On Sat, Apr 4, 2015 at 5:09 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.1! The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc4e530cbb69851 The list of fixes present in this release can be found at: http://bit.ly/1C2nVPY The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc1/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1080 The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/ Please vote on releasing this package as Apache Spark 1.3.1! The vote is open until Wednesday, April 08, at 01:10 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.3.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org -- Marcelo - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[RESULT] [VOTE] Release Apache Spark 1.3.1
This vote is cancelled in favor of RC2. On Tue, Apr 7, 2015 at 8:13 PM, Josh Rosen rosenvi...@gmail.com wrote: The leak will impact long running streaming jobs even if they don't write Hadoop files, although the problem may take much longer to manifest itself for those jobs. I think we currently leak an empty HashMap per stage submitted in the common case, so it could take a very long time for this to trigger an OOM. On the other hand, the worst case behavior is quite bad for streaming jobs, so we should probably fix this so that 1.2.x streaming users can more safely upgrade to 1.3.x. - Josh Sent from my phone On Apr 7, 2015, at 4:13 PM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Today SPARK-6737 came to my attention. This is a bug that causes a memory leak for any long running program that repeatedly saves data out to a Hadoop FileSystem. For that reason, it is problematic for Spark Streaming. My sense is that this is severe enough to cut another RC once the fix is merged (which is imminent): https://issues.apache.org/jira/browse/SPARK-6737 I'll leave a bit of time for others to comment, in particular if people feel we should not wait for this fix. - Patrick On Tue, Apr 7, 2015 at 2:34 PM, Marcelo Vanzin van...@cloudera.com wrote: +1 (non-binding) Ran standalone and yarn tests on the hadoop-2.6 tarball, with and without the external shuffle service in yarn mode. On Sat, Apr 4, 2015 at 5:09 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.1! The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc4e530cbb69851 The list of fixes present in this release can be found at: http://bit.ly/1C2nVPY The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc1/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1080 The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/ Please vote on releasing this package as Apache Spark 1.3.1! The vote is open until Wednesday, April 08, at 01:10 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.3.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org -- Marcelo - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[VOTE] Release Apache Spark 1.3.1 (RC2)
Please vote on releasing the following candidate as Apache Spark version 1.3.1! The tag to be voted on is v1.3.1-rc2 (commit 7c4473a): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7c4473aa5a7f5de0323394aaedeefbf9738e8eb5 The list of fixes present in this release can be found at: http://bit.ly/1C2nVPY The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1083/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/ The patches on top of RC1 are: [SPARK-6737] Fix memory leak in OutputCommitCoordinator https://github.com/apache/spark/pull/5397 [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py https://github.com/apache/spark/pull/5302 [SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError https://github.com/apache/spark/pull/4933 Please vote on releasing this package as Apache Spark 1.3.1! The vote is open until Saturday, April 11, at 07:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.3.1 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[jira] [Updated] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6222: --- Fix Version/s: 1.4.0 1.3.1 [STREAMING] All data may not be recovered from WAL when driver is killed Key: SPARK-6222 URL: https://issues.apache.org/jira/browse/SPARK-6222 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Hari Shreedharan Priority: Blocker Fix For: 1.3.1, 1.4.0 Attachments: AfterPatch.txt, CleanWithoutPatch.txt, SPARK-6122.patch When testing for our next release, our internal tests written by [~wypoon] caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs a FlumePolling stream to read data from Flume, then kills the Application Master. Once YARN restarts it, the test waits until no more data is to be written and verifies the original against the data on HDFS. This was passing in 1.2.0, but is failing now. Since the test ties into Cloudera's internal infrastructure and build process, it cannot be directly run on an Apache build. But I have been working on isolating the commit that may have caused the regression. I have confirmed that it was caused by SPARK-5147 (PR # [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several times using the test and the failure is consistently reproducible. To re-confirm, I reverted just this one commit (and the Clock consolidation one to avoid conflicts), and the issue was no longer reproducible. Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0. /cc [~tdas], [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.3.1
I believe TD just forgot to set the fix version on the JIRA. There is a fix for this in 1.3: https://github.com/apache/spark/commit/03e263f5b527cf574f4ffcd5cd886f7723e3756e - Patrick On Mon, Apr 6, 2015 at 2:31 PM, Mark Hamstra m...@clearstorydata.com wrote: Is that correct, or is the JIRA just out of sync, since TD's PR was merged? https://github.com/apache/spark/pull/5008 On Mon, Apr 6, 2015 at 11:10 AM, Hari Shreedharan hshreedha...@cloudera.com wrote: It does not look like https://issues.apache.org/jira/browse/SPARK-6222 made it. It was targeted towards this release. Thanks, Hari On Mon, Apr 6, 2015 at 11:04 AM, York, Brennon brennon.y...@capitalone.com wrote: +1 (non-binding) Tested GraphX, build infrastructure, core test suite on OSX 10.9 w/ Java 1.7/1.8 On 4/6/15, 5:21 AM, Sean Owen so...@cloudera.com wrote: SPARK-6673 is not, in the end, relevant for 1.3.x I believe; we just resolved it for 1.4 anyway. False alarm there. I back-ported SPARK-6205 into the 1.3 branch for next time. We'll pick it up if there's another RC, but by itself is not something that needs a new RC. (I will give the same treatment to branch 1.2 if needed in light of the 1.2.2 release.) I applied the simple change in SPARK-6205 in order to continue executing tests and all was well. I still see a few failures in Hive tests: - show_create_table_serde *** FAILED *** - show_tblproperties *** FAILED *** - udf_std *** FAILED *** - udf_stddev *** FAILED *** with ... mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0 -DskipTests clean package; mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0 test ... but these are not regressions from 1.3.0. +1 from me at this point on the current artifacts. On Sun, Apr 5, 2015 at 9:24 AM, Sean Owen so...@cloudera.com wrote: Signatures and hashes are good. LICENSE, NOTICE still check out. Compiles for a Hadoop 2.6 + YARN + Hive profile. I still see the UISeleniumSuite test failure observed in 1.3.0, which is minor and already fixed. I don't know why I didn't back-port it: https://issues.apache.org/jira/browse/SPARK-6205 If we roll another, let's get this easy fix in, but it is only an issue with tests. On JIRA, I checked open issues with Fix Version = 1.3.0 or 1.3.1 and all look legitimate (e.g. reopened or in progress) There is 1 open Blocker for 1.3.1 per Andrew: https://issues.apache.org/jira/browse/SPARK-6673 spark-shell.cmd can't start even when spark was built in Windows I believe this can be resolved quickly but as a matter of hygiene should be fixed or demoted before release. 
FYI there are 16 Critical issues marked for 1.3.0 / 1.3.1; worth examining before release to see how critical they are: SPARK-6701,Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python application,,Open,4/3/15 SPARK-6484,Ganglia metrics xml reporter doesn't escape correctly,Josh Rosen,Open,3/24/15 SPARK-6270,Standalone Master hangs when streaming job completes,,Open,3/11/15 SPARK-6209,ExecutorClassLoader can leak connections after failing to load classes from the REPL class server,Josh Rosen,In Progress,4/2/15 SPARK-5113,Audit and document use of hostnames and IP addresses in Spark,,Open,3/24/15 SPARK-5098,Number of running tasks become negative after tasks lost,,Open,1/14/15 SPARK-4925,Publish Spark SQL hive-thriftserver maven artifact,Patrick Wendell,Reopened,3/23/15 SPARK-4922,Support dynamic allocation for coarse-grained Mesos,,Open,3/31/15 SPARK-4888,Spark EC2 doesn't mount local disks for i2.8xlarge instances,,Open,1/27/15 SPARK-4879,Missing output partitions after job completes with speculative execution,Josh Rosen,Open,3/5/15 SPARK-4751,Support dynamic allocation for standalone mode,Andrew Or,Open,12/22/14 SPARK-4454,Race condition in DAGScheduler,Josh Rosen,Reopened,2/18/15 SPARK-4452,Shuffle data structures can starve others on the same thread for memory,Tianshuo Deng,Open,1/24/15 SPARK-4352,Incorporate locality preferences in dynamic allocation requests,,Open,1/26/15 SPARK-4227,Document external shuffle service,,Open,3/23/15 SPARK-3650,Triangle Count handles reverse edges incorrectly,,Open,2/23/15 On Sun, Apr 5, 2015 at 1:09 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.1! The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f3 1b713ed90bcec63ebc4e530cbb69851 The list of fixes present in this release can be found at: http://bit.ly/1C2nVPY The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc1/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can
Re: 1.3 Build Error with Scala-2.11
The issue is that if you invoke build/mvn it will start zinc again if it sees that it is killed. The absolute most sterile thing to do is this: 1. Kill any zinc processes. 2. Clean up spark git clean -fdx (WARNING: this will delete any staged changes you have, if you have code modifications or extra files around) 3. Run the 2.11 script to change the versions. 4. Run mvn package with maven that you installed on your machine. On Mon, Apr 6, 2015 at 10:43 PM, Marty Bower sp...@mjhb.com wrote: I'm killing zinc (if it's running) before running each build attempt. Trying to build as clean as possible. On Mon, Apr 6, 2015 at 7:31 PM Patrick Wendell pwend...@gmail.com wrote: What if you don't run zinc? I.e. just download maven and run that mvn package It might take longer, but I wonder if it will work. On Mon, Apr 6, 2015 at 10:26 PM, mjhb sp...@mjhb.com wrote: Similar problem on 1.2 branch: [ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT, org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure to find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in http://repository.apache.org/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of apache.snapshots has elapsed or updates are forced - [Help 1] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT, org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure to find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in http://repository.apache.org/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of apache.snapshots has elapsed or updates are forced -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11442.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: 1.3 Build Error with Scala-2.11
One thing that I think can cause issues is if you run build/mvn with Scala 2.10, then try to run it with 2.11, since I think we may store some downloaded jars relating to zinc that will get screwed up. Not sure that's what is happening, just an idea. On Mon, Apr 6, 2015 at 10:54 PM, Patrick Wendell pwend...@gmail.com wrote: The issue is that if you invoke build/mvn it will start zinc again if it sees that it is killed. The absolute most sterile thing to do is this: 1. Kill any zinc processes. 2. Clean up spark git clean -fdx (WARNING: this will delete any staged changes you have, if you have code modifications or extra files around) 3. Run the 2.11 script to change the versions. 4. Run mvn package with maven that you installed on your machine. On Mon, Apr 6, 2015 at 10:43 PM, Marty Bower sp...@mjhb.com wrote: I'm killing zinc (if it's running) before running each build attempt. Trying to build as clean as possible. On Mon, Apr 6, 2015 at 7:31 PM Patrick Wendell pwend...@gmail.com wrote: What if you don't run zinc? I.e. just download maven and run that mvn package It might take longer, but I wonder if it will work. On Mon, Apr 6, 2015 at 10:26 PM, mjhb sp...@mjhb.com wrote: Similar problem on 1.2 branch: [ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT, org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure to find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in http://repository.apache.org/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of apache.snapshots has elapsed or updates are forced - [Help 1] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT, org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure to find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in http://repository.apache.org/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of apache.snapshots has elapsed or updates are forced -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11442.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: 1.3 Build Error with Scala-2.11
What if you don't run zinc? I.e. just download maven and run that mvn package It might take longer, but I wonder if it will work. On Mon, Apr 6, 2015 at 10:26 PM, mjhb sp...@mjhb.com wrote: Similar problem on 1.2 branch: [ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT, org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure to find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in http://repository.apache.org/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of apache.snapshots has elapsed or updates are forced - [Help 1] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT, org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure to find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in http://repository.apache.org/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of apache.snapshots has elapsed or updates are forced -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11442.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: 1.3 Build Error with Scala-2.11
The only thing that can persist outside of Spark is if there is still a live Zinc process. We took care to make sure this was a generally stateless mechanism. Both the 1.2.X and 1.3.X releases are built with Scala 2.11 for packaging purposes. And these have been built as recently as in the last few days, since we are voting on 1.2.2 and 1.3.1. However there could be issues that only affect certain environments. - Patrick On Mon, Apr 6, 2015 at 11:02 PM, mjhb sp...@mjhb.com wrote: I resorted to deleting the spark directory between each build earlier today (attempting maximum sterility) and then re-cloning from github and switching to the 1.2 or 1.3 branch. Does anything persist outside of the spark directory? Are you able to build either 1.2 or 1.3 w/ Scala-2.11? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11447.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: 1.3 Build Error with Scala-2.11
Hmm.. Make sure you are building with the right flags. I think you need to pass -Dscala-2.11 to maven. Take a look at the upstream docs - on my phone now so can't easily access. On Apr 7, 2015 1:01 AM, mjhb sp...@mjhb.com wrote: I even deleted my local maven repository (.m2) but still stuck when attempting to build w/ Scala-2.11: [ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:1.3.2-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT, org.apache.spark:spark-network-shuffle_2.10:jar:1.3.2-SNAPSHOT: Could not find artifact org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT in apache.snapshots (http://repository.apache.org/snapshots) - [Help 1] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:1.3.2-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT, org.apache.spark:spark-network-shuffle_2.10:jar:1.3.2-SNAPSHOT: Could not find artifact org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT in apache.snapshots (http://repository.apache.org/snapshots) -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11449.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[jira] [Updated] (SPARK-6703) Provide a way to discover existing SparkContexts
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6703: --- Description: Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc where there is a shared SparkContext. It would be nice to provide a rendez-vous point so that applications can learn whether an existing SparkContext already exists before creating one. The most simple/surgical way I see to do this is to have an optional static SparkContext singleton that can be retrieved as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} And you could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyways, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. was: Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc where there is a shared SparkContext. It would be nice to have a way to write an application where you can get or create a SparkContext and have some standard type of synchronization point application authors can access. The most simple/surgical way I see to do this is to have an optional static SparkContext singleton that can be retrieved as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} And you could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyways, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. Provide a way to discover existing SparkContexts - Key: SPARK-6703 URL: https://issues.apache.org/jira/browse/SPARK-6703 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Patrick Wendell Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc where there is a shared SparkContext. It would be nice to provide a rendez-vous point so that applications can learn whether an existing SparkContext already exists before creating one. The most simple/surgical way I see to do this is to have an optional static SparkContext singleton that can be retrieved as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} And you could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyways, this seems sufficient and much simpler. 
Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
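[Editor's note] A sketch of how the proposal might look from an application author's point of view. getOrCreate and the setter are what the ticket proposes, not an existing API at the time of this issue, and the setter name below is made up.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// An outer framework (JobServer, notebook server, ...) could register the
// shared context once; setActiveContext is a hypothetical setter name.
// SparkContext.setActiveContext(sharedContext)

// A downstream application then runs unchanged whether it is composed with
// others or launched on its own: it either picks up the shared context or
// creates a fresh one from its own conf.
val sc = SparkContext.getOrCreate(new SparkConf().setAppName("my-app"))
{code}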
[jira] [Commented] (SPARK-6676) Add hadoop 2.4+ for profiles in POM.xml
[ https://issues.apache.org/jira/browse/SPARK-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395932#comment-14395932 ] Patrick Wendell commented on SPARK-6676: [~srowen] This is such a common source of confusion for users, do you think we should just add 2.5 and 2.6 profiles and add a note internally that they are duplicates of 2.4? The maintenance cost there is pretty marginal and it might be better user experience, since this is something people clearly regularly stumble on. Add hadoop 2.4+ for profiles in POM.xml --- Key: SPARK-6676 URL: https://issues.apache.org/jira/browse/SPARK-6676 Project: Spark Issue Type: Improvement Components: Build, Tests Affects Versions: 1.3.0 Reporter: Zhang, Liye Priority: Minor support *-Phadoop-2.5* and *-Phadoop-2.6* when building and testing Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6703) Provide a way to discover existing SparkContexts
Patrick Wendell created SPARK-6703: -- Summary: Provide a way to discover existing SparkContexts Key: SPARK-6703 URL: https://issues.apache.org/jira/browse/SPARK-6703 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Patrick Wendell Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc where there is a shared SparkContext. It would be nice to have a way to write an application where you can get or create a SparkContext and have some standard type of synchronization point application authors can access. The most simple/surgical way I see to do this is to have an optional static SparkContext singleton that can be retrieved as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} And you could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyways, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6627) Clean up of shuffle code and interfaces
[ https://issues.apache.org/jira/browse/SPARK-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-6627. Resolution: Fixed Fix Version/s: 1.4.0 Clean up of shuffle code and interfaces --- Key: SPARK-6627 URL: https://issues.apache.org/jira/browse/SPARK-6627 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Critical Fix For: 1.4.0 The shuffle code in Spark is somewhat messy and could use some interface clean-up, especially with some larger changes outstanding. This is a catch all for what may be some small improvements in a few different PR's. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6659) Spark SQL 1.3 cannot read json file that only with a record.
[ https://issues.apache.org/jira/browse/SPARK-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6659: --- Component/s: SQL Spark SQL 1.3 cannot read json file that only with a record. Key: SPARK-6659 URL: https://issues.apache.org/jira/browse/SPARK-6659 Project: Spark Issue Type: Bug Components: SQL Reporter: luochenghui Dear friends: Spark SQL 1.3 cannot read json file that only with a record. here is my json file's content. {name:milo,age,24} when i run Spark SQL under the local mode,it throws an exception rg.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns _corrupt_record; what i had done: 1 ./spark-shell 2 scala val sqlContext = new org.apache.spark.sql.SQLContext(sc) sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@5f3be6c8 scala val df = sqlContext.jsonFile(/home/milo/person.json) 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=280248975 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 267.1 MB) 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(22692) called with curMem=163705, maxMem=280248975 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 267.1 MB) 15/03/19 22:11:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:35842 (size: 22.2 KB, free: 267.2 MB) 15/03/19 22:11:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0 15/03/19 22:11:45 INFO SparkContext: Created broadcast 0 from textFile at JSONRelation.scala:98 15/03/19 22:11:47 INFO FileInputFormat: Total input paths to process : 1 15/03/19 22:11:47 INFO SparkContext: Starting job: reduce at JsonRDD.scala:51 15/03/19 22:11:47 INFO DAGScheduler: Got job 0 (reduce at JsonRDD.scala:51) with 1 output partitions (allowLocal=false) 15/03/19 22:11:47 INFO DAGScheduler: Final stage: Stage 0(reduce at JsonRDD.scala:51) 15/03/19 22:11:47 INFO DAGScheduler: Parents of final stage: List() 15/03/19 22:11:47 INFO DAGScheduler: Missing parents: List() 15/03/19 22:11:47 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51), which has no missing parents 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(3184) called with curMem=186397, maxMem=280248975 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 267.1 MB) 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(2251) called with curMem=189581, maxMem=280248975 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free 267.1 MB) 15/03/19 22:11:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:35842 (size: 2.2 KB, free: 267.2 MB) 15/03/19 22:11:47 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0 15/03/19 22:11:47 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839 15/03/19 22:11:48 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51) 15/03/19 22:11:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 15/03/19 22:11:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1291 bytes) 15/03/19 22:11:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 15/03/19 22:11:48 INFO HadoopRDD: Input split: file:/home/milo/person.json:0+26 15/03/19 22:11:48 INFO deprecation: 
mapred.tip.id is deprecated. Instead, use mapreduce.task.id 15/03/19 22:11:48 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 15/03/19 22:11:48 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 15/03/19 22:11:48 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 15/03/19 22:11:48 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 15/03/19 22:11:49 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2023 bytes result sent to driver 15/03/19 22:11:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1209 ms on localhost (1/1) 15/03/19 22:11:49 INFO DAGScheduler: Stage 0 (reduce at JsonRDD.scala:51) finished in 1.308 s 15/03/19 22:11:49 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 15/03/19 22:11:49 INFO DAGScheduler: Job 0 finished: reduce at JsonRDD.scala:51, took 2.002429 s df: org.apache.spark.sql.DataFrame = [_corrupt_record: string] 3 scala df.select(name).show() 15/03/19 22:12:44 INFO BlockManager
[jira] [Closed] (SPARK-6659) Spark SQL 1.3 cannot read json file that only with a record.
[ https://issues.apache.org/jira/browse/SPARK-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell closed SPARK-6659. -- Resolution: Invalid Per the comment, I think the issue is the JSON is not correctly formatted. Spark SQL 1.3 cannot read json file that only with a record. Key: SPARK-6659 URL: https://issues.apache.org/jira/browse/SPARK-6659 Project: Spark Issue Type: Bug Components: SQL Reporter: luochenghui Dear friends: Spark SQL 1.3 cannot read json file that only with a record. here is my json file's content. {name:milo,age,24} when i run Spark SQL under the local mode,it throws an exception rg.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns _corrupt_record; what i had done: 1 ./spark-shell 2 scala val sqlContext = new org.apache.spark.sql.SQLContext(sc) sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@5f3be6c8 scala val df = sqlContext.jsonFile(/home/milo/person.json) 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=280248975 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 267.1 MB) 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(22692) called with curMem=163705, maxMem=280248975 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 267.1 MB) 15/03/19 22:11:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:35842 (size: 22.2 KB, free: 267.2 MB) 15/03/19 22:11:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0 15/03/19 22:11:45 INFO SparkContext: Created broadcast 0 from textFile at JSONRelation.scala:98 15/03/19 22:11:47 INFO FileInputFormat: Total input paths to process : 1 15/03/19 22:11:47 INFO SparkContext: Starting job: reduce at JsonRDD.scala:51 15/03/19 22:11:47 INFO DAGScheduler: Got job 0 (reduce at JsonRDD.scala:51) with 1 output partitions (allowLocal=false) 15/03/19 22:11:47 INFO DAGScheduler: Final stage: Stage 0(reduce at JsonRDD.scala:51) 15/03/19 22:11:47 INFO DAGScheduler: Parents of final stage: List() 15/03/19 22:11:47 INFO DAGScheduler: Missing parents: List() 15/03/19 22:11:47 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51), which has no missing parents 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(3184) called with curMem=186397, maxMem=280248975 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 267.1 MB) 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(2251) called with curMem=189581, maxMem=280248975 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free 267.1 MB) 15/03/19 22:11:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:35842 (size: 2.2 KB, free: 267.2 MB) 15/03/19 22:11:47 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0 15/03/19 22:11:47 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839 15/03/19 22:11:48 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51) 15/03/19 22:11:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 15/03/19 22:11:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1291 bytes) 15/03/19 22:11:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 15/03/19 22:11:48 INFO HadoopRDD: Input 
split: file:/home/milo/person.json:0+26 15/03/19 22:11:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 15/03/19 22:11:48 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 15/03/19 22:11:48 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 15/03/19 22:11:48 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 15/03/19 22:11:48 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 15/03/19 22:11:49 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2023 bytes result sent to driver 15/03/19 22:11:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1209 ms on localhost (1/1) 15/03/19 22:11:49 INFO DAGScheduler: Stage 0 (reduce at JsonRDD.scala:51) finished in 1.308 s 15/03/19 22:11:49 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 15/03/19 22:11:49 INFO DAGScheduler: Job 0 finished: reduce at JsonRDD.scala:51, took 2.002429 s df: org.apache.spark.sql.DataFrame = [_corrupt_record: string] 3
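[Editor's note] For reference, the record in the report, {name:milo,age,24}, is not valid JSON (unquoted keys and strings, and a comma where a colon should be), which is why the inferred schema collapses to _corrupt_record. A sketch of the same flow with a well-formed record, assuming the usual sc provided by spark-shell:
{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// jsonFile/jsonRDD expect one well-formed JSON object per line.
val rdd = sc.parallelize(Seq("""{"name":"milo","age":24}"""))
val df = sqlContext.jsonRDD(rdd)   // or sqlContext.jsonFile("/home/milo/person.json")

df.select("name").show()
{code}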
Re: Unit test logs in Jenkins?
Hey Marcelo, Great question. Right now, some of the more active developers have an account that allows them to log into this cluster to inspect logs (we copy the logs from each run to a node on that cluster). The infrastructure is maintained by the AMPLab. I will put you in touch with someone there who can get you an account. This is a short term solution. The longer term solution is to have these scp'd regularly to an S3 bucket or somewhere people can get access to them, but that's not ready yet. - Patrick On Wed, Apr 1, 2015 at 1:01 PM, Marcelo Vanzin van...@cloudera.com wrote: Hey all, Is there a way to access unit test logs in jenkins builds? e.g., core/target/unit-tests.log That would be really helpful to debug build failures. The scalatest output isn't all that helpful. If that's currently not available, would it be possible to add those logs as build artifacts? -- Marcelo - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[jira] [Created] (SPARK-6627) Clean up of shuffle code and interfaces
Patrick Wendell created SPARK-6627: -- Summary: Clean up of shuffle code and interfaces Key: SPARK-6627 URL: https://issues.apache.org/jira/browse/SPARK-6627 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Critical The shuffle code in Spark is somewhat messy and could use some interface clean-up, especially with some larger changes outstanding. This is a catch all for what may be some small improvements in a few different PR's. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6561) Add partition support in saveAsParquet
[ https://issues.apache.org/jira/browse/SPARK-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383413#comment-14383413 ] Patrick Wendell commented on SPARK-6561: FYI - I just removed Affects Version/s since that is only for bugs (to indicate which version has the bug). Add partition support in saveAsParquet -- Key: SPARK-6561 URL: https://issues.apache.org/jira/browse/SPARK-6561 Project: Spark Issue Type: Improvement Components: SQL Reporter: Jianshi Huang Now ParquetRelation2 supports automatic partition discovery, which is very nice. When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: {code} def saveAsParquetFile(path: String, partitionColumns: Seq[String]) {code} Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6561) Add partition support in saveAsParquet
[ https://issues.apache.org/jira/browse/SPARK-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6561: --- Affects Version/s: (was: 1.3.1) (was: 1.3.0) Add partition support in saveAsParquet -- Key: SPARK-6561 URL: https://issues.apache.org/jira/browse/SPARK-6561 Project: Spark Issue Type: Improvement Components: SQL Reporter: Jianshi Huang Now ParquetRelation2 supports automatic partition discovery which is very nice. When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: {code} def saveAsParquetFile(path: String, partitionColumns: Seq[String]) {code} Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
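[Editor's note] A usage sketch of the overload proposed in this ticket. The method shown is the ticket's proposal, not an existing API in 1.3.x, and df and the "date" column are assumptions for illustration.
{code}
// Proposed API only; not available in 1.3.x.
// df is an existing DataFrame that contains a "date" column.
df.saveAsParquetFile("/data/events", partitionColumns = Seq("date"))

// The natural output would be a layout that ParquetRelation2's automatic
// partition discovery can already read back, e.g.:
//   /data/events/date=2015-03-27/part-r-00001.parquet
//   /data/events/date=2015-03-28/part-r-00001.parquet
{code}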
Re: RDD resiliency -- does it keep state?
If you invoke this, you will get at-least-once semantics on failure. For instance, if a machine dies in the middle of executing the foreach for a single partition, that will be re-executed on another machine. It could even fully complete on one machine, but the machine could die immediately before reporting the result back to the driver. This means you need to make sure the side-effects are idempotent, or use some transactional locking. Spark's own output operations, such as saving to Hadoop, use such mechanisms. For instance, in the case of Hadoop it uses the OutputCommitter classes. - Patrick On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos michal.klo...@gmail.com wrote: Hi Spark group, We haven't been able to find clear descriptions of how Spark handles the resiliency of RDDs in relation to executing actions with side-effects. If you do an `rdd.foreach(someSideEffect)`, then you are doing a side-effect for each element in the RDD. If a partition goes down, the resiliency rebuilds the data, but does it keep track of how far it got in the partition's set of data, or will it start from the beginning again? So will it do at-least-once execution of foreach closures or at-most-once? thanks, Michal - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
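[Editor's note] A small sketch of the idempotency point above, assuming rdd holds key/value pairs and using a made-up external store client (ExternalStore is not a real API). If a partition is re-executed after a failure, writing the same key/value pair again leaves the store unchanged, so at-least-once execution of the closure is harmless.
{code}
rdd.foreachPartition { records =>
  val store = ExternalStore.connect()      // hypothetical client
  try {
    records.foreach { case (key, value) =>
      store.put(key, value)                // idempotent upsert, not an append
    }
  } finally {
    store.close()
  }
}
{code}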
[jira] [Updated] (SPARK-6544) Problem with Avro and Kryo Serialization
[ https://issues.apache.org/jira/browse/SPARK-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6544: --- Fix Version/s: 1.3.1 Problem with Avro and Kryo Serialization Key: SPARK-6544 URL: https://issues.apache.org/jira/browse/SPARK-6544 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.3.0 Reporter: Dean Chen Fix For: 1.3.1, 1.4.0 We're running in to the following bug with Avro 1.7.6 and the Kryo serializer causing jobs to fail https://issues.apache.org/jira/browse/AVRO-1476?focusedCommentId=13999249page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13999249 PR here https://github.com/apache/spark/pull/5193 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: Spark 1.3 Source - Github and source tar does not seem to match
The source code should match the Spark commit 4aaf48d46d13129f0f9bdafd771dd80fe568a7dc. Do you see any differences? On Fri, Mar 27, 2015 at 11:28 AM, Manoj Samel manojsamelt...@gmail.com wrote: While looking into an issue, I noticed that the source displayed on the Github site does not match the downloaded tar for 1.3. Thoughts? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
[jira] [Resolved] (SPARK-4073) Parquet+Snappy can cause significant off-heap memory usage
[ https://issues.apache.org/jira/browse/SPARK-4073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4073. Resolution: Won't Fix I have never seen anyone else run into this, so closing as not urgent enough to deal with at the moment. One way to fix this is to fix PARQUET-118 so that users can use on-heap buffers with parquet. Parquet+Snappy can cause significant off-heap memory usage -- Key: SPARK-4073 URL: https://issues.apache.org/jira/browse/SPARK-4073 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Patrick Wendell Priority: Critical The parquet snappy codec allocates off-heap buffers for decompression[1]. In one case the observed size of these buffers was high enough to add several GB of data to the overall virtual memory usage of the Spark executor process. I don't understand enough about our use of Snappy to fully grok how much data we would _expect_ to be present in these buffers at any given time, but I can say a few things. 1. The dataset had individual rows that were fairly large, e.g. megabytes. 2. Direct buffers are not cleaned up until GC events, and overall there was not much heap contention. So maybe they just weren't being cleaned. I opened PARQUET-118 to see if they can provide an option to use on-heap buffers for decompression. In the meantime, we could consider changing the default back to gzip, or we could do nothing (not sure how many other users will hit this). [1] https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/codec/SnappyDecompressor.java#L28 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
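[Editor's note] Short of changing the default codec, one per-application way to side-step the Snappy off-heap buffers is to switch the Parquet output codec for a given job. A sketch, assuming a SQLContext named sqlContext is in scope:
{code}
// Write subsequent Parquet output with gzip instead of snappy.
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
{code}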
[jira] [Resolved] (SPARK-5025) Write a guide for creating well-formed packages for Spark
[ https://issues.apache.org/jira/browse/SPARK-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5025. Resolution: Won't Fix I'm closing this as wont fix. There are now a bunch of community packages as examples, so I think people can just follow those examples. Write a guide for creating well-formed packages for Spark - Key: SPARK-5025 URL: https://issues.apache.org/jira/browse/SPARK-5025 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Patrick Wendell Assignee: Patrick Wendell There are an increasing number of OSS projects providing utilities and extensions to Spark. We should write a guide in the Spark docs that explains how to create, package, and publish a third party Spark library. There are a few issues here such as how to list your dependency on Spark, how to deal with your own third party dependencies, etc. We should also cover how to do this for Python libraries. In general, we should make it easy to build extension points against any of Spark's API's (e.g. for new data sources, streaming receivers, ML algos, etc) and self-publish libraries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1844) Support maven-style dependency resolution in sbt build
[ https://issues.apache.org/jira/browse/SPARK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1844. Resolution: Won't Fix Closing given the combination of (a) this is not that important and (b) seems really hard to fix. I have seen times where this discrepancy caused developers some trouble - I guess we'll just say it's part of living with the SBT build. Support maven-style dependency resolution in sbt build -- Key: SPARK-1844 URL: https://issues.apache.org/jira/browse/SPARK-1844 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Assignee: Prashant Sharma [Currently this is a brainstorm/wish - not sure it's possible] Ivy/sbt and maven use fundamentally different strategies when transitive dependencies conflict (i.e. when we have two copies of library Y in our dependency graph on different versions). This actually means our sbt and maven builds have been divergent for a long time. Ivy/sbt have a pluggable notion of a [conflict manager|http://grepcode.com/file/repo1.maven.org/maven2/org.apache.ivy/ivy/2.3.0/org/apache/ivy/plugins/conflict/ConflictManager.java]. The default chooses the newest version of the dependency. SBT [allows this to be changed|http://www.scala-sbt.org/release/sxr/sbt/IvyInterface.scala.html#sbt;ConflictManager] though. Maven employs the [nearest wins|http://techidiocy.com/maven-dependency-version-conflict-problem-and-resolution/] policy which means the version closest to the project root is chosen. It would be nice to be able to have matching semantics in the builds. We could do this by writing a conflict manager in sbt that mimics Maven's behavior. The fact that IVY-813 has existed for 6 years without anyone doing this makes me wonder if that is not possible or very hard :P -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
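For readers unfamiliar with the sbt hook being referenced, here is a hedged sketch: sbt exposes a conflictManager setting in build.sbt, so swapping the strategy is a one-liner, but none of the built-in managers implements Maven's nearest-wins policy, which is the part that would require a custom ConflictManager.

```scala
// build.sbt sketch (assumption: sbt 0.13.x). The default manager picks the newest version;
// "strict" instead fails the build when transitive versions conflict, forcing an explicit choice.
// Neither reproduces Maven's nearest-wins behavior, which is what this issue asks for.
conflictManager := ConflictManager.strict
```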
[jira] [Updated] (SPARK-2709) Add a tool for certifying Spark API compatibility
[ https://issues.apache.org/jira/browse/SPARK-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2709: --- Target Version/s: (was: 1.2.0) Add a tool for certifying Spark API compatibility Key: SPARK-2709 URL: https://issues.apache.org/jira/browse/SPARK-2709 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Patrick Wendell Assignee: Prashant Sharma As Spark is packaged by more and more distributors, it would be good to have a tool that verifies API compatibility of a provided Spark package. The tool would certify that a vendor distribution of Spark contains all of the API's present in a particular upstream Spark version. This will help vendors make sure they remain API compliant when they make changes or backports to Spark. It will also discourage vendors from knowingly breaking API's, because anyone can audit their distribution and see that they have removed support for certain API's. I'm hoping a tool like this will avoid API fragmentation in the Spark community. One poor man's implementation of this is that a vendor can just run the binary compatibility checks in the spark build against an upstream version of Spark. That's a pretty good start, but it means you can't come as a third party and audit a distribution. Another approach would be to have something where anyone can come in and audit a distribution even if they don't have access to the packaging and source code. That would look something like this: 1. For each release we publish a manifest of all public API's (we might borrow the MIMA string representation of byte code signatures) 2. We package an auditing tool as a jar file. 3. The user runs a tool with spark-submit that reflectively walks through all exposed Spark API's and makes sure that everything on the manifest is encountered. From the implementation side, this is just brainstorming at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2709) Add a tool for certifying Spark API compatibility
[ https://issues.apache.org/jira/browse/SPARK-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2709: --- Priority: Critical (was: Major) Add a tool for certifying Spark API compatibility Key: SPARK-2709 URL: https://issues.apache.org/jira/browse/SPARK-2709 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Critical As Spark is packaged by more and more distributors, it would be good to have a tool that verifies API compatibility of a provided Spark package. The tool would certify that a vendor distribution of Spark contains all of the API's present in a particular upstream Spark version. This will help vendors make sure they remain API compliant when they make changes or backports to Spark. It will also discourage vendors from knowingly breaking API's, because anyone can audit their distribution and see that they have removed support for certain API's. I'm hoping a tool like this will avoid API fragmentation in the Spark community. One poor man's implementation of this is that a vendor can just run the binary compatibility checks in the spark build against an upstream version of Spark. That's a pretty good start, but it means you can't come as a third party and audit a distribution. Another approach would be to have something where anyone can come in and audit a distribution even if they don't have access to the packaging and source code. That would look something like this: 1. For each release we publish a manifest of all public API's (we might borrow the MIMA string representation of byte code signatures) 2. We package an auditing tool as a jar file. 3. The user runs a tool with spark-submit that reflectively walks through all exposed Spark API's and makes sure that everything on the manifest is encountered. From the implementation side, this is just brainstorming at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-2709) Add a tool for certifying Spark API compatibility
[ https://issues.apache.org/jira/browse/SPARK-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-2709: This came up in some recent conversations. I actually don't think we ever merged this into Spark, so re-opening the issue. Add a tool for certifying Spark API compatibility Key: SPARK-2709 URL: https://issues.apache.org/jira/browse/SPARK-2709 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Patrick Wendell Assignee: Prashant Sharma As Spark is packaged by more and more distributors, it would be good to have a tool that verifies API compatibility of a provided Spark package. The tool would certify that a vendor distribution of Spark contains all of the API's present in a particular upstream Spark version. This will help vendors make sure they remain API compliant when they make changes or backports to Spark. It will also discourage vendors from knowingly breaking API's, because anyone can audit their distribution and see that they have removed support for certain API's. I'm hoping a tool like this will avoid API fragmentation in the Spark community. One poor man's implementation of this is that a vendor can just run the binary compatibility checks in the spark build against an upstream version of Spark. That's a pretty good start, but it means you can't come as a third party and audit a distribution. Another approach would be to have something where anyone can come in and audit a distribution even if they don't have access to the packaging and source code. That would look something like this: 1. For each release we publish a manifest of all public API's (we might borrow the MIMA string representation of byte code signatures) 2. We package an auditing tool as a jar file. 3. The user runs a tool with spark-submit that reflectively walks through all exposed Spark API's and makes sure that everything on the manifest is encountered. From the implementation side, this is just brainstorming at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
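To make step 3 of the proposal concrete, here is a rough, hypothetical sketch of what a manifest-driven audit could look like; none of this is actual Spark tooling, and the one-signature-per-line manifest format is invented purely for illustration.

```scala
// Hypothetical sketch of an API audit: read a manifest of "fully.qualified.Class#method"
// entries and check via reflection that each one exists on the classpath of the
// distribution being audited (e.g. when run with spark-submit against that distro).
import scala.io.Source

object ApiAudit {
  def main(args: Array[String]): Unit = {
    val manifest = Source.fromFile(args(0)).getLines().filter(_.nonEmpty).toSeq
    val missing = manifest.filterNot { entry =>
      val Array(className, methodName) = entry.split("#")
      try {
        Class.forName(className).getMethods.exists(_.getName == methodName)
      } catch {
        case _: ClassNotFoundException => false
      }
    }
    if (missing.isEmpty) println("All manifest entries were found.")
    else missing.foreach(m => println(s"MISSING: $m"))
  }
}
```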
[jira] [Resolved] (SPARK-6405) Spark Kryo buffer should be forced to be max. 2GB
[ https://issues.apache.org/jira/browse/SPARK-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-6405. Resolution: Fixed Assignee: Matthew Cheah Spark Kryo buffer should be forced to be max. 2GB - Key: SPARK-6405 URL: https://issues.apache.org/jira/browse/SPARK-6405 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Matt Cheah Assignee: Matthew Cheah Fix For: 1.4.0 Kryo buffers used in serialization are backed by Java byte arrays, which have a maximum size of 2GB. However, we blindly set the size without worrying about numeric overflow or regard for the maximum array size. We should enforce the maximum buffer size to be 2GB and warn the user when they have exceeded that amount. I'm open to the idea of flat-out failing the initialization of the Spark Context if the buffer size is over 2GB, but I'm afraid that could break backwards-compatibility... although one can argue that the user had incorrect buffer sizes in the first place. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
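As a rough illustration of the enforcement being described (this is a sketch, not the patch that was merged), the check amounts to reading the buffer setting in megabytes and rejecting anything whose byte size could not fit in a Java array; the config key and default below are assumptions based on the 1.3-era settings.

```scala
// Hedged sketch: refuse Kryo buffer sizes of 2 GB or more, since Kryo buffers are backed
// by Java byte arrays. Assumes a SparkConf named conf and the spark.kryoserializer.buffer.max.mb key.
val maxAllowedMb = 2048
val maxBufferMb = conf.getInt("spark.kryoserializer.buffer.max.mb", 64)
if (maxBufferMb >= maxAllowedMb) {
  throw new IllegalArgumentException(
    s"spark.kryoserializer.buffer.max.mb must be less than $maxAllowedMb MB, got $maxBufferMb MB")
}
// Compute the byte size as a Long so the check above is not undermined by Int overflow.
val maxBufferBytes: Long = maxBufferMb.toLong * 1024 * 1024
```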
[jira] [Resolved] (SPARK-6549) Spark console logger logs to stderr by default
[ https://issues.apache.org/jira/browse/SPARK-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-6549. Resolution: Won't Fix I think this is a won't fix due to compatibility issues. If I'm wrong please feel free to re-open. Spark console logger logs to stderr by default -- Key: SPARK-6549 URL: https://issues.apache.org/jira/browse/SPARK-6549 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.2.0 Reporter: Pavel Sakun Priority: Trivial Labels: log4j Spark's console logger is configured to log messages with INFO level to stderr by default while it should be stdout: https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/log4j-defaults.properties https://github.com/apache/spark/blob/master/conf/log4j.properties.template -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: RDD.map does not allowed to preservesPartitioning?
I think we have a version of mapPartitions that allows you to tell Spark the partitioning is preserved: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L639 We could also add a map function that does the same. Or you can just write your map using an iterator. - Patrick On Thu, Mar 26, 2015 at 3:07 PM, Jonathan Coveney jcove...@gmail.com wrote: This is just a deficiency of the api, imo. I agree: mapValues could definitely be a function (K, V) => V1. The option isn't set by the function, it's on the RDD. So you could look at the code and do this. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala def mapValues[U](f: V => U): RDD[(K, U)] = { val cleanF = self.context.clean(f) new MapPartitionsRDD[(K, U), (K, V)](self, (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) }, preservesPartitioning = true) } What you want: def mapValues[U](f: (K, V) => U): RDD[(K, U)] = { val cleanF = self.context.clean(f) new MapPartitionsRDD[(K, U), (K, V)](self, (context, pid, iter) => iter.map { case t@(k, _) => (k, cleanF(t)) }, preservesPartitioning = true) } One of the nice things about spark is that making such new operators is very easy :) 2015-03-26 17:54 GMT-04:00 Zhan Zhang zzh...@hortonworks.com: Thanks Jonathan. You are right regarding rewriting the example. I mean providing such an option to the developer so that it is controllable. The example may seem silly, and I don't know the use cases. But for example, if I also want to operate on both the key and value parts to generate some new value while keeping the key part untouched, then mapValues may not be able to do this. Changing the code to allow this is trivial, but I don't know whether there is some special reason behind this. Thanks. Zhan Zhang On Mar 26, 2015, at 2:49 PM, Jonathan Coveney jcove...@gmail.com wrote: I believe if you do the following: sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString (8) MapPartitionsRDD[34] at reduceByKey at <console>:23 [] | MapPartitionsRDD[33] at mapValues at <console>:23 [] | ShuffledRDD[32] at reduceByKey at <console>:23 [] +-(8) MapPartitionsRDD[31] at map at <console>:23 [] | ParallelCollectionRDD[30] at parallelize at <console>:23 [] The difference is that spark has no way to know that your map closure doesn't change the key. if you only use mapValues, it does. Pretty cool that they optimized that :) 2015-03-26 17:44 GMT-04:00 Zhan Zhang zzh...@hortonworks.com: Hi Folks, Does anybody know what is the reason for not allowing preservesPartitioning in RDD.map? Do I miss something here? The following example involves two shuffles. I think if preservesPartitioning is allowed, we can avoid the second one, right? val r1 = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)) val r2 = r1.map((_, 1)) val r3 = r2.reduceByKey(_+_) val r4 = r3.map(x => (x._1, x._2 + 1)) val r5 = r4.reduceByKey(_+_) r5.collect.foreach(println) scala> r5.toDebugString res2: String = (8) ShuffledRDD[4] at reduceByKey at <console>:29 [] +-(8) MapPartitionsRDD[3] at map at <console>:27 [] | ShuffledRDD[2] at reduceByKey at <console>:25 [] +-(8) MapPartitionsRDD[1] at map at <console>:23 [] | ParallelCollectionRDD[0] at parallelize at <console>:21 [] Thanks. Zhan Zhang - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
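To make the workaround Patrick points at concrete, here is a small sketch (assuming a SparkContext named sc) that rewrites the key-preserving map in Zhan's example with mapPartitions(..., preservesPartitioning = true), so the partitioner from the first reduceByKey is kept and the second shuffle goes away:

```scala
// Same computation as r1..r5 above, but the map step keeps the partitioner.
val r1 = sc.parallelize(List(1, 2, 3, 4, 5, 5, 6, 6, 7, 8, 9, 10, 2, 4))
val r3 = r1.map((_, 1)).reduceByKey(_ + _)
val r4 = r3.mapPartitions(
  iter => iter.map { case (k, v) => (k, v + 1) },   // key is untouched, so this is safe
  preservesPartitioning = true)
val r5 = r4.reduceByKey(_ + _)   // no extra shuffle: r4 reports the same partitioner as r3
r5.collect().foreach(println)
```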
[jira] [Updated] (SPARK-6499) pyspark: printSchema command on a dataframe hangs
[ https://issues.apache.org/jira/browse/SPARK-6499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6499: --- Component/s: PySpark pyspark: printSchema command on a dataframe hangs - Key: SPARK-6499 URL: https://issues.apache.org/jira/browse/SPARK-6499 Project: Spark Issue Type: Bug Components: PySpark Reporter: cynepia Attachments: airports.json, pyspark.txt 1. A printSchema() on a dataframe fails to respond even after a lot of time. Will attach the console logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6520) Kryo serialization broken in the shell
[ https://issues.apache.org/jira/browse/SPARK-6520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6520: --- Component/s: Spark Shell Kryo serialization broken in the shell -- Key: SPARK-6520 URL: https://issues.apache.org/jira/browse/SPARK-6520 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.3.0 Reporter: Aaron Defazio If I start spark as follows: {quote} ~/spark-1.3.0-bin-hadoop2.4/bin/spark-shell --master local[1] --conf spark.serializer=org.apache.spark.serializer.KryoSerializer {quote} Then using :paste, run {quote} case class Example(foo : String, bar : String) val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", "bar2"))).collect() {quote} I get the error: {quote} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.io.IOException: com.esotericsoftware.kryo.KryoException: Error constructing instance of class: $line3.$read Serialization trace: $VAL10 ($iwC) $outer ($iwC$$iwC) $outer ($iwC$$iwC$Example) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1140) at org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:979) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1873) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) {quote} As far as I can tell, when using :paste, Kryo serialization doesn't work for classes defined within the same paste. It does work when the statements are entered without paste. This issue seems serious to me, since Kryo serialization is virtually mandatory for performance (20x slower with default serialization on my problem), and I'm assuming feature parity between spark-shell and spark-submit is a goal. Note that this is different from SPARK-6497, which covers the case when Kryo is set to require registration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
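For anyone trying to reproduce this outside the shell, here is a hedged sketch of a minimal standalone harness that uses Kryo with explicit class registration (it does not fix the :paste behaviour described above; the class and app names are illustrative):

```scala
// Hedged repro harness: the same case-class round trip as in the report, run via
// spark-submit instead of the shell, with the class registered against Kryo.
import org.apache.spark.{SparkConf, SparkContext}

case class Example(foo: String, bar: String)

object KryoRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-repro")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Example]))
    val sc = new SparkContext(conf)
    val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", "bar2"))).collect()
    ex.foreach(println)
    sc.stop()
  }
}
```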
[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue
[ https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14380432#comment-14380432 ] Patrick Wendell commented on SPARK-6481: Hey All, One issue here, (I think?) right now unfortunately no users have sufficient permission to make the state change into In Progress because of the way that the JIRA is currently set up. Currently we don't expose the Start Progress button on any screen, so I think that makes it unavailable from the API call. At least, I just used my own credentials and I was not able to see the Start Progress transition on a JIRA, even though AFAIK I have the highest permissions possible. The reason we do this I think was that we wanted to restrict assignment of JIRA's to the committership for now and the Start Progress button automatically assigns issues to a new person clicking it. In my ideal world it works such that typical users cannot modify this state transition and it is only possible to put it in progress via a github pull request. If there is such a permission scheme that allows that, then we should see about asking ASF to enable it for our JIRA. In terms of assignment, I'd say for now just leave the assignment as it was before. Set In Progress when a PR is opened for an issue -- Key: SPARK-6481 URL: https://issues.apache.org/jira/browse/SPARK-6481 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Michael Armbrust Assignee: Nicholas Chammas [~pwendell] and I are not sure if this is possible, but it would be really helpful if the JIRA status was updated to In Progress when we do the linking to an open pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: hadoop input/output format advanced control
Yeah I agree that might have been nicer, but I think for consistency with the input API's maybe we should do the same thing. We can also give an example of how to clone sc.hadoopConfiguration and then set some new values: val conf = sc.hadoopConfiguration.clone() .set(k1, v1) .set(k2, v2) val rdd = sc.objectFile(..., conf) I have no idea if that's the correct syntax, but something like that seems almost as easy as passing a hashmap with deltas. - Patrick On Wed, Mar 25, 2015 at 6:34 AM, Koert Kuipers ko...@tresata.com wrote: my personal preference would be something like a Map[String, String] that only reflects the changes you want to make the Configuration for the given input/output format (so system wide defaults continue to come from sc.hadoopConfiguration), similarly to what cascading/scalding did, but am arbitrary Configuration will work too. i will make a jira and pullreq when i have some time. On Wed, Mar 25, 2015 at 1:23 AM, Patrick Wendell pwend...@gmail.com wrote: I see - if you look, in the saving functions we have the option for the user to pass an arbitrary Configuration. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L894 It seems fine to have the same option for the loading functions, if it's easy to just pass this config into the input format. On Tue, Mar 24, 2015 at 3:46 PM, Koert Kuipers ko...@tresata.com wrote: the (compression) codec parameter that is now part of many saveAs... methods came from a similar need. see SPARK-763 hadoop has many options like this. you either going to have to allow many more of these optional arguments to all the methods that read from hadoop inputformats and write to hadoop outputformats, or you force people to re-create these methods using HadoopRDD, i think (if thats even possible). On Tue, Mar 24, 2015 at 6:40 PM, Koert Kuipers ko...@tresata.com wrote: i would like to use objectFile with some tweaks to the hadoop conf. currently there is no way to do that, except recreating objectFile myself. and some of the code objectFile uses i have no access to, since its private to spark. On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell pwend...@gmail.com wrote: Yeah - to Nick's point, I think the way to do this is to pass in a custom conf when you create a Hadoop RDD (that's AFAIK why the conf field is there). Is there anything you can't do with that feature? On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Imran, on your point to read multiple files together in a partition, is it not simpler to use the approach of copy Hadoop conf and set per-RDD settings for min split to control the input size per partition, together with something like CombineFileInputFormat? On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com wrote: I think this would be a great addition, I totally agree that you need to be able to set these at a finer context than just the SparkContext. Just to play devil's advocate, though -- the alternative is for you just subclass HadoopRDD yourself, or make a totally new RDD, and then you could expose whatever you need. Why is this solution better? IMO the criteria are: (a) common operations (b) error-prone / difficult to implement (c) non-obvious, but important for performance I think this case fits (a) (c), so I think its still worthwhile. But its also worth asking whether or not its too difficult for a user to extend HadoopRDD right now. 
There have been several cases in the past week where we've suggested that a user should read from hdfs themselves (eg., to read multiple files together in one partition) -- with*out* reusing the code in HadoopRDD, though they would lose things like the metric tracking preferred locations you get from HadoopRDD. Does HadoopRDD need to some refactoring to make that easier to do? Or do we just need a good example? Imran (sorry for hijacking your thread, Koert) On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com wrote: see email below. reynold suggested i send it to dev instead of user -- Forwarded message -- From: Koert Kuipers ko...@tresata.com Date: Mon, Mar 23, 2015 at 4:36 PM Subject: hadoop input/output format advanced control To: u...@spark.apache.org u...@spark.apache.org currently its pretty hard to control the Hadoop Input/Output formats used in Spark. The conventions seems to be to add extra parameters to all methods and then somewhere deep inside the code (for example in PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into settings on the Hadoop Configuration object. for example for compression i see codec
Re: hadoop input/output format advanced control
Great - that's even easier. Maybe we could have a simple example in the doc. On Wed, Mar 25, 2015 at 7:06 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Regarding Patrick's question, you can just do new Configuration(oldConf) to get a cloned Configuration object and add any new properties to it. -Sandy On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid iras...@cloudera.com wrote: Hi Nick, I don't remember the exact details of these scenarios, but I think the user wanted a lot more control over how the files got grouped into partitions, to group the files together by some arbitrary function. I didn't think that was possible w/ CombineFileInputFormat, but maybe there is a way? thanks On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Imran, on your point to read multiple files together in a partition, is it not simpler to use the approach of copy Hadoop conf and set per-RDD settings for min split to control the input size per partition, together with something like CombineFileInputFormat? On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com wrote: I think this would be a great addition, I totally agree that you need to be able to set these at a finer context than just the SparkContext. Just to play devil's advocate, though -- the alternative is for you just subclass HadoopRDD yourself, or make a totally new RDD, and then you could expose whatever you need. Why is this solution better? IMO the criteria are: (a) common operations (b) error-prone / difficult to implement (c) non-obvious, but important for performance I think this case fits (a) (c), so I think its still worthwhile. But its also worth asking whether or not its too difficult for a user to extend HadoopRDD right now. There have been several cases in the past week where we've suggested that a user should read from hdfs themselves (eg., to read multiple files together in one partition) -- with*out* reusing the code in HadoopRDD, though they would lose things like the metric tracking preferred locations you get from HadoopRDD. Does HadoopRDD need to some refactoring to make that easier to do? Or do we just need a good example? Imran (sorry for hijacking your thread, Koert) On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com wrote: see email below. reynold suggested i send it to dev instead of user -- Forwarded message -- From: Koert Kuipers ko...@tresata.com Date: Mon, Mar 23, 2015 at 4:36 PM Subject: hadoop input/output format advanced control To: u...@spark.apache.org u...@spark.apache.org currently its pretty hard to control the Hadoop Input/Output formats used in Spark. The conventions seems to be to add extra parameters to all methods and then somewhere deep inside the code (for example in PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into settings on the Hadoop Configuration object. for example for compression i see codec: Option[Class[_ : CompressionCodec]] = None added to a bunch of methods. how scalable is this solution really? for example i need to read from a hadoop dataset and i dont want the input (part) files to get split up. the way to do this is to set mapred.min.split.size. now i dont want to set this at the level of the SparkContext (which can be done), since i dont want it to apply to input formats in general. i want it to apply to just this one specific input dataset i need to read. which leaves me with no options currently. 
i could go add yet another input parameter to all the methods (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile, etc.). but that seems ineffective. why can we not expose a Map[String, String] or some other generic way to manipulate settings for hadoop input/output formats? it would require adding one more parameter to all methods to deal with hadoop input/output formats, but after that its done. one parameter to rule them all then i could do: val x = sc.textFile("/some/path", formatSettings = Map("mapred.min.split.size" -> "12345")) or rdd.saveAsTextFile("/some/path", formatSettings = Map("mapred.output.compress" -> "true", "mapred.output.compression.codec" -> "somecodec")) - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
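Putting the pieces of this thread together, a hedged sketch of the per-dataset approach that already exists today (as opposed to the proposed formatSettings parameter) looks roughly like the following; it assumes a SparkContext named sc, and the split-size value is just an illustrative "do not split" number:

```scala
// Clone the context-wide Hadoop configuration, tweak it for this one input only, and pass it
// to an RDD-creation method that accepts a Configuration, leaving the global defaults untouched.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val localConf = new Configuration(sc.hadoopConfiguration)
localConf.set("mapreduce.input.fileinputformat.split.minsize", (1L << 40).toString)

val lines = sc.newAPIHadoopFile(
    "/some/path",
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    localConf)
  .map { case (_, text) => text.toString }
```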
Re: 1.3 Hadoop File System problem
Hey Jim, Thanks for reporting this. Can you give a small end-to-end code example that reproduces it? If so, we can definitely fix it. - Patrick On Tue, Mar 24, 2015 at 4:55 PM, Jim Carroll jimfcarr...@gmail.com wrote: I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails to find the s3 hadoop file system. I get the java.lang.IllegalArgumentException: Wrong FS: s3://path to my file], expected: file:/// when I try to save a parquet file. This worked in 1.2.1. Has anyone else seen this? I'm running spark using local[8] so it's all internal. These are actually unit tests in our app that are failing now. Thanks. Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/1-3-Hadoop-File-System-problem-tp22207.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
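Along the lines of the repro Patrick is asking for, a hedged sketch might look like the following (the bucket and path are placeholders, and this assumes the 1.3 DataFrame API with local[8] as in the report):

```scala
// Hypothetical minimal reproduction: write a tiny DataFrame as Parquet to an s3:// URI.
// Per the report this works on 1.2.1 and fails on 1.3.0 with "Wrong FS ... expected: file:///".
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
df.saveAsParquetFile("s3://some-bucket/some/prefix/test.parquet")
```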
Re: Any guidance on when to back port and how far?
My philosophy has been basically what you suggested, Sean. One thing you didn't mention though is if a bug fix seems complicated, I will think very hard before back-porting it. This is because fixes can introduce their own new bugs, in some cases worse than the original issue. It's really bad to have some upgrade to a patch release and see a regression - with our current approach this almost never happens. I will usually try to backport up to N-2, if it can be back-ported reasonably easily (for instance, with minor or no code changes). The reason I do this is that vendors do end up supporting older versions, and it's nice for them if some committer has backported a fix that they can then pull in, even if we never ship it. In terms of doing older maintenance releases, this one I think we should do according to severity of issues (for instance, if there is a security issue) or based on general command from the community. I haven't initiated many 1.X.2 releases recently because I didn't see huge demand. However, personally I don't mind doing these if there is a lot of demand, at least for releases where .0 has gone out in the last six months. On Tue, Mar 24, 2015 at 11:23 AM, Michael Armbrust mich...@databricks.com wrote: Two other criteria that I use when deciding what to backport: - Is it a regression from a previous minor release? I'm much more likely to backport fixes in this case, as I'd love for most people to stay up to date. - How scary is the change? I think the primary goal is stability of the maintenance branches. When I am confident that something is isolated and unlikely to break things (i.e. I'm fixing a confusing error message), then i'm much more likely to backport it. Regarding the length of time to continue backporting, I mostly don't backport to N-1, but this is partially because SQL is changing too fast for that to generally be useful. These old branches usually only get attention from me when there is an explicit request. I'd love to hear more feedback from others. Michael On Tue, Mar 24, 2015 at 6:13 AM, Sean Owen so...@cloudera.com wrote: So far, my rule of thumb has been: - Don't back-port new features or improvements in general, only bug fixes - Don't back-port minor bug fixes - Back-port bug fixes that seem important enough to not wait for the next minor release - Back-port site doc changes to the release most likely to go out next, to make it a part of the next site publish But, how far should back-ports go, in general? If the last minor release was 1.N, then to branch 1.N surely. Farther back is a question of expectation for support of past minor releases. Given the pace of change and time available, I assume there's not much support for continuing to use release 1.(N-1) and very little for 1.(N-2). Concretely: does anyone expect a 1.1.2 release ever? a 1.2.2 release? It'd be good to hear the received wisdom explicitly. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: hadoop input/output format advanced control
Yeah - to Nick's point, I think the way to do this is to pass in a custom conf when you create a Hadoop RDD (that's AFAIK why the conf field is there). Is there anything you can't do with that feature? On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Imran, on your point to read multiple files together in a partition, is it not simpler to use the approach of copy Hadoop conf and set per-RDD settings for min split to control the input size per partition, together with something like CombineFileInputFormat? On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com wrote: I think this would be a great addition, I totally agree that you need to be able to set these at a finer context than just the SparkContext. Just to play devil's advocate, though -- the alternative is for you just subclass HadoopRDD yourself, or make a totally new RDD, and then you could expose whatever you need. Why is this solution better? IMO the criteria are: (a) common operations (b) error-prone / difficult to implement (c) non-obvious, but important for performance I think this case fits (a) (c), so I think its still worthwhile. But its also worth asking whether or not its too difficult for a user to extend HadoopRDD right now. There have been several cases in the past week where we've suggested that a user should read from hdfs themselves (eg., to read multiple files together in one partition) -- with*out* reusing the code in HadoopRDD, though they would lose things like the metric tracking preferred locations you get from HadoopRDD. Does HadoopRDD need to some refactoring to make that easier to do? Or do we just need a good example? Imran (sorry for hijacking your thread, Koert) On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com wrote: see email below. reynold suggested i send it to dev instead of user -- Forwarded message -- From: Koert Kuipers ko...@tresata.com Date: Mon, Mar 23, 2015 at 4:36 PM Subject: hadoop input/output format advanced control To: u...@spark.apache.org u...@spark.apache.org currently its pretty hard to control the Hadoop Input/Output formats used in Spark. The conventions seems to be to add extra parameters to all methods and then somewhere deep inside the code (for example in PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into settings on the Hadoop Configuration object. for example for compression i see codec: Option[Class[_ : CompressionCodec]] = None added to a bunch of methods. how scalable is this solution really? for example i need to read from a hadoop dataset and i dont want the input (part) files to get split up. the way to do this is to set mapred.min.split.size. now i dont want to set this at the level of the SparkContext (which can be done), since i dont want it to apply to input formats in general. i want it to apply to just this one specific input dataset i need to read. which leaves me with no options currently. i could go add yet another input parameter to all the methods (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile, etc.). but that seems ineffective. why can we not expose a Map[String, String] or some other generic way to manipulate settings for hadoop input/output formats? it would require adding one more parameter to all methods to deal with hadoop input/output formats, but after that its done. 
one parameter to rule them all then i could do: val x = sc.textFile("/some/path", formatSettings = Map("mapred.min.split.size" -> "12345")) or rdd.saveAsTextFile("/some/path", formatSettings = Map("mapred.output.compress" -> "true", "mapred.output.compression.codec" -> "somecodec")) - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Experience using binary packages on various Hadoop distros
Hey All, For a while we've published binary packages with different Hadoop clients pre-bundled. We currently have three interfaces to a Hadoop cluster (a) the HDFS client (b) the YARN client (c) the Hive client. Because (a) and (b) are supposed to be backwards compatible interfaces, my working assumption was that for the most part (modulo Hive) our packages work with *newer* Hadoop versions. For instance, our Hadoop 2.4 package should work with HDFS 2.6 and YARN 2.6. However, I have heard murmurings that these are not compatible in practice. So I have three questions I'd like to put out to the community: 1. Have people had difficulty using 2.4 packages with newer Hadoop versions? If so, what specific incompatibilities have you hit? 2. Have people had issues using our binary Hadoop packages in general with commercial or Apache Hadoop distros, such that you have to build from source? 3. How would people feel about publishing a "bring your own Hadoop" binary, where you are required to point us to a local Hadoop distribution by setting HADOOP_HOME? This might be better for ensuring full compatibility: https://issues.apache.org/jira/browse/SPARK-6511 - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[jira] [Commented] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376495#comment-14376495 ] Patrick Wendell commented on SPARK-2331: By the way - [~rxin] recently pointed out to me that EmptyRDD is private[spark]. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/EmptyRDD.scala#L27 Given that, I'm sort of confused how people were using it before. I'm not totally sure how making a class private[spark] affects its use in a return type. SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T] -- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Ian Hummel Priority: Minor The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) {code} val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-6122) Upgrade Tachyon dependency to 0.6.0
[ https://issues.apache.org/jira/browse/SPARK-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-6122: I reverted this because it looks like it was responsible for some testing failures due to the dependency changes. Upgrade Tachyon dependency to 0.6.0 --- Key: SPARK-6122 URL: https://issues.apache.org/jira/browse/SPARK-6122 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Haoyuan Li Assignee: Calvin Jia -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: enum-like types in Spark
If the official solution from the Scala community is to use Java enums, then it seems strange they aren't generated in scaldoc? Maybe we can just fix that w/ Typesafe's help and then we can use them. On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen so...@cloudera.com wrote: Yeah the fully realized #4, which gets back the ability to use it in switch statements (? in Scala but not Java?) does end up being kind of huge. I confess I'm swayed a bit back to Java enums, seeing what it involves. The hashCode() issue can be 'solved' with the hash of the String representation. On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid iras...@cloudera.com wrote: I've just switched some of my code over to the new format, and I just want to make sure everyone realizes what we are getting into. I went from 10 lines as java enums https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20 to 30 lines with the new format: https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250 its not just that its verbose. each name has to be repeated 4 times, with potential typos in some locations that won't be caught by the compiler. Also, you have to manually maintain the values as you update the set of enums, the compiler won't do it for you. The only downside I've heard for java enums is enum.hashcode(). OTOH, the downsides for this version are: maintainability / verbosity, no values(), more cumbersome to use from java, no enum map / enumset. I did put together a little util to at least get back the equivalent of enum.valueOf() with this format https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala I'm not trying to prevent us from moving forward on this, its fine if this is still what everyone wants, but I feel pretty strongly java enums make more sense. thanks, Imran - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
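For readers following along, here is a hedged, illustrative sketch of the sealed-object pattern being compared against Java enums (the names are made up, not the actual StageStatus code); it shows where the repetition and the hand-maintained values list that Imran mentions come from:

```scala
// Illustrative "enum-like" Scala pattern: each constant is named several times, and the
// values sequence has to be kept in sync by hand, which the compiler will not check.
sealed abstract class Status(val name: String)

object Status {
  case object Active extends Status("ACTIVE")
  case object Complete extends Status("COMPLETE")
  case object Failed extends Status("FAILED")

  val values: Seq[Status] = Seq(Active, Complete, Failed)

  // Rough equivalent of Java's Enum.valueOf.
  def fromString(s: String): Status =
    values.find(_.name == s).getOrElse(
      throw new IllegalArgumentException(s"Unknown status: $s"))
}
```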
Re: DataFrame operation on parquet: GC overhead limit exceeded
Hey Yiannis, If you just perform a count on each name, date pair... can it succeed? If so, can you do a count and then order by to find the largest one? I'm wondering if there is a single pathologically large group here that is somehow causing OOM. Also, to be clear, you are getting GC limit warnings on the executors, not the driver. Correct? - Patrick On Mon, Mar 23, 2015 at 10:21 AM, Martin Goodson mar...@skimlinks.com wrote: Have you tried to repartition() your original data to make more partitions before you aggregate? -- Martin Goodson | VP Data Science (0)20 3397 1240 [image: Inline image 1] On Mon, Mar 23, 2015 at 4:12 PM, Yiannis Gkoufas johngou...@gmail.com wrote: Hi Yin, Yes, I have set spark.executor.memory to 8g and the worker memory to 16g without any success. I cannot figure out how to increase the number of mapPartitions tasks. Thanks a lot On 20 March 2015 at 18:44, Yin Huai yh...@databricks.com wrote: spark.sql.shuffle.partitions only control the number of tasks in the second stage (the number of reducers). For your case, I'd say that the number of tasks in the first state (number of mappers) will be the number of files you have. Actually, have you changed spark.executor.memory (it controls the memory for an executor of your application)? I did not see it in your original email. The difference between worker memory and executor memory can be found at ( http://spark.apache.org/docs/1.3.0/spark-standalone.html), SPARK_WORKER_MEMORY Total amount of memory to allow Spark applications to use on the machine, e.g. 1000m, 2g (default: total memory minus 1 GB); note that each application's individual memory is configured using its spark.executor.memory property. On Fri, Mar 20, 2015 at 9:25 AM, Yiannis Gkoufas johngou...@gmail.com wrote: Actually I realized that the correct way is: sqlContext.sql(set spark.sql.shuffle.partitions=1000) but I am still experiencing the same behavior/error. On 20 March 2015 at 16:04, Yiannis Gkoufas johngou...@gmail.com wrote: Hi Yin, the way I set the configuration is: val sqlContext = new org.apache.spark.sql.SQLContext(sc) sqlContext.setConf(spark.sql.shuffle.partitions,1000); it is the correct way right? In the mapPartitions task (the first task which is launched), I get again the same number of tasks and again the same error. :( Thanks a lot! On 19 March 2015 at 17:40, Yiannis Gkoufas johngou...@gmail.com wrote: Hi Yin, thanks a lot for that! Will give it a shot and let you know. On 19 March 2015 at 16:30, Yin Huai yh...@databricks.com wrote: Was the OOM thrown during the execution of first stage (map) or the second stage (reduce)? If it was the second stage, can you increase the value of spark.sql.shuffle.partitions and see if the OOM disappears? This setting controls the number of reduces Spark SQL will use and the default is 200. Maybe there are too many distinct values and the memory pressure on every task (of those 200 reducers) is pretty high. You can start with 400 and increase it until the OOM disappears. Hopefully this will help. Thanks, Yin On Wed, Mar 18, 2015 at 4:46 PM, Yiannis Gkoufas johngou...@gmail.com wrote: Hi Yin, Thanks for your feedback. I have 1700 parquet files, sized 100MB each. The number of tasks launched is equal to the number of parquet files. Do you have any idea on how to deal with this situation? Thanks a lot On 18 Mar 2015 17:35, Yin Huai yh...@databricks.com wrote: Seems there are too many distinct groups processed in a task, which trigger the problem. How many files do your dataset have and how large is a file? 
Seems your query will be executed with two stages, table scan and map-side aggregation in the first stage and the final round of reduce-side aggregation in the second stage. Can you take a look at the numbers of tasks launched in these two stages? Thanks, Yin On Wed, Mar 18, 2015 at 11:42 AM, Yiannis Gkoufas johngou...@gmail.com wrote: Hi there, I set the executor memory to 8g but it didn't help On 18 March 2015 at 13:59, Cheng Lian lian.cs@gmail.com wrote: You should probably increase executor memory by setting spark.executor.memory. Full list of available configurations can be found here http://spark.apache.org/docs/latest/configuration.html Cheng On 3/18/15 9:15 PM, Yiannis Gkoufas wrote: Hi there, I was trying the new DataFrame API with some basic operations on a parquet dataset. I have 7 nodes of 12 cores and 8GB RAM allocated to each worker in a standalone cluster mode. The code is the following: val people = sqlContext.parquetFile("/data.parquet"); val res = people.groupBy("name", "date").agg(sum("power"), sum("supply")).take(10); System.out.println(res); The dataset consists of 16 billion entries. The error I get is java.lang.OutOfMemoryError: GC overhead limit exceeded My configuration is: spark.serializer org.apache.spark.serializer.KryoSerializer
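Patrick's suggestion at the top of the thread can be sketched as follows (hedged: this assumes the 1.3 DataFrame API and the same sqlContext as in the report); the idea is to check whether a single (name, date) group is pathologically large before worrying about the aggregation itself:

```scala
// Count the rows in each (name, date) group and look at the biggest groups first.
val people = sqlContext.parquetFile("/data.parquet")
val groupSizes = people.groupBy("name", "date").count()
groupSizes.orderBy(groupSizes("count").desc).show(10)
```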
[jira] [Updated] (SPARK-6449) Driver OOM results in reported application result SUCCESS
[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6449: --- Component/s: (was: Spark Core) YARN Driver OOM results in reported application result SUCCESS - Key: SPARK-6449 URL: https://issues.apache.org/jira/browse/SPARK-6449 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0 Reporter: Ryan Williams I ran a job yesterday that according to the History Server and YARN RM finished with status {{SUCCESS}}. Clicking around on the history server UI, there were too few stages run, and I couldn't figure out why that would have been. Finally, inspecting the end of the driver's logs, I saw: {code} 15/03/20 15:08:13 INFO storage.BlockManagerMaster: BlockManagerMaster stopped 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 15/03/20 15:08:13 INFO spark.SparkContext: Successfully stopped SparkContext Exception in thread Driver scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit exceeded (of class java.lang.OutOfMemoryError) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485) 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0, (reason: Shutdown hook called before final status was reported.) 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before final status was reported.) 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down. 15/03/20 15:08:13 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered. 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1426705269584_0055 {code} The driver OOM'd, [the {{catch}} block that presumably should have caught it|https://github.com/apache/spark/blob/b6090f902e6ec24923b4dde4aabc9076956521c1/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L484] threw a {{MatchError}}, and then {{SUCCESS}} was returned to YARN and written to the event log. This should be logged as a failed job and reported as such to YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
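The failure mode described above can be illustrated with a small, self-contained sketch (this is not the actual ApplicationMaster code; the names and statuses are invented for the demonstration):

```scala
// A non-exhaustive pattern match inside an error handler lets an Error such as
// OutOfMemoryError escape as a scala.MatchError, so the surrounding machinery never
// records a failure and a default SUCCEEDED status can get reported instead.
object MatchErrorDemo {
  def reportFinalStatus(status: String): Unit = println(s"Final app status: $status")

  def main(args: Array[String]): Unit = {
    try {
      throw new OutOfMemoryError("GC overhead limit exceeded")
    } catch {
      case t: Throwable =>
        t match {
          case e: Exception => reportFinalStatus(s"FAILED: $e")
          // No case covers Error subclasses, so the OutOfMemoryError above becomes a
          // scala.MatchError here and reportFinalStatus is never called with FAILED.
        }
    }
  }
}
```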
[jira] [Updated] (SPARK-6456) Spark Sql throwing exception on large partitioned data
[ https://issues.apache.org/jira/browse/SPARK-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6456: --- Component/s: (was: Spark Core) Spark Sql throwing exception on large partitioned data -- Key: SPARK-6456 URL: https://issues.apache.org/jira/browse/SPARK-6456 Project: Spark Issue Type: Bug Components: SQL Reporter: pankaj Fix For: 1.2.1 Observation: Spark connects to the Hive metastore. I am able to run simple queries like show table and select, but it throws the exception below while running a query on a Hive table having a large number of partitions. {code} Exception in thread "main" java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40) at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out at org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785) at org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86) at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2858) Default log4j configuration no longer seems to work
[ https://issues.apache.org/jira/browse/SPARK-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2858. Resolution: Invalid This is really old and I don't think it's still an issue. I'm just closing this as invalid. Default log4j configuration no longer seems to work --- Key: SPARK-2858 URL: https://issues.apache.org/jira/browse/SPARK-2858 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell For reasons unknown this doesn't seem to be working anymore. I deleted my log4j.properties file and did a fresh build and I noticed it still gave me a verbose stack trace when port 4040 was contended (which is a log we silence in the conf). I actually think this was an issue even before [~sowen]'s changes, so not sure what's up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5863) Improve performance of convertToScala codepath.
[ https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375229#comment-14375229 ] Patrick Wendell commented on SPARK-5863: This seems worth potentially fixing in 1.3.1, so I added that. I think it will depend how surgical the fix is. Improve performance of convertToScala codepath. --- Key: SPARK-5863 URL: https://issues.apache.org/jira/browse/SPARK-5863 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.2.1 Reporter: Cristian Priority: Critical Was doing some perf testing on reading parquet files and noticed that moving from Spark 1.1 to 1.2 the performance is 3x worse. In the profiler the culprit showed up as being in ScalaReflection.convertRowToScala. Particularly this zip is the issue: {code} r.toSeq.zip(schema.fields.map(_.dataType)) {code} I see there's a comment on that currently that this is slow but it wasn't fixed. This actually produces a 3x degradation in parquet read performance, at least in my test case. Edit: the map is part of the issue as well. This whole code block is in a tight loop and allocates a new ListBuffer that needs to grow for each transformation. A possible solution is to change to using seq.view which would allocate iterators instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
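A hedged sketch of the direction the reporter suggests (not the actual fix that went in): hoist the per-field data types out of the per-row loop and avoid building an intermediate zipped collection for every row. It assumes the Row type and a convertToScala(value, dataType) helper as referenced in the description above.

```scala
// Compute the field types once, then convert each row with a plain loop instead of
// r.toSeq.zip(schema.fields.map(_.dataType)), which re-allocates on every row.
val fieldTypes = schema.fields.map(_.dataType)

def convertRowToScala(r: Row): Seq[Any] = {
  val converted = new Array[Any](fieldTypes.length)
  var i = 0
  while (i < fieldTypes.length) {
    converted(i) = convertToScala(r(i), fieldTypes(i))
    i += 1
  }
  converted   // an Array[Any] is implicitly wrapped as a Seq[Any]
}
```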
[jira] [Updated] (SPARK-5863) Improve performance of convertToScala codepath.
[ https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5863: --- Target Version/s: 1.3.1, 1.4.0 (was: 1.4.0) Improve performance of convertToScala codepath. --- Key: SPARK-5863 URL: https://issues.apache.org/jira/browse/SPARK-5863 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.2.1 Reporter: Cristian Priority: Critical I was doing some perf testing on reading Parquet files and noticed that, moving from Spark 1.1 to 1.2, performance is 3x worse. In the profiler the culprit showed up as ScalaReflection.convertRowToScala. In particular, this zip is the issue: {code} r.toSeq.zip(schema.fields.map(_.dataType)) {code} There is currently a comment there noting that this is slow, but it wasn't fixed. This actually produces a 3x degradation in Parquet read performance, at least in my test case. Edit: the map is part of the issue as well. This whole code block runs in a tight loop and allocates a new ListBuffer that needs to grow for each transformation. A possible solution is to switch to seq.view, which would allocate iterators instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4227) Document external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4227: --- Priority: Critical (was: Major) Document external shuffle service - Key: SPARK-4227 URL: https://issues.apache.org/jira/browse/SPARK-4227 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Priority: Critical We should add spark.shuffle.service.enabled to the Configuration page and give instructions for launching the shuffle service as an auxiliary service on YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
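As a starting point for the documentation this ticket asks for, the sketch below shows the application-side half of the setting. It assumes the external shuffle service is already running on each node (for YARN, as the auxiliary service mentioned above); the flag by itself does nothing without that, and the dynamic-allocation line is only the usual companion setting, not something this ticket requires.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Application-side configuration sketch; the shuffle service itself must be
// started separately on each node (e.g. as a YARN auxiliary service).
val conf = new SparkConf()
  .setAppName("external-shuffle-service-example")
  .set("spark.shuffle.service.enabled", "true")
  // Dynamic allocation is the usual reason to enable the shuffle service,
  // since executors can then be released without losing their shuffle files.
  .set("spark.dynamicAllocation.enabled", "true")

val sc = new SparkContext(conf)
{code}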
[jira] [Updated] (SPARK-4227) Document external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4227: --- Target Version/s: 1.3.1, 1.4.0 (was: 1.3.0, 1.4.0) Document external shuffle service - Key: SPARK-4227 URL: https://issues.apache.org/jira/browse/SPARK-4227 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza We should add spark.shuffle.service.enabled to the Configuration page and give instructions for launching the shuffle service as an auxiliary service on YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5863) Improve performance of convertToScala codepath.
[ https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5863: --- Target Version/s: 1.4.0 (was: 1.3.1, 1.4.0) Improve performance of convertToScala codepath. --- Key: SPARK-5863 URL: https://issues.apache.org/jira/browse/SPARK-5863 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.2.1 Reporter: Cristian Priority: Critical I was doing some perf testing on reading Parquet files and noticed that, moving from Spark 1.1 to 1.2, performance is 3x worse. In the profiler the culprit showed up as ScalaReflection.convertRowToScala. In particular, this zip is the issue: {code} r.toSeq.zip(schema.fields.map(_.dataType)) {code} There is currently a comment there noting that this is slow, but it wasn't fixed. This actually produces a 3x degradation in Parquet read performance, at least in my test case. Edit: the map is part of the issue as well. This whole code block runs in a tight loop and allocates a new ListBuffer that needs to grow for each transformation. A possible solution is to switch to seq.view, which would allocate iterators instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6012) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator
[ https://issues.apache.org/jira/browse/SPARK-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6012: --- Target Version/s: 1.4.0 (was: 1.3.1, 1.4.0) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator -- Key: SPARK-6012 URL: https://issues.apache.org/jira/browse/SPARK-6012 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Max Seiden Priority: Critical h3. Summary I've found that a deadlock occurs when asking for the partitions of a SchemaRDD that has a TakeOrdered as its terminal operator. The problem occurs when a child RDD asks the DAGScheduler for preferred partition locations (which locks the scheduler) and eventually hits the TakeOrdered operator's #execute(), which submits tasks but is then blocked when it also tries to get preferred locations (in a separate thread). It seems like the TakeOrdered op's #execute() method should not actually submit a task (it is calling #executeCollect() and creating a new RDD) and should instead stay truer to its comment: "logically apply a Limit on top of a Sort". In my particular case, I am forcing a repartition of a SchemaRDD with a terminal Limit(..., Sort(...)), which is where the CoalescedRDD comes into play. h3. Stack Traces h4. Task Submission {noformat} main prio=5 tid=0x7f8e7280 nid=0x1303 in Object.wait() [0x00010ed5e000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at java.lang.Object.wait(Object.java:503) at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) - locked 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1390) at org.apache.spark.rdd.RDD.reduce(RDD.scala:884) at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1161) at org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:183) at org.apache.spark.sql.execution.TakeOrdered.execute(basicOperators.scala:188) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) - locked 0x0007c36ce038 (a org.apache.spark.sql.hive.HiveContext$$anon$7) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:127) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1278) at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:122) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:79) at
org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1333) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1304) - locked 0x0007f55c2238 (a org.apache.spark.scheduler.DAGScheduler) at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1148) at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs
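A hypothetical reproduction of the scenario described in the report, for anyone triaging this: the context, table, and column names below are assumptions rather than details from the ticket, and the sketch only shows where the CoalescedRDD ends up on top of the TakeOrdered plan.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Assumed setup; the table "events" and column "ts" are illustrative only.
val sc = new SparkContext(new SparkConf().setAppName("spark-6012-repro-sketch"))
val hiveContext = new HiveContext(sc)

// A terminal Limit(Sort(...)) plans to a TakeOrdered operator.
val limited = hiveContext.sql("SELECT * FROM events ORDER BY ts LIMIT 100")

// Forcing a repartition layers a CoalescedRDD over that plan; per the report,
// asking for its partitions triggers the preferred-location lookup that can
// deadlock against the job submitted from TakeOrdered#execute().
val repartitioned = limited.repartition(8)
repartitioned.partitions
{code}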
[jira] [Updated] (SPARK-6012) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator
[ https://issues.apache.org/jira/browse/SPARK-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6012: --- Target Version/s: 1.3.1, 1.4.0 (was: 1.4.0) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator -- Key: SPARK-6012 URL: https://issues.apache.org/jira/browse/SPARK-6012 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Max Seiden Priority: Critical h3. Summary I've found that a deadlock occurs when asking for the partitions of a SchemaRDD that has a TakeOrdered as its terminal operator. The problem occurs when a child RDD asks the DAGScheduler for preferred partition locations (which locks the scheduler) and eventually hits the TakeOrdered operator's #execute(), which submits tasks but is then blocked when it also tries to get preferred locations (in a separate thread). It seems like the TakeOrdered op's #execute() method should not actually submit a task (it is calling #executeCollect() and creating a new RDD) and should instead stay truer to its comment: "logically apply a Limit on top of a Sort". In my particular case, I am forcing a repartition of a SchemaRDD with a terminal Limit(..., Sort(...)), which is where the CoalescedRDD comes into play. h3. Stack Traces h4. Task Submission {noformat} main prio=5 tid=0x7f8e7280 nid=0x1303 in Object.wait() [0x00010ed5e000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at java.lang.Object.wait(Object.java:503) at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) - locked 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1390) at org.apache.spark.rdd.RDD.reduce(RDD.scala:884) at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1161) at org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:183) at org.apache.spark.sql.execution.TakeOrdered.execute(basicOperators.scala:188) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) - locked 0x0007c36ce038 (a org.apache.spark.sql.hive.HiveContext$$anon$7) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:127) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1278) at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:122) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:79) at
org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1333) at org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1304) - locked 0x0007f55c2238 (a org.apache.spark.scheduler.DAGScheduler) at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1148) at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs
[jira] [Commented] (SPARK-5863) Improve performance of convertToScala codepath.
[ https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375230#comment-14375230 ] Patrick Wendell commented on SPARK-5863: Ah, actually - I see [~marmbrus] was the one who set the target to 1.4.0, so I'm gonna remove 1.3.1. Improve performance of convertToScala codepath. --- Key: SPARK-5863 URL: https://issues.apache.org/jira/browse/SPARK-5863 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0, 1.2.1 Reporter: Cristian Priority: Critical I was doing some perf testing on reading Parquet files and noticed that, moving from Spark 1.1 to 1.2, performance is 3x worse. In the profiler the culprit showed up as ScalaReflection.convertRowToScala. In particular, this zip is the issue: {code} r.toSeq.zip(schema.fields.map(_.dataType)) {code} There is currently a comment there noting that this is slow, but it wasn't fixed. This actually produces a 3x degradation in Parquet read performance, at least in my test case. Edit: the map is part of the issue as well. This whole code block runs in a tight loop and allocates a new ListBuffer that needs to grow for each transformation. A possible solution is to switch to seq.view, which would allocate iterators instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4925: --- Fix Version/s: (was: 1.2.1) (was: 1.3.0) Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 1.3.0, 1.2.1 Reporter: Alex Liu The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
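For context on who consumes this, a downstream build would reference the artifact roughly as in the sbt sketch below. The coordinates are an assumption based on Spark's usual artifact naming, with a placeholder version; they would only resolve once the artifact is actually published, which is exactly what this ticket asks for.
{code}
// build.sbt sketch; assumed coordinates and placeholder version.
libraryDependencies += "org.apache.spark" %% "spark-hive-thriftserver" % "1.3.1"
{code}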
[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4925: --- Priority: Critical (was: Major) Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 1.3.0, 1.2.1 Reporter: Alex Liu Priority: Critical The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4925: --- Affects Version/s: (was: 1.2.0) 1.3.0 1.2.1 Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 1.3.0, 1.2.1 Reporter: Alex Liu Priority: Critical The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4123) Show dependency changes in pull requests
[ https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4123: --- Summary: Show dependency changes in pull requests (was: Show new dependencies added in pull requests) Show dependency changes in pull requests Key: SPARK-4123 URL: https://issues.apache.org/jira/browse/SPARK-4123 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Brennon York Priority: Critical We should inspect the classpath of Spark's assembly jar for every pull request. This only takes a few seconds in Maven and it will help surface dependency changes relative to the master branch. Ideally we'd post any dependency changes in the pull request message. {code}
$ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr ':' '\n' | awk -F/ '{print $NF}' | sort > my-classpath
$ git checkout apache/master
$ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr ':' '\n' | awk -F/ '{print $NF}' | sort > master-classpath
$ diff my-classpath master-classpath
< chill-java-0.3.6.jar
< chill_2.10-0.3.6.jar
---
> chill-java-0.5.0.jar
> chill_2.10-0.5.0.jar
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-4925: Thanks for bringing this up. Actually, I realized this wasn't fixed by some of the other work we did. The issue is that we never published hive-thriftserver before (so simply undoing the changes I made didn't make this work for hive-thriftserver). We just need to add the -Phive-thriftserver profile here: https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L122 If someone wants to send a patch I can merge it, and we can fix it for 1.3.1. Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 1.3.0, 1.2.1 Reporter: Alex Liu The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4925: --- Target Version/s: 1.3.1 Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 1.3.0, 1.2.1 Reporter: Alex Liu Priority: Critical The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org