[jira] [Updated] (SPARK-6942) Umbrella: UI Visualizations for Core and Dataframes

2015-04-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6942:
---
Component/s: Web UI

 Umbrella: UI Visualizations for Core and Dataframes 
 

 Key: SPARK-6942
 URL: https://issues.apache.org/jira/browse/SPARK-6942
 Project: Spark
  Issue Type: Umbrella
  Components: Spark Core, SQL, Web UI
Reporter: Patrick Wendell
Assignee: Patrick Wendell

 This is an umbrella issue for the assorted visualization proposals for 
 Spark's UI. The scope will likely cover Spark 1.4 and 1.5.






[jira] [Created] (SPARK-6942) Umbrella: UI Visualizations for Core and Dataframes

2015-04-15 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-6942:
--

 Summary: Umbrella: UI Visualizations for Core and Dataframes 
 Key: SPARK-6942
 URL: https://issues.apache.org/jira/browse/SPARK-6942
 Project: Spark
  Issue Type: Umbrella
  Components: Spark Core, SQL
Reporter: Patrick Wendell
Assignee: Patrick Wendell


This is an umbrella issue for the assorted visualization proposals for Spark's 
UI. The scope will likely cover Spark 1.4 and 1.5.






[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature

2015-04-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3468:
---
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-6942

 WebUI Timeline-View feature
 ---

 Key: SPARK-3468
 URL: https://issues.apache.org/jira/browse/SPARK-3468
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
 Attachments: ApplicationTimeliView.png, JobTimelineView.png, 
 TaskAssignmentTimelineView.png


 I sometimes troubleshoot and analyse the cause of long-running jobs.
 To do so, I first find the stages that take a long time or fail, then the
 tasks that take a long time or fail, and finally analyse the proportion of
 each phase within a task.
 In other cases, I find executors that spend a long time running a task and
 analyse the details of that task.
 In such situations, it would be helpful to visualize a timeline view of stages
 / tasks / executors, along with the proportion of activity within each task.
 I'm currently developing prototypes like the attached captures.
 I'll integrate these viewers into the WebUI.






[jira] [Created] (SPARK-6943) Graphically show RDD's included in a stage

2015-04-15 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-6943:
--

 Summary: Graphically show RDD's included in a stage
 Key: SPARK-6943
 URL: https://issues.apache.org/jira/browse/SPARK-6943
 Project: Spark
  Issue Type: Sub-task
Reporter: Patrick Wendell
Assignee: Andrew Or









[jira] [Updated] (SPARK-3468) Provide timeline view in Job and Stage pages

2015-04-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3468:
---
Summary: Provide timeline view in Job and Stage pages  (was: WebUI 
Timeline-View feature)

 Provide timeline view in Job and Stage pages
 

 Key: SPARK-3468
 URL: https://issues.apache.org/jira/browse/SPARK-3468
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
 Attachments: ApplicationTimeliView.png, JobTimelineView.png, 
 TaskAssignmentTimelineView.png


 I sometimes troubleshoot and analyse the cause of long-running jobs.
 To do so, I first find the stages that take a long time or fail, then the
 tasks that take a long time or fail, and finally analyse the proportion of
 each phase within a task.
 In other cases, I find executors that spend a long time running a task and
 analyse the details of that task.
 In such situations, it would be helpful to visualize a timeline view of stages
 / tasks / executors, along with the proportion of activity within each task.
 I'm currently developing prototypes like the attached captures.
 I'll integrate these viewers into the WebUI.






[jira] [Updated] (SPARK-3468) Provide timeline view in Job and Stage UI pages

2015-04-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3468:
---
Summary: Provide timeline view in Job and Stage UI pages  (was: Provide 
timeline view in Job and Stage pages)

 Provide timeline view in Job and Stage UI pages
 ---

 Key: SPARK-3468
 URL: https://issues.apache.org/jira/browse/SPARK-3468
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
 Attachments: ApplicationTimeliView.png, JobTimelineView.png, 
 TaskAssignmentTimelineView.png


 I sometimes troubleshoot and analyse the cause of long-running jobs.
 To do so, I first find the stages that take a long time or fail, then the
 tasks that take a long time or fail, and finally analyse the proportion of
 each phase within a task.
 In other cases, I find executors that spend a long time running a task and
 analyse the details of that task.
 In such situations, it would be helpful to visualize a timeline view of stages
 / tasks / executors, along with the proportion of activity within each task.
 I'm currently developing prototypes like the attached captures.
 I'll integrate these viewers into the WebUI.






[jira] [Updated] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed

2015-04-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6950:
---
Component/s: Web UI

 Spark master UI believes some applications are in progress when they are 
 actually completed
 ---

 Key: SPARK-6950
 URL: https://issues.apache.org/jira/browse/SPARK-6950
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Matt Cheah

 In Spark 1.2.x, I was able to set my Spark event log directory to a 
 location different from the default, and after the job finished I could 
 replay the UI by clicking on the appropriate link under Completed Applications.
 Now, non-deterministically (though it seems to happen most of the time), 
 when I click on the link under Completed Applications, I instead get a 
 webpage that says:
 Application history not found (app-20150415052927-0014)
 Application myApp is still in progress.
 I am able to view the application's UI using the Spark history server, so 
 something regressed in the Spark master code between 1.2 and 1.3, but that 
 regression does not apply in the history server use case.






Re: [VOTE] Release Apache Spark 1.2.2

2015-04-14 Thread Patrick Wendell
I'd like to close this vote to coincide with the 1.3.1 release;
however, it would be great to have more people test this release
first. I'll leave it open a bit longer to see if others can give
a +1.

On Tue, Apr 14, 2015 at 9:55 PM, Patrick Wendell pwend...@gmail.com wrote:
 +1 from me as well.

 On Tue, Apr 7, 2015 at 4:36 AM, Sean Owen so...@cloudera.com wrote:
 I think that's close enough for a +1:

 Signatures and hashes are good.
 LICENSE, NOTICE still check out.
 Compiles for a Hadoop 2.6 + YARN + Hive profile.

 JIRAs with target version = 1.2.x look legitimate; no blockers.

 I still observe several Hive test failures with:
 mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0
 -DskipTests clean package; mvn -Phadoop-2.4 -Pyarn -Phive
 -Phive-0.13.1 -Dhadoop.version=2.6.0 test
 .. though again I think these are not regressions but known issues in
 older branches.

 FYI there are 16 Critical issues still open for 1.2.x:

 SPARK-6209,ExecutorClassLoader can leak connections after failing to
 load classes from the REPL class server,Josh Rosen,In Progress,4/5/15
 SPARK-5098,Number of running tasks become negative after tasks
 lost,,Open,1/14/15
 SPARK-4888,Spark EC2 doesn't mount local disks for i2.8xlarge
 instances,,Open,1/27/15
 SPARK-4879,Missing output partitions after job completes with
 speculative execution,Josh Rosen,Open,3/5/15
 SPARK-4568,Publish release candidates under $VERSION-RCX instead of
 $VERSION,Patrick Wendell,Open,11/24/14
 SPARK-4520,SparkSQL exception when reading certain columns from a
 parquet file,sadhan sood,Open,1/21/15
 SPARK-4514,SparkContext localProperties does not inherit property
 updates across thread reuse,Josh Rosen,Open,3/31/15
 SPARK-4454,Race condition in DAGScheduler,Josh Rosen,Reopened,2/18/15
 SPARK-4452,Shuffle data structures can starve others on the same
 thread for memory,Tianshuo Deng,Open,1/24/15
 SPARK-4356,Test Scala 2.11 on Jenkins,Patrick Wendell,Open,11/12/14
 SPARK-4258,NPE with new Parquet Filters,Cheng Lian,Reopened,4/3/15
 SPARK-4194,Exceptions thrown during SparkContext or SparkEnv
 construction might lead to resource leaks or corrupted global
 state,,In Progress,4/2/15
 SPARK-4159,Maven build doesn't run JUnit test suites,Sean Owen,Open,1/11/15
 SPARK-4106,Shuffle write and spill to disk metrics are 
 incorrect,,Open,10/28/14
 SPARK-3492,Clean up Yarn integration code,Andrew Or,Open,9/12/14
 SPARK-3461,Support external groupByKey using
 repartitionAndSortWithinPartitions,Sandy Ryza,Open,11/10/14
 SPARK-2984,FileNotFoundException on _temporary directory,,Open,12/11/14
 SPARK-2532,Fix issues with consolidated shuffle,,Open,3/26/15
 SPARK-1312,Batch should read based on the batch interval provided in
 the StreamingContext,Tathagata Das,Open,12/24/14

 On Sun, Apr 5, 2015 at 7:24 PM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.2.2!

 The tag to be voted on is v1.2.2-rc1 (commit 7531b50):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7531b50e406ee2e3301b009ceea7c684272b2e27

 The list of fixes present in this release can be found at:
 http://bit.ly/1DCNddt

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.2-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1082/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.2-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.2.2!

 The vote is open until Thursday, April 08, at 00:30 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.2.2
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Re: [VOTE] Release Apache Spark 1.3.1 (RC3)

2015-04-14 Thread Patrick Wendell
+1 from myself as well

On Mon, Apr 13, 2015 at 8:35 PM, GuoQiang Li wi...@qq.com wrote:
 +1 (non-binding)


 -- Original --
 From:  Patrick Wendell;pwend...@gmail.com;
 Date:  Sat, Apr 11, 2015 02:05 PM
 To:  dev@spark.apache.orgdev@spark.apache.org;
 Subject:  [VOTE] Release Apache Spark 1.3.1 (RC3)

 Please vote on releasing the following candidate as Apache Spark version
 1.3.1!

 The tag to be voted on is v1.3.1-rc2 (commit 3e83913):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e8391327ba586eaf54447043bd526d919043a44

 The list of fixes present in this release can be found at:
 http://bit.ly/1C2nVPY

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc3/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1088/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc3-docs/

 The patches on top of RC2 are:
 [SPARK-6851] [SQL] Create new instance for each converted parquet relation
 [SPARK-5969] [PySpark] Fix descending pyspark.rdd.sortByKey.
 [SPARK-6343] Doc driver-worker network reqs
 [SPARK-6767] [SQL] Fixed Query DSL error in spark sql Readme
 [SPARK-6781] [SQL] use sqlContext in python shell
 [SPARK-6753] Clone SparkConf in ShuffleSuite tests
 [SPARK-6506] [PySpark] Do not try to retrieve SPARK_HOME when not needed...

 Please vote on releasing this package as Apache Spark 1.3.1!

 The vote is open until Tuesday, April 14, at 07:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.3.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




[RESULT] [VOTE] Release Apache Spark 1.3.1 (RC3)

2015-04-14 Thread Patrick Wendell
This vote passes with 10 +1 votes (5 binding) and no 0 or -1 votes.

+1:
Sean Owen*
Reynold Xin*
Krishna Sankar
Denny Lee
Mark Hamstra*
Sean McNamara*
Sree V
Marcelo Vanzin
GuoQiang Li
Patrick Wendell*

0:

-1:

I will work on packaging this release in the next 48 hours.

- Patrick




Re: [VOTE] Release Apache Spark 1.2.2

2015-04-14 Thread Patrick Wendell
+1 from me as well.

On Tue, Apr 7, 2015 at 4:36 AM, Sean Owen so...@cloudera.com wrote:
 I think that's close enough for a +1:

 Signatures and hashes are good.
 LICENSE, NOTICE still check out.
 Compiles for a Hadoop 2.6 + YARN + Hive profile.

 JIRAs with target version = 1.2.x look legitimate; no blockers.

 I still observe several Hive test failures with:
 mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0
 -DskipTests clean package; mvn -Phadoop-2.4 -Pyarn -Phive
 -Phive-0.13.1 -Dhadoop.version=2.6.0 test
 .. though again I think these are not regressions but known issues in
 older branches.

 FYI there are 16 Critical issues still open for 1.2.x:

 SPARK-6209,ExecutorClassLoader can leak connections after failing to
 load classes from the REPL class server,Josh Rosen,In Progress,4/5/15
 SPARK-5098,Number of running tasks become negative after tasks
 lost,,Open,1/14/15
 SPARK-4888,Spark EC2 doesn't mount local disks for i2.8xlarge
 instances,,Open,1/27/15
 SPARK-4879,Missing output partitions after job completes with
 speculative execution,Josh Rosen,Open,3/5/15
 SPARK-4568,Publish release candidates under $VERSION-RCX instead of
 $VERSION,Patrick Wendell,Open,11/24/14
 SPARK-4520,SparkSQL exception when reading certain columns from a
 parquet file,sadhan sood,Open,1/21/15
 SPARK-4514,SparkContext localProperties does not inherit property
 updates across thread reuse,Josh Rosen,Open,3/31/15
 SPARK-4454,Race condition in DAGScheduler,Josh Rosen,Reopened,2/18/15
 SPARK-4452,Shuffle data structures can starve others on the same
 thread for memory,Tianshuo Deng,Open,1/24/15
 SPARK-4356,Test Scala 2.11 on Jenkins,Patrick Wendell,Open,11/12/14
 SPARK-4258,NPE with new Parquet Filters,Cheng Lian,Reopened,4/3/15
 SPARK-4194,Exceptions thrown during SparkContext or SparkEnv
 construction might lead to resource leaks or corrupted global
 state,,In Progress,4/2/15
 SPARK-4159,Maven build doesn't run JUnit test suites,Sean Owen,Open,1/11/15
 SPARK-4106,Shuffle write and spill to disk metrics are 
 incorrect,,Open,10/28/14
 SPARK-3492,Clean up Yarn integration code,Andrew Or,Open,9/12/14
 SPARK-3461,Support external groupByKey using
 repartitionAndSortWithinPartitions,Sandy Ryza,Open,11/10/14
 SPARK-2984,FileNotFoundException on _temporary directory,,Open,12/11/14
 SPARK-2532,Fix issues with consolidated shuffle,,Open,3/26/15
 SPARK-1312,Batch should read based on the batch interval provided in
 the StreamingContext,Tathagata Das,Open,12/24/14

 On Sun, Apr 5, 2015 at 7:24 PM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.2.2!

 The tag to be voted on is v1.2.2-rc1 (commit 7531b50):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7531b50e406ee2e3301b009ceea7c684272b2e27

 The list of fixes present in this release can be found at:
 http://bit.ly/1DCNddt

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.2-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1082/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.2-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.2.2!

 The vote is open until Thursday, April 08, at 00:30 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.2.2
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492888#comment-14492888
 ] 

Patrick Wendell commented on SPARK-6703:


Hey [~ilganeli] - sure thing. I've pinged a couple of people to provide 
feedback on the design. Overall I think it won't be a complicated feature to 
implement. I've added you as the assignee. One note, if it gets very close to 
the 1.4 code freeze I may need to help take it across the finish line. But for 
now why don't you go ahead, I think we'll be fine.

 Provide a way to discover existing SparkContext's
 -

 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell

 Right now it is difficult to write a Spark application in a way that can be 
 run independently and also be composed with other Spark applications in an 
 environment such as the JobServer, notebook servers, etc where there is a 
 shared SparkContext.
 It would be nice to provide a rendezvous point so that applications can 
 learn whether a SparkContext already exists before creating one.
 The simplest, most surgical way I see to do this is to have an optional static 
 SparkContext singleton that can be retrieved as follows:
 {code}
 val sc = SparkContext.getOrCreate(conf = new SparkConf())
 {code}
 And you could also have a setter where some outer framework/server can set it 
 for use by multiple downstream applications.
 A more advanced version of this would have some named registry or something, 
 but since we only support a single SparkContext in one JVM at this point 
 anyways, this seems sufficient and much simpler. Another advanced option 
 would be to allow plugging in some other notion of configuration you'd pass 
 when retrieving an existing context.
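
For illustration, a minimal sketch (API not final; only getOrCreate appears in the
proposal above) of how two call sites in the same JVM would share a context:

{code}
// Sketch only: the first caller creates the SparkContext, later callers get
// the existing one back instead of constructing a second context.
import org.apache.spark.{SparkConf, SparkContext}

val sc1 = SparkContext.getOrCreate(new SparkConf().setAppName("shared-app"))
val sc2 = SparkContext.getOrCreate(new SparkConf())  // returns the same instance
assert(sc1 eq sc2)
{code}

An outer framework such as a job server would make the first call and let
downstream applications pick up the existing context through the same entry point.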






[jira] [Updated] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6703:
---
Assignee: Ilya Ganelin

 Provide a way to discover existing SparkContext's
 -

 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Ilya Ganelin

 Right now it is difficult to write a Spark application in a way that can be 
 run independently and also be composed with other Spark applications in an 
 environment such as the JobServer, notebook servers, etc where there is a 
 shared SparkContext.
 It would be nice to provide a rendezvous point so that applications can 
 learn whether a SparkContext already exists before creating one.
 The simplest, most surgical way I see to do this is to have an optional static 
 SparkContext singleton that can be retrieved as follows:
 {code}
 val sc = SparkContext.getOrCreate(conf = new SparkConf())
 {code}
 And you could also have a setter where some outer framework/server can set it 
 for use by multiple downstream applications.
 A more advanced version of this would have some named registry or something, 
 but since we only support a single SparkContext in one JVM at this point 
 anyways, this seems sufficient and much simpler. Another advanced option 
 would be to allow plugging in some other notion of configuration you'd pass 
 when retrieving an existing context.






[jira] [Updated] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6703:
---
Priority: Critical  (was: Major)

 Provide a way to discover existing SparkContext's
 -

 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
Priority: Critical

 Right now it is difficult to write a Spark application in a way that can be 
 run independently and also be composed with other Spark applications in an 
 environment such as the JobServer, notebook servers, etc where there is a 
 shared SparkContext.
 It would be nice to provide a rendezvous point so that applications can 
 learn whether a SparkContext already exists before creating one.
 The simplest, most surgical way I see to do this is to have an optional static 
 SparkContext singleton that can be retrieved as follows:
 {code}
 val sc = SparkContext.getOrCreate(conf = new SparkConf())
 {code}
 And you could also have a setter where some outer framework/server can set it 
 for use by multiple downstream applications.
 A more advanced version of this would have some named registry or something, 
 but since we only support a single SparkContext in one JVM at this point 
 anyways, this seems sufficient and much simpler. Another advanced option 
 would be to allow plugging in some other notion of configuration you'd pass 
 when retrieving an existing context.






Re: Configuring amount of disk space available to spark executors in mesos?

2015-04-13 Thread Patrick Wendell
Hey Jonathan,

Are you referring to disk space used for storing persisted RDD's? For
that, Spark does not bound the amount of data persisted to disk. It's
a similar story to how Spark's shuffle disk output works (and also
Hadoop and other frameworks make this assumption as well for their
shuffle data, AFAIK).

We could (in theory) add a storage level that bounds the amount of
data persisted to disk and forces re-computation if the partition did
not fit. I'd be interested to hear more about a workload where that's
relevant though, before going that route. Maybe if people are using
SSD's that would make sense.
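
For reference, a minimal sketch (paths and names illustrative) of the behavior
described above: disk-persisted blocks and shuffle spill files land under
spark.local.dir, and Spark does not cap how much space accumulates there.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// spark.local.dir is where disk-persisted blocks and shuffle files are written;
// there is no size bound on that directory today.
val conf = new SparkConf()
  .setAppName("disk-usage-sketch")
  .set("spark.local.dir", "/mnt/spark-scratch")

val sc = new SparkContext(conf)
val data = sc.textFile("hdfs:///data/large-input")
data.persist(StorageLevel.DISK_ONLY)  // materialized to spark.local.dir on first action
data.count()
{code}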

- Patrick

On Mon, Apr 13, 2015 at 8:19 AM, Jonathan Coveney jcove...@gmail.com wrote:
 I'm surprised that I haven't been able to find this via google, but I
 haven't...

 What is the setting that requests some amount of disk space for the
 executors? Maybe I'm misunderstanding how this is configured...

 Thanks for any help!




[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492898#comment-14492898
 ] 

Patrick Wendell commented on SPARK-6703:


/cc [~velvia]

 Provide a way to discover existing SparkContext's
 -

 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
Priority: Critical

 Right now it is difficult to write a Spark application in a way that can be 
 run independently and also be composed with other Spark applications in an 
 environment such as the JobServer, notebook servers, etc where there is a 
 shared SparkContext.
 It would be nice to provide a rendezvous point so that applications can 
 learn whether a SparkContext already exists before creating one.
 The simplest, most surgical way I see to do this is to have an optional static 
 SparkContext singleton that can be retrieved as follows:
 {code}
 val sc = SparkContext.getOrCreate(conf = new SparkConf())
 {code}
 And you could also have a setter where some outer framework/server can set it 
 for use by multiple downstream applications.
 A more advanced version of this would have some named registry or something, 
 but since we only support a single SparkContext in one JVM at this point 
 anyways, this seems sufficient and much simpler. Another advanced option 
 would be to allow plugging in some other notion of configuration you'd pass 
 when retrieving an existing context.






[jira] [Comment Edited] (SPARK-6511) Publish hadoop provided build with instructions for different distros

2015-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493183#comment-14493183
 ] 

Patrick Wendell edited comment on SPARK-6511 at 4/13/15 10:11 PM:
--

Just as an example I tried to wire Spark to work with stock Hadoop 2.6. Here is 
how I got it running after doing a hadoop-provided build. This is pretty 
clunky, so I wonder if we should just support setting HADOOP_HOME or something 
and we can automatically find and add the jar files present within that folder.

{code}
export SPARK_DIST_CLASSPATH=$(find /tmp/hadoop-2.6.0/ -name "*.jar" | tr "\n" ":")
./bin/spark-shell
{code}

[~vanzin] for your CDH packages, what do you end up setting 
SPARK_DIST_CLASSPATH to?

/cc [~srowen]


was (Author: pwendell):
Just as an example I tried to wire Spark to work with stock Hadoop 2.6. Here is 
how I got it running after doing a hadoop-provided build. This is pretty 
clunky, so I wonder if we should just support setting HADOOP_HOME or something 
and we can automatically find and add the jar files present within that folder.

{code}
export SPARK_DIST_CLASSPATH=$(find /tmp/hadoop-2.6.0/ -name "*.jar" | tr "\n" ":")
./bin/spark-shell
{code}

[~vanzin] for your CDH packages, what do you end up setting 
SPARK_DIST_CLASSPATH to?

 Publish hadoop provided build with instructions for different distros
 ---

 Key: SPARK-6511
 URL: https://issues.apache.org/jira/browse/SPARK-6511
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell

 Currently we publish a series of binaries with different Hadoop client jars. 
 This mostly works, but some users have reported compatibility issues with 
 different distributions.
 One improvement moving forward might be to publish a binary build that simply 
 asks you to set HADOOP_HOME to pick up the Hadoop client location. That way 
 it would work across multiple distributions, even if they have subtle 
 incompatibilities with upstream Hadoop.
 I think a first step for this would be to produce such a build for the 
 community and see how well it works. One potential issue is that our fancy 
 excludes and dependency re-writing won't work with the simpler "append 
 Hadoop's classpath to Spark" approach. Also, how we deal with the Hive dependency is 
 unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes 
 for dependency conflicts) or do we allow for linking against vanilla Hive at 
 runtime.






[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493179#comment-14493179
 ] 

Patrick Wendell commented on SPARK-6703:


Yes, ideally we get it into 1.4 - though I think the ultimate solution here 
could be a very small patch.

 Provide a way to discover existing SparkContext's
 -

 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
Priority: Critical

 Right now it is difficult to write a Spark application in a way that can be 
 run independently and also be composed with other Spark applications in an 
 environment such as the JobServer, notebook servers, etc where there is a 
 shared SparkContext.
 It would be nice to provide a rendezvous point so that applications can 
 learn whether a SparkContext already exists before creating one.
 The simplest, most surgical way I see to do this is to have an optional static 
 SparkContext singleton that can be retrieved as follows:
 {code}
 val sc = SparkContext.getOrCreate(conf = new SparkConf())
 {code}
 And you could also have a setter where some outer framework/server can set it 
 for use by multiple downstream applications.
 A more advanced version of this would have some named registry or something, 
 but since we only support a single SparkContext in one JVM at this point 
 anyways, this seems sufficient and much simpler. Another advanced option 
 would be to allow plugging in some other notion of configuration you'd pass 
 when retrieving an existing context.






[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros

2015-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493183#comment-14493183
 ] 

Patrick Wendell commented on SPARK-6511:


Just as an example I tried to wire Spark to work with stock Hadoop 2.6. Here is 
how I got it running after doing a hadoop-provided build. This is pretty 
clunky, so I wonder if we should just support setting HADOOP_HOME or something 
and we can automatically find and add the jar files present within that folder.

{code}
export SPARK_DIST_CLASSPATH=$(find /tmp/hadoop-2.6.0/ -name "*.jar" | tr "\n" ":")
./bin/spark-shell
{code}

[~vanzin] for your CDH packages, what do you end up setting 
SPARK_DIST_CLASSPATH to?

 Publish hadoop provided build with instructions for different distros
 ---

 Key: SPARK-6511
 URL: https://issues.apache.org/jira/browse/SPARK-6511
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell

 Currently we publish a series of binaries with different Hadoop client jars. 
 This mostly works, but some users have reported compatibility issues with 
 different distributions.
 One improvement moving forward might be to publish a binary build that simply 
 asks you to set HADOOP_HOME to pick up the Hadoop client location. That way 
 it would work across multiple distributions, even if they have subtle 
 incompatibilities with upstream Hadoop.
 I think a first step for this would be to produce such a build for the 
 community and see how well it works. One potential issue is that our fancy 
 excludes and dependency re-writing won't work with the simpler "append 
 Hadoop's classpath to Spark" approach. Also, how we deal with the Hive dependency is 
 unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes 
 for dependency conflicts) or do we allow for linking against vanilla Hive at 
 runtime.






[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros

2015-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493274#comment-14493274
 ] 

Patrick Wendell commented on SPARK-6511:


Can we just run HADOOP_HOME/bin/hadoop classpath and then capture the result? 
I'm wondering if there is a standard interface here we can expect most Hadoop 
distributions to have.

 Publish hadoop provided build with instructions for different distros
 ---

 Key: SPARK-6511
 URL: https://issues.apache.org/jira/browse/SPARK-6511
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell

 Currently we publish a series of binaries with different Hadoop client jars. 
 This mostly works, but some users have reported compatibility issues with 
 different distributions.
 One improvement moving forward might be to publish a binary build that simply 
 asks you to set HADOOP_HOME to pick up the Hadoop client location. That way 
 it would work across multiple distributions, even if they have subtle 
 incompatibilities with upstream Hadoop.
 I think a first step for this would be to produce such a build for the 
 community and see how well it works. One potential issue is that our fancy 
 excludes and dependency re-writing won't work with the simpler "append 
 Hadoop's classpath to Spark" approach. Also, how we deal with the Hive dependency is 
 unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes 
 for dependency conflicts) or do we allow for linking against vanilla Hive at 
 runtime.






[jira] [Commented] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules

2015-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493254#comment-14493254
 ] 

Patrick Wendell commented on SPARK-6889:


Thanks for posting this Sean. Overall, I think this is a big improvement. Some 
comments on the proposed JIRA workflow changes:

1. I think logically Affects Version/s is required only for bugs, right? Is 
there a well defined meaning for Affects Version/s for a new feature that is 
distinct from Target Version/s?
2. I am not sure you can restrict certain priority levels to certain roles, but 
if so that would be really nice.

 Streamline contribution process with update to Contribution wiki, JIRA rules
 

 Key: SPARK-6889
 URL: https://issues.apache.org/jira/browse/SPARK-6889
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Sean Owen
Assignee: Sean Owen
 Attachments: ContributingtoSpark.pdf, 
 SparkProjectMechanicsChallenges.pdf


 From about 6 months of intimate experience with the Spark JIRA and the 
 reality of the JIRA / PR flow, I've observed some challenges, problems and 
 growing pains that have begun to encumber the project mechanics. In the 
 attached SparkProjectMechanicsChallenges.pdf document, I've collected these 
 observations and a few statistics that summarize much of what I've seen. From 
 side conversations with several of you, I think some of these will resonate. 
 (Read it first for this to make sense.)
 I'd like to improve just one aspect to start: the contribution process. A lot 
 of inbound contribution effort gets misdirected, and can burn a lot of cycles 
 for everyone, and that's a barrier to scaling up further and to general 
 happiness. I'd like to propose for discussion a change to the wiki pages, and 
 a change to some JIRA settings. 
 *Wiki*
 - Replace 
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark with 
 proposed text (NewContributingToSpark.pdf)
 - Delete 
 https://cwiki.apache.org/confluence/display/SPARK/Reviewing+and+Merging+Patches
  as it is subsumed by the new text
 - Move the IDE Setup section to 
 https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools
 - Delete 
 https://cwiki.apache.org/confluence/display/SPARK/Jira+Permissions+Scheme as 
 it's a  bit out of date and not all that useful
 *JIRA*
 Now: 
 Start by removing everyone from the 'Developer' role and add them to 
 'Contributor'. Right now Developer has no permission that Contributor 
 doesn't. We may reuse Developer later for some level between Committer and 
 Contributor.
 Later, with Apache admin assistance:
 - Make Component and Affects Version required for new JIRAs
 - Set default priority to Minor and type to Question for new JIRAs. If 
 defaults aren't changed, by default it can't be that important
 - Only let Committers set Target Version and Fix Version
 - Only let Committers set Blocker Priority






[jira] [Updated] (SPARK-6199) Support CTE

2015-04-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6199:
---
Assignee: (was: Cheng Hao)

 Support CTE
 ---

 Key: SPARK-6199
 URL: https://issues.apache.org/jira/browse/SPARK-6199
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: haiyang
 Fix For: 1.4.0


 Support CTE in SQLContext and HiveContext
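
For illustration, a minimal sketch (table and column names illustrative, assuming
the sqlContext available in spark-shell) of the kind of query this enables:

{code}
// A common table expression (CTE) introduced with WITH and referenced below it.
val result = sqlContext.sql("""
  WITH filtered AS (SELECT key, value FROM src WHERE key > 10)
  SELECT key, COUNT(*) AS cnt
  FROM filtered
  GROUP BY key
""")
result.show()
{code}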






[jira] [Updated] (SPARK-6858) Register Java HashMap for SparkSqlSerializer

2015-04-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6858:
---
Assignee: Liang-Chi Hsieh

 Register Java HashMap for SparkSqlSerializer
 

 Key: SPARK-6858
 URL: https://issues.apache.org/jira/browse/SPARK-6858
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
Priority: Trivial
 Fix For: 1.4.0


 Since the Kryo serializer is now used for {{GeneralHashedRelation}} whether Kryo 
 is enabled or not, it is better to register Java {{HashMap}} in 
 {{SparkSqlSerializer}}.
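
For context, a minimal sketch of the user-facing analogue of this change
(registering a class with Kryo through SparkConf); the ticket itself covers the
internal registration inside SparkSqlSerializer:

{code}
import org.apache.spark.SparkConf

// Registering classes up front lets Kryo avoid writing full class names into
// the serialized stream, which keeps the output smaller.
val conf = new SparkConf()
  .setAppName("kryo-registration-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[java.util.HashMap[_, _]]))
{code}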






[jira] [Resolved] (SPARK-4760) ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files

2015-04-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4760.

Resolution: Not A Problem

 ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size 
 for tables created from Parquet files
 --

 Key: SPARK-4760
 URL: https://issues.apache.org/jira/browse/SPARK-4760
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Jianshi Huang
Priority: Critical
 Fix For: 1.3.0


 In an older Spark version built around Oct. 12, I was able to use 
   ANALYZE TABLE table COMPUTE STATISTICS noscan
 to get estimated table size, which is important for optimizing joins. (I'm 
 joining 15 small dimension tables, and this is crucial to me).
 In the more recent Spark builds, it fails to estimate the table size unless I 
 remove noscan.
 Here's the statistics I got using DESC EXTENDED:
 old:
 parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}
 new:
 parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, 
 COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
 And I've tried turning off spark.sql.hive.convertMetastoreParquet in my 
 spark-defaults.conf and the result is unaffected (in both versions).
 Looks like the Parquet support in new Hive (0.13.1) is broken?
 Jianshi






[jira] [Updated] (SPARK-6611) Add support for INTEGER as synonym of INT to DDLParser

2015-04-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6611:
---
Assignee: Santiago M. Mola

 Add support for INTEGER as synonym of INT to DDLParser
 --

 Key: SPARK-6611
 URL: https://issues.apache.org/jira/browse/SPARK-6611
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Santiago M. Mola
Assignee: Santiago M. Mola
Priority: Minor
 Fix For: 1.4.0


 Add support for INTEGER as synonym of INT to DDLParser.
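
For illustration, a minimal sketch (data source and path illustrative) of data
source DDL that should accept INTEGER once the synonym is recognized:

{code}
// DDLParser previously accepted INT here but not INTEGER.
sqlContext.sql("""
  CREATE TEMPORARY TABLE people (name STRING, age INTEGER)
  USING org.apache.spark.sql.json
  OPTIONS (path '/tmp/people.json')
""")
{code}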






[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-04-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491863#comment-14491863
 ] 

Patrick Wendell commented on SPARK-1529:


Hey Kannan,

We originally considered doing something like you are proposing, where we would 
change our filesystem interactions to all use a Hadoop FileSystem class and 
then we'd use Hadoop's LocalFileSystem. However, there were two issues:

1. We used POSIX APIs that are not present in Hadoop. For instance, we use 
memory mapping in various places, FileChannel in the BlockObjectWriter, etc.
2. Using LocalFileSystem has substantial performance overhead compared with 
our current code, so we'd have to write our own implementation of a local 
filesystem.

For this reason, we decided that our current shuffle machinery was 
fundamentally not usable in non-POSIX environments. Instead, we let people 
customize shuffle behavior at a higher level by implementing the pluggable 
shuffle components. So you can create a shuffle manager that is specifically 
optimized for a particular environment (e.g. MapR).

Did you consider implementing a MapR shuffle using that mechanism instead? 
You'd have to operate at a higher level, where you reason about shuffle 
records, etc. But you'd have a lot of flexibility to optimize within that.
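
For reference, a minimal sketch of the configuration hook involved (the class name
below is hypothetical): spark.shuffle.manager can name a custom ShuffleManager
implementation by its fully qualified class name.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Plug in a custom ShuffleManager implementation (hypothetical class name).
val conf = new SparkConf()
  .setAppName("custom-shuffle-sketch")
  .set("spark.shuffle.manager", "com.mapr.spark.shuffle.MapRShuffleManager")

val sc = new SparkContext(conf)
{code}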

 Support setting spark.local.dirs to a hadoop FileSystem 
 

 Key: SPARK-1529
 URL: https://issues.apache.org/jira/browse/SPARK-1529
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Kannan Rajah
 Attachments: Spark Shuffle using HDFS.pdf


 In some environments, like with MapR, local volumes are accessed through the 
 Hadoop filesystem interface. We should allow setting spark.local.dir to a 
 Hadoop filesystem location. 






[jira] [Reopened] (SPARK-4760) ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files

2015-04-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-4760:


 ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size 
 for tables created from Parquet files
 --

 Key: SPARK-4760
 URL: https://issues.apache.org/jira/browse/SPARK-4760
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Jianshi Huang
Priority: Critical
 Fix For: 1.3.0


 In an older Spark version built around Oct. 12, I was able to use 
   ANALYZE TABLE table COMPUTE STATISTICS noscan
 to get estimated table size, which is important for optimizing joins. (I'm 
 joining 15 small dimension tables, and this is crucial to me).
 In the more recent Spark builds, it fails to estimate the table size unless I 
 remove noscan.
 Here's the statistics I got using DESC EXTENDED:
 old:
 parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}
 new:
 parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, 
 COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
 And I've tried turning off spark.sql.hive.convertMetastoreParquet in my 
 spark-defaults.conf and the result is unaffected (in both versions).
 Looks like the Parquet support in new Hive (0.13.1) is broken?
 Jianshi






[jira] [Updated] (SPARK-6179) Support SHOW PRINCIPALS role_name;

2015-04-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6179:
---
Assignee: Zhongshuai Pei

 Support SHOW PRINCIPALS role_name;
 

 Key: SPARK-6179
 URL: https://issues.apache.org/jira/browse/SPARK-6179
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.1
Reporter: Zhongshuai Pei
Assignee: Zhongshuai Pei
 Fix For: 1.4.0


 SHOW PRINCIPALS role_name;
 Lists all roles and users who belong to this role.
 Only the admin role has privilege for this.
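
For illustration, a minimal sketch (role name illustrative, assuming an existing
HiveContext named hiveContext) of issuing the statement once it is supported:

{code}
// Lists the users and roles that belong to the given role; per the description
// above, this requires the Hive admin role.
val principals = hiveContext.sql("SHOW PRINCIPALS etl_admins")
principals.show()
{code}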






[jira] [Updated] (SPARK-6199) Support CTE

2015-04-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6199:
---
Assignee: Cheng Hao

 Support CTE
 ---

 Key: SPARK-6199
 URL: https://issues.apache.org/jira/browse/SPARK-6199
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: haiyang
Assignee: Cheng Hao
 Fix For: 1.4.0


 Support CTE in SQLContext and HiveContext






[jira] [Updated] (SPARK-6863) Formatted list broken on Hive compatibility section of SQL programming guide

2015-04-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6863:
---
Assignee: Santiago M. Mola

 Formatted list broken on Hive compatibility section of SQL programming guide
 

 Key: SPARK-6863
 URL: https://issues.apache.org/jira/browse/SPARK-6863
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Santiago M. Mola
Assignee: Santiago M. Mola
Priority: Trivial
 Fix For: 1.3.1, 1.4.0


 Formatted list broken on Hive compatibility section of SQL programming guide. 
 It does not appear as a list.






Re: [VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-08 Thread Patrick Wendell
Hey Denny,

I believe the 2.4 bits are there. The 2.6 bits I had done specially
(we haven't merged that into our upstream build script). I'll do it
again now for RC2.

- Patrick

On Wed, Apr 8, 2015 at 1:53 PM, Timothy Chen tnac...@gmail.com wrote:
 +1 Tested on 4 nodes Mesos cluster with fine-grain and coarse-grain mode.

 Tim

 On Wed, Apr 8, 2015 at 9:32 AM, Denny Lee denny.g@gmail.com wrote:
 The RC2 bits are lacking Hadoop 2.4 and Hadoop 2.6 - was that intended
 (they were included in RC1)?


 On Wed, Apr 8, 2015 at 9:01 AM Tom Graves tgraves...@yahoo.com.invalid
 wrote:

 +1. Tested spark on yarn against hadoop 2.6.
 Tom


  On Wednesday, April 8, 2015 6:15 AM, Sean Owen so...@cloudera.com
 wrote:


  Still a +1 from me; same result (except that now of course the
 UISeleniumSuite test does not fail)

 On Wed, Apr 8, 2015 at 1:46 AM, Patrick Wendell pwend...@gmail.com
 wrote:
  Please vote on releasing the following candidate as Apache Spark version
 1.3.1!
 
  The tag to be voted on is v1.3.1-rc2 (commit 7c4473a):
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
 7c4473aa5a7f5de0323394aaedeefbf9738e8eb5
 
  The list of fixes present in this release can be found at:
  http://bit.ly/1C2nVPY
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.3.1-rc2/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1083/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/
 
  The patches on top of RC1 are:
 
  [SPARK-6737] Fix memory leak in OutputCommitCoordinator
  https://github.com/apache/spark/pull/5397
 
  [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py
  https://github.com/apache/spark/pull/5302
 
  [SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with
  NoClassDefFoundError
  https://github.com/apache/spark/pull/4933
 
  Please vote on releasing this package as Apache Spark 1.3.1!
 
  The vote is open until Saturday, April 11, at 07:00 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.3.1
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org








Re: [VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-08 Thread Patrick Wendell
Oh I see - ah okay I'm guessing it was a transient build error and
I'll get it posted ASAP.

On Wed, Apr 8, 2015 at 3:41 PM, Denny Lee denny.g@gmail.com wrote:
 Oh, it appears the 2.4 bits without hive are there but not the 2.4 bits with
 hive. Cool stuff on the 2.6.
 On Wed, Apr 8, 2015 at 12:30 Patrick Wendell pwend...@gmail.com wrote:

 Hey Denny,

 I believe the 2.4 bits are there. The 2.6 bits I had done specially
 (we haven't merged that into our upstream build script). I'll do it
 again now for RC2.

 - Patrick

 On Wed, Apr 8, 2015 at 1:53 PM, Timothy Chen tnac...@gmail.com wrote:
  +1 Tested on 4 nodes Mesos cluster with fine-grain and coarse-grain
  mode.
 
  Tim
 
  On Wed, Apr 8, 2015 at 9:32 AM, Denny Lee denny.g@gmail.com wrote:
  The RC2 bits are lacking Hadoop 2.4 and Hadoop 2.6 - was that intended
  (they were included in RC1)?
 
 
  On Wed, Apr 8, 2015 at 9:01 AM Tom Graves
  tgraves...@yahoo.com.invalid
  wrote:
 
  +1. Tested spark on yarn against hadoop 2.6.
  Tom
 
 
   On Wednesday, April 8, 2015 6:15 AM, Sean Owen
  so...@cloudera.com
  wrote:
 
 
   Still a +1 from me; same result (except that now of course the
  UISeleniumSuite test does not fail)
 
  On Wed, Apr 8, 2015 at 1:46 AM, Patrick Wendell pwend...@gmail.com
  wrote:
   Please vote on releasing the following candidate as Apache Spark
   version
  1.3.1!
  
   The tag to be voted on is v1.3.1-rc2 (commit 7c4473a):
   https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
  7c4473aa5a7f5de0323394aaedeefbf9738e8eb5
  
   The list of fixes present in this release can be found at:
   http://bit.ly/1C2nVPY
  
   The release files, including signatures, digests, etc. can be found
   at:
   http://people.apache.org/~pwendell/spark-1.3.1-rc2/
  
   Release artifacts are signed with the following key:
   https://people.apache.org/keys/committer/pwendell.asc
  
   The staging repository for this release can be found at:
  
   https://repository.apache.org/content/repositories/orgapachespark-1083/
  
   The documentation corresponding to this release can be found at:
   http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/
  
   The patches on top of RC1 are:
  
   [SPARK-6737] Fix memory leak in OutputCommitCoordinator
   https://github.com/apache/spark/pull/5397
  
   [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py
   https://github.com/apache/spark/pull/5302
  
   [SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with
   NoClassDefFoundError
   https://github.com/apache/spark/pull/4933
  
   Please vote on releasing this package as Apache Spark 1.3.1!
  
   The vote is open until Saturday, April 11, at 07:00 UTC and passes
   if a majority of at least 3 +1 PMC votes are cast.
  
   [ ] +1 Release this package as Apache Spark 1.3.1
   [ ] -1 Do not release this package because ...
  
   To learn more about Apache Spark, please see
   http://spark.apache.org/
  
  
 
 
 
 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Resolved] (SPARK-6792) pySpark groupByKey returns rows with the same key

2015-04-08 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-6792.

Resolution: Not A Problem

Resolving per Josh's comment.

 pySpark groupByKey returns rows with the same key
 -

 Key: SPARK-6792
 URL: https://issues.apache.org/jira/browse/SPARK-6792
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0
Reporter: Charles Hayden

 Under some circumstances, pySpark groupByKey returns two or more rows with 
 the same groupby key.
 It is not reproducible by a short example, but it can be seen in the 
 following program.
 The preservesPartitioning argument is required to see the failure.
 I ran this  with cluster_url=local[4], but I think it will also show up with 
 cluster_url=local.
 =
 {noformat}
 # The RDD.groupByKey sometimes gives two results with the same   key 
 value.  This is incorrect: all results with a single key need to be grouped 
 together.
 # Report the spark version
 from pyspark import SparkContext
 import StringIO
 import csv
 sc = SparkContext()
 print sc.version
 def loadRecord(line):
 input = StringIO.StringIO(line)
 reader = csv.reader(input, delimiter='\t')
 return reader.next()
 # Read data from movielens dataset
 # This can be obtained from 
 http://files.grouplens.org/datasets/movielens/ml-100k.zip
 inputFile = 'u.data'
 input = sc.textFile(inputFile)
 data = input.map(loadRecord)
 # Trim off unneeded fields
 data = data.map(lambda row: row[0:2])
 print 'Data Sample'
 print data.take(10)
 # Use join to filter the data
 #
 # map builds left key
 # map builds right key
 # join
 # map throws away the key and gets result
 # pick a couple of users
 j = sc.parallelize([789, 939])
 # key left
 # conversion to str is required to show the error
 keyed_j = j.map(lambda row: (str(row), None))
 # key right
 keyed_rdd = data.map(lambda row: (str(row[0]), row))
 # join
 joined = keyed_rdd.join(keyed_j)
 # throw away key
 # preservesPartitioning is required to show the error
 res = joined.map(lambda row: row[1][0], preservesPartitioning=True)
 #res = joined.map(lambda row: row[1][0])  # no error
 print 'Filtered Sample'
 print res.take(10)
 #print res.count()
 # Do the groupby
 # There should be fewer rows
 keyed_rdd = res.map(lambda row: (row[1], row), 
 preservesPartitioning=True)
 print 'Input Count', keyed_rdd.count()
 grouped_rdd = keyed_rdd.groupByKey()
 print 'Grouped Count', grouped_rdd.count()
 # There are two rows with the same key !
 print 'Group Output Sample'
 print grouped_rdd.filter(lambda row: row[0] == '508').take(10)
 {noformat}
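
A hedged Scala illustration of one plausible reading of the resolution (an assumption, since Josh's comment is not quoted here): preservesPartitioning asserts that the previous partitioner still applies, so if the key has actually changed, groupByKey can skip the shuffle and equal keys may stay in different partitions.

{code}
import org.apache.spark.{HashPartitioner, SparkContext}

// Sketch only: demonstrates how claiming preservesPartitioning while changing
// the key lets groupByKey skip the shuffle, leaving duplicate keys.
val sc = new SparkContext("local[4]", "groupByKey-demo")
val base = sc.parallelize(1 to 1000)
  .map(i => (i.toString, i))
  .partitionBy(new HashPartitioner(4))

// WRONG: the key changes (i -> i % 10) but we claim the old partitioning holds.
val rekeyedWrong = base.mapPartitions(
  iter => iter.map { case (_, v) => ((v % 10).toString, v) },
  preservesPartitioning = true)
rekeyedWrong.groupByKey().count()   // can report more than 10 groups

// RIGHT: let the partitioner be dropped, forcing a shuffle before grouping.
val rekeyedRight = base.map { case (_, v) => ((v % 10).toString, v) }
rekeyedRight.groupByKey().count()   // exactly 10 groups
{code}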



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6785) DateUtils can not handle date before 1970/01/01 correctly

2015-04-08 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6785:
---
Component/s: SQL

 DateUtils can not handle date before 1970/01/01 correctly
 -

 Key: SPARK-6785
 URL: https://issues.apache.org/jira/browse/SPARK-6785
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu

 {code}
 scala val d = new Date(100)
 d: java.sql.Date = 1969-12-31
 scala DateUtils.toJavaDate(DateUtils.fromJavaDate(d))
 res1: java.sql.Date = 1970-01-01
 {code}
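
A minimal sketch (an assumption about the kind of fix needed, not Spark's actual DateUtils code) of a floor-division-based conversion: truncating division rounds negative millisecond values toward zero, which is what shifts pre-1970 dates forward by a day.

{code}
import java.sql.Date
import java.util.TimeZone

// Hypothetical helper, not Spark's DateUtils: round-trips java.sql.Date through
// days-since-epoch using floor division so dates before 1970-01-01 (negative
// millisecond values) are preserved.
object DateDaysSketch {
  private val MillisPerDay = 86400000L

  def fromJavaDate(d: Date): Int = {
    // Shift to local time so the calendar day the user sees is preserved.
    val millisLocal = d.getTime + TimeZone.getDefault.getOffset(d.getTime)
    Math.floorDiv(millisLocal, MillisPerDay).toInt   // Math.floorDiv requires Java 8
  }

  def toJavaDate(days: Int): Date = {
    val millisLocal = days.toLong * MillisPerDay
    // Approximation: uses the timezone offset at the reconstructed instant.
    new Date(millisLocal - TimeZone.getDefault.getOffset(millisLocal))
  }
}
{code}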



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6778) SQL contexts in spark-shell and pyspark should both be called sqlContext

2015-04-08 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-6778.

Resolution: Duplicate

 SQL contexts in spark-shell and pyspark should both be called sqlContext
 

 Key: SPARK-6778
 URL: https://issues.apache.org/jira/browse/SPARK-6778
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Shell
Reporter: Matei Zaharia

 For some reason the Python one is only called sqlCtx. This is pretty 
 confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6399) Code compiled against 1.3.0 may not run against older Spark versions

2015-04-08 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14486595#comment-14486595
 ] 

Patrick Wendell commented on SPARK-6399:


It would be good to document more clearly what compatibility we intend to 
provide. I am not so sure that forward compatibility is a stated or necessary 
goal for binary interfaces. I think we should just provide backwards 
compatibility for those interfaces (though in practice these will almost always 
be the same except for some issues like this with implicits).

The main area we've had really strict enforcement of forward compatibility has 
been around the serialization format of JSON logs, since we want it to be easy 
for people to use the Spark history server with newer versions of Spark in a 
multi-tenant cluster.

 Code compiled against 1.3.0 may not run against older Spark versions
 

 Key: SPARK-6399
 URL: https://issues.apache.org/jira/browse/SPARK-6399
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Spark Core
Affects Versions: 1.3.0
Reporter: Marcelo Vanzin

 Commit 65b987c3 re-organized the implicit conversions of RDDs so that they're 
 easier to use. The problem is that scalac now generates code that will not 
 run on older Spark versions if those conversions are used.
 Basically, even if you explicitly import {{SparkContext._}}, scalac will 
 generate references to the new methods in the {{RDD}} object instead. So the 
 compiled code will reference code that doesn't exist in older versions of 
 Spark.
 You can work around this by explicitly calling the methods in the 
 {{SparkContext}} object, although that's a little ugly.
 We should at least document this limitation (if there's no way to fix it), 
 since I believe forwards compatibility in the API was also a goal.
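
A hedged sketch of the workaround described above (assuming the 1.x conversion methods on the SparkContext companion object): call the conversion explicitly instead of relying on the implicit, so the compiled bytecode links against a method that also exists in older releases.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch only: instead of relying on the implicit conversion (which scalac may
// resolve against the new methods on the RDD object in 1.3.0), invoke the
// SparkContext conversion explicitly so older Spark versions can still link it.
def wordCount(lines: RDD[String]): RDD[(String, Int)] = {
  val pairs = lines.flatMap(_.split("\\s+")).map(word => (word, 1))
  SparkContext.rddToPairRDDFunctions(pairs).reduceByKey(_ + _)
}
{code}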



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6784) Clean up all the inbound/outbound conversions for DateType

2015-04-08 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6784:
---
Component/s: SQL

 Clean up all the inbound/outbound conversions for DateType
 --

 Key: SPARK-6784
 URL: https://issues.apache.org/jira/browse/SPARK-6784
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Priority: Blocker

 We changed the JvmType of DateType to Int, but there are still some places 
 putting java.sql.Date into Row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6783) Add timing and test output for PR tests

2015-04-08 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6783:
---
Component/s: Project Infra

 Add timing and test output for PR tests
 ---

 Key: SPARK-6783
 URL: https://issues.apache.org/jira/browse/SPARK-6783
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Affects Versions: 1.3.0
Reporter: Brennon York

 Currently the PR tests that run under {{dev/tests/*}} do not provide any 
 output within the actual Jenkins run. It would be nice to not only have error 
 output, but also timing results from each test and have those surfaced within 
 the Jenkins output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.1

2015-04-07 Thread Patrick Wendell
Hey All,

Today SPARK-6737 came to my attention. This is a bug that causes a
memory leak for any long running program that repeatedly saves data
out to a Hadoop FileSystem. For that reason, it is problematic for
Spark Streaming.

My sense is that this is severe enough to cut another RC once the fix
is merged (which is imminent):

https://issues.apache.org/jira/browse/SPARK-6737

I'll leave a bit of time for others to comment, in particular if
people feel we should not wait for this fix.

- Patrick

On Tue, Apr 7, 2015 at 2:34 PM, Marcelo Vanzin van...@cloudera.com wrote:
 +1 (non-binding)

 Ran standalone and yarn tests on the hadoop-2.6 tarball, with and
 without the external shuffle service in yarn mode.

 On Sat, Apr 4, 2015 at 5:09 PM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.3.1!

 The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc4e530cbb69851

 The list of fixes present in this release can be found at:
 http://bit.ly/1C2nVPY

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1080

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.3.1!

 The vote is open until Wednesday, April 08, at 01:10 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.3.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 - Patrick





 --
 Marcelo



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[RESULT] [VOTE] Release Apache Spark 1.3.1

2015-04-07 Thread Patrick Wendell
This vote is cancelled in favor of RC2.

On Tue, Apr 7, 2015 at 8:13 PM, Josh Rosen rosenvi...@gmail.com wrote:
 The leak will impact long running streaming jobs even if they don't write 
 Hadoop files, although the problem may take much longer to manifest itself 
 for those jobs.

 I think we currently leak an empty HashMap per stage submitted in the common 
 case, so it could take a very long time for this to trigger an OOM.  On the 
 other hand, the worst case behavior is quite bad for streaming jobs, so we 
 should probably fix this so that 1.2.x streaming users can more safely 
 upgrade to 1.3.x.

 - Josh

 Sent from my phone

 On Apr 7, 2015, at 4:13 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey All,

 Today SPARK-6737 came to my attention. This is a bug that causes a
 memory leak for any long running program that repeatedly saves data
 out to a Hadoop FileSystem. For that reason, it is problematic for
 Spark Streaming.

 My sense is that this is severe enough to cut another RC once the fix
 is merged (which is imminent):

 https://issues.apache.org/jira/browse/SPARK-6737

 I'll leave a bit of time for others to comment, in particular if
 people feel we should not wait for this fix.

 - Patrick

 On Tue, Apr 7, 2015 at 2:34 PM, Marcelo Vanzin van...@cloudera.com wrote:
 +1 (non-binding)

 Ran standalone and yarn tests on the hadoop-2.6 tarball, with and
 without the external shuffle service in yarn mode.

 On Sat, Apr 4, 2015 at 5:09 PM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.3.1!

 The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc4e530cbb69851

 The list of fixes present in this release can be found at:
 http://bit.ly/1C2nVPY

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1080

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.3.1!

 The vote is open until Wednesday, April 08, at 01:10 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.3.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 - Patrick




 --
 Marcelo



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-07 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.3.1!

The tag to be voted on is v1.3.1-rc2 (commit 7c4473a):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7c4473aa5a7f5de0323394aaedeefbf9738e8eb5

The list of fixes present in this release can be found at:
http://bit.ly/1C2nVPY

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1083/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/

The patches on top of RC1 are:

[SPARK-6737] Fix memory leak in OutputCommitCoordinator
https://github.com/apache/spark/pull/5397

[SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py
https://github.com/apache/spark/pull/5302

[SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with
NoClassDefFoundError
https://github.com/apache/spark/pull/4933

Please vote on releasing this package as Apache Spark 1.3.1!

The vote is open until Saturday, April 11, at 07:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.3.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Updated] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed

2015-04-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6222:
---
Fix Version/s: 1.4.0
   1.3.1

 [STREAMING] All data may not be recovered from WAL when driver is killed
 

 Key: SPARK-6222
 URL: https://issues.apache.org/jira/browse/SPARK-6222
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Hari Shreedharan
Priority: Blocker
 Fix For: 1.3.1, 1.4.0

 Attachments: AfterPatch.txt, CleanWithoutPatch.txt, SPARK-6122.patch


 When testing for our next release, our internal tests written by [~wypoon] 
 caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs 
 FlumePolling stream to read data from Flume, then kills the Application 
 Master. Once YARN restarts it, the test waits until no more data is to be 
 written and verifies the original against the data on HDFS. This was passing 
 in 1.2.0, but is failing now.
 Since the test ties into Cloudera's internal infrastructure and build 
 process, it cannot be directly run on an Apache build. But I have been 
 working on isolating the commit that may have caused the regression. I have 
 confirmed that it was caused by SPARK-5147 (PR # 
 [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several 
 times using the test and the failure is consistently reproducible. 
 To re-confirm, I reverted just this one commit (and Clock consolidation one 
 to avoid conflicts), and the issue was no longer reproducible.
 Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0
 /cc [~tdas], [~pwendell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Patrick Wendell
I believe TD just forgot to set the fix version on the JIRA. There is
a fix for this in 1.3:

https://github.com/apache/spark/commit/03e263f5b527cf574f4ffcd5cd886f7723e3756e

- Patrick

On Mon, Apr 6, 2015 at 2:31 PM, Mark Hamstra m...@clearstorydata.com wrote:
 Is that correct, or is the JIRA just out of sync, since TD's PR was merged?
 https://github.com/apache/spark/pull/5008

 On Mon, Apr 6, 2015 at 11:10 AM, Hari Shreedharan
 hshreedha...@cloudera.com wrote:

 It does not look like https://issues.apache.org/jira/browse/SPARK-6222
 made it. It was targeted towards this release.




 Thanks, Hari

 On Mon, Apr 6, 2015 at 11:04 AM, York, Brennon
 brennon.y...@capitalone.com wrote:

  +1 (non-binding)
  Tested GraphX, build infrastructure,  core test suite on OSX 10.9 w/
  Java
  1.7/1.8
  On 4/6/15, 5:21 AM, Sean Owen so...@cloudera.com wrote:
 SPARK-6673 is not, in the end, relevant for 1.3.x I believe; we just
 resolved it for 1.4 anyway. False alarm there.
 
 I back-ported SPARK-6205 into the 1.3 branch for next time. We'll pick
 it up if there's another RC, but by itself is not something that needs
 a new RC. (I will give the same treatment to branch 1.2 if needed in
 light of the 1.2.2 release.)
 
 I applied the simple change in SPARK-6205 in order to continue
 executing tests and all was well. I still see a few failures in Hive
 tests:
 
 - show_create_table_serde *** FAILED ***
 - show_tblproperties *** FAILED ***
 - udf_std *** FAILED ***
 - udf_stddev *** FAILED ***
 
 with ...
 
 mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0
 -DskipTests clean package; mvn -Phadoop-2.4 -Pyarn -Phive
 -Phive-0.13.1 -Dhadoop.version=2.6.0 test
 
 ... but these are not regressions from 1.3.0.
 
 +1 from me at this point on the current artifacts.
 
 On Sun, Apr 5, 2015 at 9:24 AM, Sean Owen so...@cloudera.com wrote:
  Signatures and hashes are good.
  LICENSE, NOTICE still check out.
  Compiles for a Hadoop 2.6 + YARN + Hive profile.
 
  I still see the UISeleniumSuite test failure observed in 1.3.0, which
  is minor and already fixed. I don't know why I didn't back-port it:
  https://issues.apache.org/jira/browse/SPARK-6205
 
  If we roll another, let's get this easy fix in, but it is only an
  issue with tests.
 
 
  On JIRA, I checked open issues with Fix Version = 1.3.0 or 1.3.1 and
  all look legitimate (e.g. reopened or in progress)
 
 
  There is 1 open Blocker for 1.3.1 per Andrew:
  https://issues.apache.org/jira/browse/SPARK-6673 spark-shell.cmd can't
  start even when spark was built in Windows
 
  I believe this can be resolved quickly but as a matter of hygiene
  should be fixed or demoted before release.
 
 
  FYI there are 16 Critical issues marked for 1.3.0 / 1.3.1; worth
  examining before release to see how critical they are:
 
  SPARK-6701,Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python
  application,,Open,4/3/15
  SPARK-6484,Ganglia metrics xml reporter doesn't escape
  correctly,Josh Rosen,Open,3/24/15
  SPARK-6270,Standalone Master hangs when streaming job
 completes,,Open,3/11/15
  SPARK-6209,ExecutorClassLoader can leak connections after failing to
  load classes from the REPL class server,Josh Rosen,In Progress,4/2/15
  SPARK-5113,Audit and document use of hostnames and IP addresses in
  Spark,,Open,3/24/15
  SPARK-5098,Number of running tasks become negative after tasks
  lost,,Open,1/14/15
  SPARK-4925,Publish Spark SQL hive-thriftserver maven artifact,Patrick
  Wendell,Reopened,3/23/15
  SPARK-4922,Support dynamic allocation for coarse-grained
 Mesos,,Open,3/31/15
  SPARK-4888,Spark EC2 doesn't mount local disks for i2.8xlarge
  instances,,Open,1/27/15
  SPARK-4879,Missing output partitions after job completes with
  speculative execution,Josh Rosen,Open,3/5/15
  SPARK-4751,Support dynamic allocation for standalone mode,Andrew
  Or,Open,12/22/14
  SPARK-4454,Race condition in DAGScheduler,Josh Rosen,Reopened,2/18/15
  SPARK-4452,Shuffle data structures can starve others on the same
  thread for memory,Tianshuo Deng,Open,1/24/15
  SPARK-4352,Incorporate locality preferences in dynamic allocation
  requests,,Open,1/26/15
  SPARK-4227,Document external shuffle service,,Open,3/23/15
  SPARK-3650,Triangle Count handles reverse edges
 incorrectly,,Open,2/23/15
 
  On Sun, Apr 5, 2015 at 1:09 AM, Patrick Wendell pwend...@gmail.com
 wrote:
  Please vote on releasing the following candidate as Apache Spark
 version 1.3.1!
 
  The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):
 

  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f3
 1b713ed90bcec63ebc4e530cbb69851
 
  The list of fixes present in this release can be found at:
  http://bit.ly/1C2nVPY
 
  The release files, including signatures, digests, etc. can be found
  at:
  http://people.apache.org/~pwendell/spark-1.3.1-rc1/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
The issue is that if you invoke build/mvn, it will start zinc again
if it sees that it has been killed.

The absolute most sterile thing to do is this:
1. Kill any zinc processes.
2. Clean up Spark with "git clean -fdx" (WARNING: this will delete any
staged changes you have, if you have code modifications or extra files
around).
3. Run the 2.11 script to change the versions.
4. Run "mvn package" with the Maven that you installed on your machine.


On Mon, Apr 6, 2015 at 10:43 PM, Marty Bower sp...@mjhb.com wrote:
 I'm killing zinc (if it's running) before running each build attempt.

 Trying to build as clean as possible.


 On Mon, Apr 6, 2015 at 7:31 PM Patrick Wendell pwend...@gmail.com wrote:

 What if you don't run zinc? I.e. just download Maven and run that "mvn
 package". It might take longer, but I wonder if it will work.

 On Mon, Apr 6, 2015 at 10:26 PM, mjhb sp...@mjhb.com wrote:
  Similar problem on 1.2 branch:
 
  [ERROR] Failed to execute goal on project spark-core_2.11: Could not
  resolve
  dependencies for project
  org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following
  artifacts
  could not be resolved:
  org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT,
  org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure
  to
  find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in
  http://repository.apache.org/snapshots was cached in the local
  repository,
  resolution will not be reattempted until the update interval of
  apache.snapshots has elapsed or updates are forced - [Help 1]
  org.apache.maven.lifecycle.LifecycleExecutionException: Failed to
  execute
  goal on project spark-core_2.11: Could not resolve dependencies for
  project
  org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following
  artifacts
  could not be resolved:
  org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT,
  org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure
  to
  find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in
  http://repository.apache.org/snapshots was cached in the local
  repository,
  resolution will not be reattempted until the update interval of
  apache.snapshots has elapsed or updates are forced
 
 
 
 
  --
  View this message in context:
  http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11442.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
One thing that I think can cause issues is if you run build/mvn with
Scala 2.10, then try to run it with 2.11, since I think we may store
some downloaded jars relating to zinc that will get screwed up. Not
sure that's what is happening, just an idea.

On Mon, Apr 6, 2015 at 10:54 PM, Patrick Wendell pwend...@gmail.com wrote:
 The issue is that if you invoke build/mvn, it will start zinc again
 if it sees that it has been killed.

 The absolute most sterile thing to do is this:
 1. Kill any zinc processes.
 2. Clean up Spark with "git clean -fdx" (WARNING: this will delete any
 staged changes you have, if you have code modifications or extra files
 around).
 3. Run the 2.11 script to change the versions.
 4. Run "mvn package" with the Maven that you installed on your machine.


 On Mon, Apr 6, 2015 at 10:43 PM, Marty Bower sp...@mjhb.com wrote:
 I'm killing zinc (if it's running) before running each build attempt.

 Trying to build as clean as possible.


 On Mon, Apr 6, 2015 at 7:31 PM Patrick Wendell pwend...@gmail.com wrote:

 What if you don't run zinc? I.e. just download Maven and run that "mvn
 package". It might take longer, but I wonder if it will work.

 On Mon, Apr 6, 2015 at 10:26 PM, mjhb sp...@mjhb.com wrote:
  Similar problem on 1.2 branch:
 
  [ERROR] Failed to execute goal on project spark-core_2.11: Could not
  resolve
  dependencies for project
  org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following
  artifacts
  could not be resolved:
  org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT,
  org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure
  to
  find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in
  http://repository.apache.org/snapshots was cached in the local
  repository,
  resolution will not be reattempted until the update interval of
  apache.snapshots has elapsed or updates are forced - [Help 1]
  org.apache.maven.lifecycle.LifecycleExecutionException: Failed to
  execute
  goal on project spark-core_2.11: Could not resolve dependencies for
  project
  org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following
  artifacts
  could not be resolved:
  org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT,
  org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure
  to
  find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in
  http://repository.apache.org/snapshots was cached in the local
  repository,
  resolution will not be reattempted until the update interval of
  apache.snapshots has elapsed or updates are forced
 
 
 
 
  --
  View this message in context:
  http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11442.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
What if you don't run zinc? I.e. just download Maven and run that "mvn
package". It might take longer, but I wonder if it will work.

On Mon, Apr 6, 2015 at 10:26 PM, mjhb sp...@mjhb.com wrote:
 Similar problem on 1.2 branch:

 [ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve
 dependencies for project
 org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following artifacts
 could not be resolved:
 org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT,
 org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure to
 find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in
 http://repository.apache.org/snapshots was cached in the local repository,
 resolution will not be reattempted until the update interval of
 apache.snapshots has elapsed or updates are forced - [Help 1]
 org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
 goal on project spark-core_2.11: Could not resolve dependencies for project
 org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following artifacts
 could not be resolved:
 org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT,
 org.apache.spark:spark-network-shuffle_2.10:jar:1.2.3-SNAPSHOT: Failure to
 find org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT in
 http://repository.apache.org/snapshots was cached in the local repository,
 resolution will not be reattempted until the update interval of
 apache.snapshots has elapsed or updates are forced




 --
 View this message in context: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11442.html
 Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
The only thing that can persist outside of Spark is a still-live Zinc
process. We took care to make sure this is a generally stateless
mechanism.

Both the 1.2.x and 1.3.x releases are built with Scala 2.11 for
packaging purposes, and these have been built as recently as the last
few days, since we are voting on 1.2.2 and 1.3.1. However, there
could be issues that only affect certain environments.

- Patrick

On Mon, Apr 6, 2015 at 11:02 PM, mjhb sp...@mjhb.com wrote:
 I resorted to deleting the spark directory between each build earlier today
 (attempting maximum sterility) and then re-cloning from github and switching
 to the 1.2 or 1.3 branch.

 Does anything persist outside of the spark directory?

 Are you able to build either 1.2 or 1.3 w/ Scala-2.11?



 --
 View this message in context: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11447.html
 Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
Hmm..  Make sure you are building with the right flags. I think you need to
pass -Dscala-2.11 to maven. Take a look at the upstream docs - on my phone
now so can't easily access.
On Apr 7, 2015 1:01 AM, mjhb sp...@mjhb.com wrote:

 I even deleted my local maven repository (.m2) but still stuck when
 attempting to build w/ Scala-2.11:

 [ERROR] Failed to execute goal on project spark-core_2.11: Could not
 resolve
 dependencies for project
 org.apache.spark:spark-core_2.11:jar:1.3.2-SNAPSHOT: The following
 artifacts
 could not be resolved:
 org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT,
 org.apache.spark:spark-network-shuffle_2.10:jar:1.3.2-SNAPSHOT: Could not
 find artifact org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT
 in apache.snapshots (http://repository.apache.org/snapshots) - [Help 1]
 org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
 goal on project spark-core_2.11: Could not resolve dependencies for project
 org.apache.spark:spark-core_2.11:jar:1.3.2-SNAPSHOT: The following
 artifacts
 could not be resolved:
 org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT,
 org.apache.spark:spark-network-shuffle_2.10:jar:1.3.2-SNAPSHOT: Could not
 find artifact org.apache.spark:spark-network-common_2.10:jar:1.3.2-SNAPSHOT
 in apache.snapshots (http://repository.apache.org/snapshots)




 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/1-3-Build-Error-with-Scala-2-11-tp11441p11449.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




[jira] [Updated] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-04 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6703:
---
Description: 
Right now it is difficult to write a Spark application in a way that can be run 
independently and also be composed with other Spark applications in an 
environment such as the JobServer, notebook servers, etc where there is a 
shared SparkContext.

It would be nice to provide a rendez-vous point so that applications can learn 
whether an existing SparkContext already exists before creating one.

The most simple/surgical way I see to do this is to have an optional static 
SparkContext singleton that can be retrieved as follows:

{code}
val sc = SparkContext.getOrCreate(conf = new SparkConf())
{code}

And you could also have a setter where some outer framework/server can set it 
for use by multiple downstream applications.

A more advanced version of this would have some named registry or something, 
but since we only support a single SparkContext in one JVM at this point 
anyways, this seems sufficient and much simpler. Another advanced option would 
be to allow plugging in some other notion of configuration you'd pass when 
retrieving an existing context.


  was:
Right now it is difficult to write a Spark application in a way that can be run 
independently and also be composed with other Spark applications in an 
environment such as the JobServer, notebook servers, etc where there is a 
shared SparkContext.

It would be nice to have a way to write an application where you can get or 
create a SparkContext and have some standard type of synchronization point 
application authors can access. The most simple/surgical way I see to do this 
is to have an optional static SparkContext singleton that can be 
retrieved as follows:

{code}
val sc = SparkContext.getOrCreate(conf = new SparkConf())
{code}

And you could also have a setter where some outer framework/server can set it 
for use by multiple downstream applications.

A more advanced version of this would have some named registry or something, 
but since we only support a single SparkContext in one JVM at this point 
anyways, this seems sufficient and much simpler. Another advanced option would 
be to allow plugging in some other notion of configuration you'd pass when 
retrieving an existing context.



 Provide a way to discover existing SparkContext's
 -

 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Patrick Wendell

 Right now it is difficult to write a Spark application in a way that can be 
 run independently and also be composed with other Spark applications in an 
 environment such as the JobServer, notebook servers, etc where there is a 
 shared SparkContext.
 It would be nice to provide a rendez-vous point so that applications can 
 learn whether an existing SparkContext already exists before creating one.
 The most simple/surgical way I see to do this is to have an optional static 
 SparkContext singleton that can be retrieved as follows:
 {code}
 val sc = SparkContext.getOrCreate(conf = new SparkConf())
 {code}
 And you could also have a setter where some outer framework/server can set it 
 for use by multiple downstream applications.
 A more advanced version of this would have some named registry or something, 
 but since we only support a single SparkContext in one JVM at this point 
 anyways, this seems sufficient and much simpler. Another advanced option 
 would be to allow plugging in some other notion of configuration you'd pass 
 when retrieving an existing context.
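
A minimal sketch of the idea (hypothetical object name, not the API that was eventually added to Spark):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: an optional singleton that an outer framework/server can set,
// and that downstream applications can consult before creating a context.
object SparkContextRegistry {
  @volatile private var active: Option[SparkContext] = None

  // Setter for an outer framework (e.g. a job server) to publish the shared context.
  def set(sc: SparkContext): Unit = synchronized { active = Some(sc) }

  // Reuse the shared context if present, otherwise create and register a new one.
  def getOrCreate(conf: SparkConf): SparkContext = synchronized {
    active.getOrElse {
      val created = new SparkContext(conf)
      active = Some(created)
      created
    }
  }
}

// Usage: val sc = SparkContextRegistry.getOrCreate(new SparkConf())
{code}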



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6676) Add hadoop 2.4+ for profiles in POM.xml

2015-04-04 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395932#comment-14395932
 ] 

Patrick Wendell commented on SPARK-6676:


[~srowen] This is such a common source of confusion for users; do you think we 
should just add 2.5 and 2.6 profiles and add a note internally that they are 
duplicates of 2.4? The maintenance cost there is pretty marginal, and it might 
be a better user experience, since this is something people clearly stumble on 
regularly.

 Add hadoop 2.4+ for profiles in POM.xml
 ---

 Key: SPARK-6676
 URL: https://issues.apache.org/jira/browse/SPARK-6676
 Project: Spark
  Issue Type: Improvement
  Components: Build, Tests
Affects Versions: 1.3.0
Reporter: Zhang, Liye
Priority: Minor

 support *-Phadoop-2.5* and *-Phadoop-2.6* when building and testing Spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-03 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-6703:
--

 Summary: Provide a way to discover existing SparkContext's
 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Patrick Wendell


Right now it is difficult to write a Spark application in a way that can be run 
independently and also be composed with other Spark applications in an 
environment such as the JobServer, notebook servers, etc where there is a 
shared SparkContext.

It would be nice to have a way to write an application where you can get or 
create a SparkContext and have some standard type of synchronization point 
application authors can access. The most simple/surgical way I see to do this 
is to have an optional static SparkContext singleton that can be 
retrieved as follows:

{code}
val sc = SparkContext.getOrCreate(conf = new SparkConf())
{code}

And you could also have a setter where some outer framework/server can set it 
for use by multiple downstream applications.

A more advanced version of this would have some named registry or something, 
but since we only support a single SparkContext in one JVM at this point 
anyways, this seems sufficient and much simpler. Another advanced option would 
be to allow plugging in some other notion of configuration you'd pass when 
retrieving an existing context.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6627) Clean up of shuffle code and interfaces

2015-04-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-6627.

   Resolution: Fixed
Fix Version/s: 1.4.0

 Clean up of shuffle code and interfaces
 ---

 Key: SPARK-6627
 URL: https://issues.apache.org/jira/browse/SPARK-6627
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Critical
 Fix For: 1.4.0


 The shuffle code in Spark is somewhat messy and could use some interface 
 clean-up, especially with some larger changes outstanding. This is a catch-all 
 for what may be some small improvements in a few different PRs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6659) Spark SQL 1.3 cannot read json file that only with a record.

2015-04-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6659:
---
Component/s: SQL

 Spark SQL 1.3 cannot read json file that only with a record.
 

 Key: SPARK-6659
 URL: https://issues.apache.org/jira/browse/SPARK-6659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: luochenghui

 Dear friends:
  
 Spark SQL 1.3 cannot read a JSON file that contains only a single record.
 Here is my JSON file's content:
 {name:milo,age,24}
  
 When I run Spark SQL in local mode, it throws an exception:
 org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input 
 columns _corrupt_record;
  
 What I did:
 1  ./spark-shell
 2 
 scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 sqlContext: org.apache.spark.sql.SQLContext = 
 org.apache.spark.sql.SQLContext@5f3be6c8
  
 scala> val df = sqlContext.jsonFile("/home/milo/person.json")
 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(163705) called with 
 curMem=0, maxMem=280248975
 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 159.9 KB, free 267.1 MB)
 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(22692) called with 
 curMem=163705, maxMem=280248975
 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
 in memory (estimated size 22.2 KB, free 267.1 MB)
 15/03/19 22:11:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
 on localhost:35842 (size: 22.2 KB, free: 267.2 MB)
 15/03/19 22:11:45 INFO BlockManagerMaster: Updated info of block 
 broadcast_0_piece0
 15/03/19 22:11:45 INFO SparkContext: Created broadcast 0 from textFile at 
 JSONRelation.scala:98
 15/03/19 22:11:47 INFO FileInputFormat: Total input paths to process : 1
 15/03/19 22:11:47 INFO SparkContext: Starting job: reduce at JsonRDD.scala:51
 15/03/19 22:11:47 INFO DAGScheduler: Got job 0 (reduce at JsonRDD.scala:51) 
 with 1 output partitions (allowLocal=false)
 15/03/19 22:11:47 INFO DAGScheduler: Final stage: Stage 0(reduce at 
 JsonRDD.scala:51)
 15/03/19 22:11:47 INFO DAGScheduler: Parents of final stage: List()
 15/03/19 22:11:47 INFO DAGScheduler: Missing parents: List()
 15/03/19 22:11:47 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[3] 
 at map at JsonRDD.scala:51), which has no missing parents
 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(3184) called with 
 curMem=186397, maxMem=280248975
 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1 stored as values in 
 memory (estimated size 3.1 KB, free 267.1 MB)
 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(2251) called with 
 curMem=189581, maxMem=280248975
 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes 
 in memory (estimated size 2.2 KB, free 267.1 MB)
 15/03/19 22:11:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory 
 on localhost:35842 (size: 2.2 KB, free: 267.2 MB)
 15/03/19 22:11:47 INFO BlockManagerMaster: Updated info of block 
 broadcast_1_piece0
 15/03/19 22:11:47 INFO SparkContext: Created broadcast 1 from broadcast at 
 DAGScheduler.scala:839
 15/03/19 22:11:48 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
 (MapPartitionsRDD[3] at map at JsonRDD.scala:51)
 15/03/19 22:11:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
 15/03/19 22:11:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
 localhost, PROCESS_LOCAL, 1291 bytes)
 15/03/19 22:11:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
 15/03/19 22:11:48 INFO HadoopRDD: Input split: 
 file:/home/milo/person.json:0+26
 15/03/19 22:11:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
 mapreduce.task.id
 15/03/19 22:11:48 INFO deprecation: mapred.task.id is deprecated. Instead, 
 use mapreduce.task.attempt.id
 15/03/19 22:11:48 INFO deprecation: mapred.task.is.map is deprecated. 
 Instead, use mapreduce.task.ismap
 15/03/19 22:11:48 INFO deprecation: mapred.task.partition is deprecated. 
 Instead, use mapreduce.task.partition
 15/03/19 22:11:48 INFO deprecation: mapred.job.id is deprecated. Instead, use 
 mapreduce.job.id
 15/03/19 22:11:49 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2023 
 bytes result sent to driver
 15/03/19 22:11:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) 
 in 1209 ms on localhost (1/1)
 15/03/19 22:11:49 INFO DAGScheduler: Stage 0 (reduce at JsonRDD.scala:51) 
 finished in 1.308 s
 15/03/19 22:11:49 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
 have all completed, from pool 
 15/03/19 22:11:49 INFO DAGScheduler: Job 0 finished: reduce at 
 JsonRDD.scala:51, took 2.002429 s
 df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
  
 3  
 scala> df.select("name").show()
 15/03/19 22:12:44 INFO BlockManager

[jira] [Closed] (SPARK-6659) Spark SQL 1.3 cannot read json file that only with a record.

2015-04-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell closed SPARK-6659.
--
Resolution: Invalid

Per the comment, I think the issue is the JSON is not correctly formatted.
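
For reference, a minimal sketch of input that Spark SQL's JSON reader does accept (one complete, well-formed JSON object per line), assuming the same file path as in the report:

{code}
// person.json should contain one well-formed object per line, e.g.:
//   {"name":"milo","age":24}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)   // sc from spark-shell
val df = sqlContext.jsonFile("/home/milo/person.json")
df.select("name").show()
{code}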

 Spark SQL 1.3 cannot read json file that only with a record.
 

 Key: SPARK-6659
 URL: https://issues.apache.org/jira/browse/SPARK-6659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: luochenghui

 Dear friends:
  
 Spark SQL 1.3 cannot read a JSON file that contains only a single record.
 Here is my JSON file's content:
 {name:milo,age,24}
  
 When I run Spark SQL in local mode, it throws an exception:
 org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input 
 columns _corrupt_record;
  
 What I did:
 1  ./spark-shell
 2 
 scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 sqlContext: org.apache.spark.sql.SQLContext = 
 org.apache.spark.sql.SQLContext@5f3be6c8
  
 scala> val df = sqlContext.jsonFile("/home/milo/person.json")
 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(163705) called with 
 curMem=0, maxMem=280248975
 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 159.9 KB, free 267.1 MB)
 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(22692) called with 
 curMem=163705, maxMem=280248975
 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
 in memory (estimated size 22.2 KB, free 267.1 MB)
 15/03/19 22:11:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
 on localhost:35842 (size: 22.2 KB, free: 267.2 MB)
 15/03/19 22:11:45 INFO BlockManagerMaster: Updated info of block 
 broadcast_0_piece0
 15/03/19 22:11:45 INFO SparkContext: Created broadcast 0 from textFile at 
 JSONRelation.scala:98
 15/03/19 22:11:47 INFO FileInputFormat: Total input paths to process : 1
 15/03/19 22:11:47 INFO SparkContext: Starting job: reduce at JsonRDD.scala:51
 15/03/19 22:11:47 INFO DAGScheduler: Got job 0 (reduce at JsonRDD.scala:51) 
 with 1 output partitions (allowLocal=false)
 15/03/19 22:11:47 INFO DAGScheduler: Final stage: Stage 0(reduce at 
 JsonRDD.scala:51)
 15/03/19 22:11:47 INFO DAGScheduler: Parents of final stage: List()
 15/03/19 22:11:47 INFO DAGScheduler: Missing parents: List()
 15/03/19 22:11:47 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[3] 
 at map at JsonRDD.scala:51), which has no missing parents
 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(3184) called with 
 curMem=186397, maxMem=280248975
 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1 stored as values in 
 memory (estimated size 3.1 KB, free 267.1 MB)
 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(2251) called with 
 curMem=189581, maxMem=280248975
 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes 
 in memory (estimated size 2.2 KB, free 267.1 MB)
 15/03/19 22:11:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory 
 on localhost:35842 (size: 2.2 KB, free: 267.2 MB)
 15/03/19 22:11:47 INFO BlockManagerMaster: Updated info of block 
 broadcast_1_piece0
 15/03/19 22:11:47 INFO SparkContext: Created broadcast 1 from broadcast at 
 DAGScheduler.scala:839
 15/03/19 22:11:48 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
 (MapPartitionsRDD[3] at map at JsonRDD.scala:51)
 15/03/19 22:11:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
 15/03/19 22:11:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
 localhost, PROCESS_LOCAL, 1291 bytes)
 15/03/19 22:11:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
 15/03/19 22:11:48 INFO HadoopRDD: Input split: 
 file:/home/milo/person.json:0+26
 15/03/19 22:11:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
 mapreduce.task.id
 15/03/19 22:11:48 INFO deprecation: mapred.task.id is deprecated. Instead, 
 use mapreduce.task.attempt.id
 15/03/19 22:11:48 INFO deprecation: mapred.task.is.map is deprecated. 
 Instead, use mapreduce.task.ismap
 15/03/19 22:11:48 INFO deprecation: mapred.task.partition is deprecated. 
 Instead, use mapreduce.task.partition
 15/03/19 22:11:48 INFO deprecation: mapred.job.id is deprecated. Instead, use 
 mapreduce.job.id
 15/03/19 22:11:49 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2023 
 bytes result sent to driver
 15/03/19 22:11:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) 
 in 1209 ms on localhost (1/1)
 15/03/19 22:11:49 INFO DAGScheduler: Stage 0 (reduce at JsonRDD.scala:51) 
 finished in 1.308 s
 15/03/19 22:11:49 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
 have all completed, from pool 
 15/03/19 22:11:49 INFO DAGScheduler: Job 0 finished: reduce at 
 JsonRDD.scala:51, took 2.002429 s
 df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
  
 3

Re: Unit test logs in Jenkins?

2015-04-01 Thread Patrick Wendell
Hey Marcelo,

Great question. Right now, some of the more active developers have an
account that allows them to log into this cluster to inspect logs (we
copy the logs from each run to a node on that cluster). The
infrastructure is maintained by the AMPLab.

I will put you in touch the someone there who can get you an account.

This is a short term solution. The longer term solution is to have
these scp'd regularly to an S3 bucket or somewhere people can get
access to them, but that's not ready yet.

- Patrick



On Wed, Apr 1, 2015 at 1:01 PM, Marcelo Vanzin van...@cloudera.com wrote:
 Hey all,

 Is there a way to access unit test logs in jenkins builds? e.g.,
 core/target/unit-tests.log

 That would be really helpful to debug build failures. The scalatest
 output isn't all that helpful.

 If that's currently not available, would it be possible to add those
 logs as build artifacts?

 --
 Marcelo



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Created] (SPARK-6627) Clean up of shuffle code and interfaces

2015-03-31 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-6627:
--

 Summary: Clean up of shuffle code and interfaces
 Key: SPARK-6627
 URL: https://issues.apache.org/jira/browse/SPARK-6627
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Critical


The shuffle code in Spark is somewhat messy and could use some interface 
clean-up, especially with some larger changes outstanding. This is a catch-all 
for what may be some small improvements in a few different PRs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6561) Add partition support in saveAsParquet

2015-03-27 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383413#comment-14383413
 ] 

Patrick Wendell commented on SPARK-6561:


FYI - I just removed the Affects Versions field since that is only for bugs (to 
indicate which version has the bug).

 Add partition support in saveAsParquet
 --

 Key: SPARK-6561
 URL: https://issues.apache.org/jira/browse/SPARK-6561
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Jianshi Huang

 Now ParquetRelation2 supports automatic partition discovery which is very 
 nice. 
 When we save a DataFrame into Parquet files, we also want to have it 
 partitioned.
 The proposed API looks like this:
 {code}
 def saveAsParquetFile(path: String, partitionColumns: Seq[String])
 {code}
 Jianshi
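
A hedged sketch of a manual workaround in the meantime (hypothetical helper, not the proposed API): write one Parquet directory per distinct value of a single partition column, using the Hive-style "col=value" layout that the automatic partition discovery reads back.

{code}
import org.apache.spark.sql.DataFrame

// Sketch only: assumes the partition column has low cardinality, since its
// distinct values are collected to the driver before writing.
def saveParquetPartitionedBy(df: DataFrame, path: String, column: String): Unit = {
  val values = df.select(column).distinct().collect().map(_.get(0))
  values.foreach { v =>
    df.filter(df(column) === v)
      .saveAsParquetFile(s"$path/$column=$v")
  }
}
{code}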



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6561) Add partition support in saveAsParquet

2015-03-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6561:
---
Affects Version/s: (was: 1.3.1)
   (was: 1.3.0)

 Add partition support in saveAsParquet
 --

 Key: SPARK-6561
 URL: https://issues.apache.org/jira/browse/SPARK-6561
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Jianshi Huang

 Now ParquetRelation2 supports automatic partition discovery which is very 
 nice. 
 When we save a DataFrame into Parquet files, we also want to have it 
 partitioned.
 The proposed API looks like this:
 {code}
 def saveAsParquetFile(path: String, partitionColumns: Seq[String])
 {code}
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: RDD resiliency -- does it keep state?

2015-03-27 Thread Patrick Wendell
If you invoke this, you will get at-least-once semantics on failure.
For instance, if a machine dies in the middle of executing the foreach
for a single partition, that will be re-executed on another machine.
The foreach could even fully complete on one machine, with the machine dying
immediately before reporting the result back to the driver; in that case the
partition is re-executed as well.

This means you need to make sure the side-effects are idempotent, or
use some transactional locking. Spark's own output operations, such as
saving to Hadoop, use such mechanisms. For instance, in the case of
Hadoop it uses the OutputCommitter classes.
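
As a rough illustration (the KeyValueStore client below is made up, not a Spark
API), an idempotent version keys each write by something stable, so a replayed
partition overwrites rather than duplicates:

  // Sketch only: side-effect writes keyed by record id, safe to re-run after a failure.
  rdd.foreachPartition { iter =>
    val store = KeyValueStore.connect("host:port")                // hypothetical client
    iter.foreach(record => store.put(record.id, record.payload))  // put() overwrites, so replays are harmless
    store.close()
  }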

- Patrick

On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos michal.klo...@gmail.com wrote:
 Hi Spark group,

 We haven't been able to find clear descriptions of how Spark handles the
 resiliency of RDDs in relationship to executing actions with side-effects.
 If you do an `rdd.foreach(someSideEffect)`, then you are doing a side-effect
 for each element in the RDD. If a partition goes down -- the resiliency
 rebuilds the data, but does it keep track of how far it got in the
 partition's set of data, or will it start from the beginning again? So will
 it do at-least-once execution of foreach closures or at-most-once?

 thanks,
 Michal

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[jira] [Updated] (SPARK-6544) Problem with Avro and Kryo Serialization

2015-03-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6544:
---
Fix Version/s: 1.3.1

 Problem with Avro and Kryo Serialization
 

 Key: SPARK-6544
 URL: https://issues.apache.org/jira/browse/SPARK-6544
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0, 1.3.0
Reporter: Dean Chen
 Fix For: 1.3.1, 1.4.0


 We're running in to the following bug with Avro 1.7.6 and the Kryo serializer 
 causing jobs to fail
 https://issues.apache.org/jira/browse/AVRO-1476?focusedCommentId=13999249page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13999249
 PR here
 https://github.com/apache/spark/pull/5193



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark 1.3 Source - Github and source tar does not seem to match

2015-03-27 Thread Patrick Wendell
The source code should match the Spark commit
4aaf48d46d13129f0f9bdafd771dd80fe568a7dc. Do you see any differences?

On Fri, Mar 27, 2015 at 11:28 AM, Manoj Samel manojsamelt...@gmail.com wrote:
 While looking into an issue, I noticed that the source displayed on the GitHub
 site does not match the downloaded tar for 1.3.

 Thoughts ?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[jira] [Resolved] (SPARK-4073) Parquet+Snappy can cause significant off-heap memory usage

2015-03-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4073.

Resolution: Won't Fix

I have never seen anyone else run into this, so I'm closing it as not urgent 
enough to deal with at the moment. One way to fix this is to fix PARQUET-118 so 
that users can use on-heap buffers with Parquet.

 Parquet+Snappy can cause significant off-heap memory usage
 --

 Key: SPARK-4073
 URL: https://issues.apache.org/jira/browse/SPARK-4073
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Patrick Wendell
Priority: Critical

 The parquet snappy codec allocates off-heap buffers for decompression[1]. In 
 one case the observed size of these buffers was high enough to add several 
 GB of data to the overall virtual memory usage of the Spark executor process. 
 I don't understand enough about our use of Snappy to fully grok how much data 
 we would _expect_ to be present in these buffers at any given time, but I can 
 say a few things.
 1. The dataset had individual rows that were fairly large, e.g. megabytes.
 2. Direct buffers are not cleaned up until GC events, and overall there was 
 not much heap contention. So maybe they just weren't being cleaned.
 I opened PARQUET-118 to see if they can provide an option to use on-heap 
 buffers for decompression. In the meantime, we could consider changing the 
 default back to gzip, or we could do nothing (not sure how many other users 
 will hit this).
 [1] 
 https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/codec/SnappyDecompressor.java#L28



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5025) Write a guide for creating well-formed packages for Spark

2015-03-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5025.

Resolution: Won't Fix

I'm closing this as won't fix. There are now a bunch of community packages to 
serve as examples, so I think people can just follow those.

 Write a guide for creating well-formed packages for Spark
 -

 Key: SPARK-5025
 URL: https://issues.apache.org/jira/browse/SPARK-5025
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Patrick Wendell

 There are an increasing number of OSS projects providing utilities and 
 extensions to Spark. We should write a guide in the Spark docs that explains 
 how to create, package, and publish a third party Spark library. There are a 
 few issues here such as how to list your dependency on Spark, how to deal 
 with your own third party dependencies, etc. We should also cover how to do 
 this for Python libraries.
 In general, we should make it easy to build extension points against any of 
 Spark's API's (e.g. for new data sources, streaming receivers, ML algos, etc) 
 and self-publish libraries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1844) Support maven-style dependency resolution in sbt build

2015-03-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1844.

Resolution: Won't Fix

Closing given the combination of (a) this is not that important and (b) seems 
really hard to fix.

I have seen times where this discrepancy caused developers some trouble - I 
guess we'll just say it's part of living with the SBT build.

 Support maven-style dependency resolution in sbt build
 --

 Key: SPARK-1844
 URL: https://issues.apache.org/jira/browse/SPARK-1844
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Prashant Sharma

 [Currently this is a brainstorm/wish - not sure it's possible]
 Ivy/sbt and maven use fundamentally different strategies when transitive 
 dependencies conflict (i.e. when we have two copies of library Y in our 
 dependency graph on different versions).
 This actually means our sbt and maven builds have been divergent for a long 
 time.
 Ivy/sbt have a pluggable notion of a [conflict 
 manager|http://grepcode.com/file/repo1.maven.org/maven2/org.apache.ivy/ivy/2.3.0/org/apache/ivy/plugins/conflict/ConflictManager.java].
  The default chooses the newest version of the dependency. SBT [allows this 
 to be 
 changed|http://www.scala-sbt.org/release/sxr/sbt/IvyInterface.scala.html#sbt;ConflictManager]
  though.
 Maven employs the [nearest 
 wins|http://techidiocy.com/maven-dependency-version-conflict-problem-and-resolution/]
  policy, which means the version closest to the project root is chosen.
 It would be nice to be able to have matching semantics in the builds. We 
 could do this by writing a conflict manager in sbt that mimics Maven's 
 behavior. The fact that IVY-813 has existed for 6 years without anyone doing 
 this makes me wonder if that is not possible or very hard :P
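 For reference, a minimal build.sbt sketch of the pluggable conflict manager mentioned above 
 (an sbt 0.13 setting; actually mimicking Maven's nearest-wins would still need a custom 
 implementation):
 {code}
 // Fail the build on conflicting transitive versions instead of silently taking the newest.
 conflictManager := ConflictManager.strict
 {code}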



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2709) Add a tool for certifying Spark API compatibility

2015-03-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2709:
---
Target Version/s:   (was: 1.2.0)

 Add a tool for certifying Spark API compatibility
 

 Key: SPARK-2709
 URL: https://issues.apache.org/jira/browse/SPARK-2709
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Prashant Sharma

 As Spark is packaged by more and more distributors, it would be good to have 
 a tool that verifies the API compatibility of a provided Spark package. The tool 
 would certify that a vendor distribution of Spark contains all of the API's 
 present in a particular upstream Spark version.
 This will help vendors make sure they remain API compliant when they make 
 changes or back ports to Spark. It will also discourage vendors from 
 knowingly breaking API's, because anyone can audit their distribution and see 
 that they have removed support for certain API's.
 I'm hoping a tool like this will avoid API fragmentation in the Spark 
 community.
 One poor man's implementation of this is that a vendor can just run the 
 binary compatibility checks in the spark build against an upstream version of 
 Spark. That's a pretty good start, but it means you can't come as a third 
 party and audit a distribution.
 Another approach would be to have something where anyone can come in and 
 audit a distribution even if they don't have access to the packaging and 
 source code. That would look something like this:
 1. For each release we publish a manifest of all public API's (we might 
 borrow the MIMA string representation of byte code signatures)
 2. We package an auditing tool as a jar file.
 3. The user runs a tool with spark-submit that reflectively walks through all 
 exposed Spark API's and makes sure that everything on the manifest is 
 encountered.
 From the implementation side, this is just brainstorming at this point.
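 A very rough sketch of step 3 (all names here are made up, and a real tool would need to 
 compare full signatures rather than method names):
 {code}
 import scala.io.Source

 // Hypothetical audit loop: check that every "Class#method" entry in a manifest
 // can be found on the classpath of the distribution under test.
 object ApiAudit {
   def main(args: Array[String]): Unit = {
     val entries = Source.fromFile(args(0)).getLines().toSeq
     val missing = entries.filterNot { entry =>
       val Array(className, methodName) = entry.split("#")
       try Class.forName(className).getMethods.exists(_.getName == methodName)
       catch { case _: ClassNotFoundException => false }
     }
     missing.foreach(m => println(s"MISSING: $m"))
   }
 }
 {code}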



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2709) Add a tool for certifying Spark API compatibility

2015-03-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2709:
---
Priority: Critical  (was: Major)

 Add a tool for certifying Spark API compatibility
 

 Key: SPARK-2709
 URL: https://issues.apache.org/jira/browse/SPARK-2709
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Prashant Sharma
Priority: Critical

 As Spark is packaged by more and more distributors, it would be good to have 
 a tool that verifies the API compatibility of a provided Spark package. The tool 
 would certify that a vendor distribution of Spark contains all of the API's 
 present in a particular upstream Spark version.
 This will help vendors make sure they remain API compliant when they make 
 changes or back ports to Spark. It will also discourage vendors from 
 knowingly breaking API's, because anyone can audit their distribution and see 
 that they have removed support for certain API's.
 I'm hoping a tool like this will avoid API fragmentation in the Spark 
 community.
 One poor man's implementation of this is that a vendor can just run the 
 binary compatibility checks in the spark build against an upstream version of 
 Spark. That's a pretty good start, but it means you can't come as a third 
 party and audit a distribution.
 Another approach would be to have something where anyone can come in and 
 audit a distribution even if they don't have access to the packaging and 
 source code. That would look something like this:
 1. For each release we publish a manifest of all public API's (we might 
 borrow the MIMA string representation of byte code signatures)
 2. We package an auditing tool as a jar file.
 3. The user runs a tool with spark-submit that reflectively walks through all 
 exposed Spark API's and makes sure that everything on the manifest is 
 encountered.
 From the implementation side, this is just brainstorming at this point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-2709) Add a tool for certifying Spark API compatibility

2015-03-27 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-2709:


This came up in some recent conversations. I actually don't think we ever 
merged this into Spark, so re-opening the issue.

 Add a tool for certifying Spark API compatibility
 

 Key: SPARK-2709
 URL: https://issues.apache.org/jira/browse/SPARK-2709
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Prashant Sharma

 As Spark is packaged by more and more distributors, it would be good to have 
 a tool that verifies the API compatibility of a provided Spark package. The tool 
 would certify that a vendor distribution of Spark contains all of the API's 
 present in a particular upstream Spark version.
 This will help vendors make sure they remain API compliant when they make 
 changes or back ports to Spark. It will also discourage vendors from 
 knowingly breaking API's, because anyone can audit their distribution and see 
 that they have removed support for certain API's.
 I'm hoping a tool like this will avoid API fragmentation in the Spark 
 community.
 One poor man's implementation of this is that a vendor can just run the 
 binary compatibility checks in the spark build against an upstream version of 
 Spark. That's a pretty good start, but it means you can't come as a third 
 party and audit a distribution.
 Another approach would be to have something where anyone can come in and 
 audit a distribution even if they don't have access to the packaging and 
 source code. That would look something like this:
 1. For each release we publish a manifest of all public API's (we might 
 borrow the MIMA string representation of byte code signatures)
 2. We package an auditing tool as a jar file.
 3. The user runs a tool with spark-submit that reflectively walks through all 
 exposed Spark API's and makes sure that everything on the manifest is 
 encountered.
 From the implementation side, this is just brainstorming at this point.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6405) Spark Kryo buffer should be forced to be max. 2GB

2015-03-26 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-6405.

Resolution: Fixed
  Assignee: Matthew Cheah

 Spark Kryo buffer should be forced to be max. 2GB
 -

 Key: SPARK-6405
 URL: https://issues.apache.org/jira/browse/SPARK-6405
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Matt Cheah
Assignee: Matthew Cheah
 Fix For: 1.4.0


 Kryo buffers used in serialization are backed by Java byte arrays, which have 
 a maximum size of 2GB. However, we blindly set the size without checking for 
 numeric overflow or regard for the maximum array size. We should 
 enforce the maximum buffer size to be 2GB and warn the user when they have 
 exceeded that amount.
 I'm open to the idea of flat-out failing the initialization of the Spark 
 Context if the buffer size is over 2GB, but I'm afraid that could break 
 backwards-compatibility... although one can argue that the user had incorrect 
 buffer sizes in the first place.
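 A minimal sketch of the kind of guard described above (the config key and limit handling 
 here are assumptions, not the merged implementation):
 {code}
 import org.apache.spark.SparkConf

 // Reject buffer sizes that cannot be backed by a Java byte array (< 2 GB).
 val conf = new SparkConf()
 val maxBufferSizeMb = conf.getInt("spark.kryoserializer.buffer.max.mb", 64)
 require(maxBufferSizeMb < 2048,
   s"spark.kryoserializer.buffer.max.mb must be less than 2048, got $maxBufferSizeMb")
 val maxBufferSizeBytes = maxBufferSizeMb * 1024 * 1024  // stays below Int.MaxValue
 {code}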



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6549) Spark console logger logs to stderr by default

2015-03-26 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-6549.

Resolution: Won't Fix

I think this is a won't fix due to compatibility issues. If I'm wrong, please 
feel free to re-open.

 Spark console logger logs to stderr by default
 --

 Key: SPARK-6549
 URL: https://issues.apache.org/jira/browse/SPARK-6549
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.2.0
Reporter: Pavel Sakun
Priority: Trivial
  Labels: log4j

 Spark's console logger is configured to log messages at INFO level to stderr 
 by default, while it should be stdout:
 https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/log4j-defaults.properties
 https://github.com/apache/spark/blob/master/conf/log4j.properties.template
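 As a workaround, a user-supplied conf/log4j.properties can point the console appender at 
 stdout instead (a sketch based on the default template linked above):
 {code}
 log4j.rootCategory=INFO, console
 log4j.appender.console=org.apache.log4j.ConsoleAppender
 log4j.appender.console.target=System.out
 log4j.appender.console.layout=org.apache.log4j.PatternLayout
 log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
 {code}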



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: RDD.map does not allowed to preservesPartitioning?

2015-03-26 Thread Patrick Wendell
I think we have a version of mapPartitions that allows you to tell
Spark the partitioning is preserved:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L639

We could also add a map function that does the same. Or you can just write
your map in terms of mapPartitions over an iterator.
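
For example, using the r3/r4 names from Zhan's example quoted below, a sketch
(untested) that only touches the value and can therefore safely declare the
partitioning preserved:

  // Equivalent of r3.map(...), but keeps r3's partitioner.
  val r4 = r3.mapPartitions(
    iter => iter.map { case (k, v) => (k, v + 1) },
    preservesPartitioning = true)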

- Patrick

On Thu, Mar 26, 2015 at 3:07 PM, Jonathan Coveney jcove...@gmail.com wrote:
 This is just a deficiency of the API, imo. I agree: mapValues could
 definitely be a function (K, V) => V1. The option isn't set by the function,
 it's on the RDD. So you could look at the code and do this.
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala

  def mapValues[U](f: V => U): RDD[(K, U)] = {
    val cleanF = self.context.clean(f)
    new MapPartitionsRDD[(K, U), (K, V)](self,
      (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
      preservesPartitioning = true)
  }

 What you want:

  def mapValues[U](f: (K, V) => U): RDD[(K, U)] = {
    val cleanF = self.context.clean(f)
    new MapPartitionsRDD[(K, U), (K, V)](self,
      (context, pid, iter) => iter.map { case t @ (k, _) => (k, cleanF(t)) },
      preservesPartitioning = true)
  }

 One of the nice things about spark is that making such new operators is very
 easy :)

 2015-03-26 17:54 GMT-04:00 Zhan Zhang zzh...@hortonworks.com:

 Thanks Jonathan. You are right regarding rewrite the example.

 I mean providing such option to developer so that it is controllable. The
 example may seems silly, and I don't know the use cases.

 But for example, if I also want to operate both the key and value part to
 generate some new value with keeping key part untouched. Then mapValues may
 not be able to  do this.

 Changing the code to allow this is trivial, but I don't know whether there
 is some special reason behind this.

 Thanks.

 Zhan Zhang




 On Mar 26, 2015, at 2:49 PM, Jonathan Coveney jcove...@gmail.com wrote:

 I believe if you do the following:


 sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4)).map((_,1)).reduceByKey(_+_).mapValues(_+1).reduceByKey(_+_).toDebugString

 (8) MapPartitionsRDD[34] at reduceByKey at <console>:23 []
  |  MapPartitionsRDD[33] at mapValues at <console>:23 []
  |  ShuffledRDD[32] at reduceByKey at <console>:23 []
  +-(8) MapPartitionsRDD[31] at map at <console>:23 []
 |  ParallelCollectionRDD[30] at parallelize at <console>:23 []

 The difference is that Spark has no way to know that your map closure
 doesn't change the key. If you only use mapValues, it does know. Pretty cool that
 they optimized that :)

 2015-03-26 17:44 GMT-04:00 Zhan Zhang zzh...@hortonworks.com:

 Hi Folks,

 Does anybody know the reason for not allowing preservesPartitioning
 in RDD.map? Am I missing something here?

 The following example involves two shuffles. I think if preservesPartitioning
 were allowed, we could avoid the second one, right?

  val r1 = sc.parallelize(List(1,2,3,4,5,5,6,6,7,8,9,10,2,4))
  val r2 = r1.map((_, 1))
  val r3 = r2.reduceByKey(_+_)
  val r4 = r3.map(x => (x._1, x._2 + 1))
  val r5 = r4.reduceByKey(_+_)
  r5.collect.foreach(println)

 scala> r5.toDebugString
 res2: String =
 (8) ShuffledRDD[4] at reduceByKey at <console>:29 []
  +-(8) MapPartitionsRDD[3] at map at <console>:27 []
 |  ShuffledRDD[2] at reduceByKey at <console>:25 []
 +-(8) MapPartitionsRDD[1] at map at <console>:23 []
|  ParallelCollectionRDD[0] at parallelize at <console>:21 []

 Thanks.

 Zhan Zhang

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[jira] [Updated] (SPARK-6499) pyspark: printSchema command on a dataframe hangs

2015-03-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6499:
---
Component/s: PySpark

 pyspark: printSchema command on a dataframe hangs
 -

 Key: SPARK-6499
 URL: https://issues.apache.org/jira/browse/SPARK-6499
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: cynepia
 Attachments: airports.json, pyspark.txt


 1. A printSchema() on a dataframe fails to respond even after a lot of time
 Will attach the console logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6520) Kryo serialization broken in the shell

2015-03-25 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6520:
---
Component/s: Spark Shell

 Kryo serialization broken in the shell
 --

 Key: SPARK-6520
 URL: https://issues.apache.org/jira/browse/SPARK-6520
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.3.0
Reporter: Aaron Defazio

 If I start spark as follows:
 {quote}
 ~/spark-1.3.0-bin-hadoop2.4/bin/spark-shell --master local[1] --conf 
 spark.serializer=org.apache.spark.serializer.KryoSerializer
 {quote}
 Then using :paste, run 
 {quote}
 case class Example(foo : String, bar : String)
 val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", 
 "bar2"))).collect()
 {quote}
 I get the error:
 {quote}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
 (TID 0, localhost): java.io.IOException: 
 com.esotericsoftware.kryo.KryoException: Error constructing instance of 
 class: $line3.$read
 Serialization trace:
 $VAL10 ($iwC)
 $outer ($iwC$$iwC)
 $outer ($iwC$$iwC$Example)
   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1140)
   at 
 org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:979)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1873)
   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)
   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 {quote}
 As far as I can tell, when using :paste, Kryo serialization doesn't work for 
 classes defined within the same paste. It does work when the statements 
 are entered without paste.
 This issue seems serious to me, since Kryo serialization is virtually 
 mandatory for performance (20x slower with default serialization on my 
 problem), and I'm assuming feature parity between spark-shell and 
 spark-submit is a goal.
 Note that this is different from SPARK-6497, which covers the case when Kryo 
 is set to require registration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue

2015-03-25 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14380432#comment-14380432
 ] 

Patrick Wendell commented on SPARK-6481:


Hey All,

One issue here (I think?): right now, unfortunately, no users have sufficient 
permission to make the state change into In Progress because of the way that 
the JIRA is currently set up. We don't expose the Start Progress button on any 
screen, so I think that makes it unavailable from the API call as well. At 
least, I just used my own credentials and was not able to see the Start 
Progress transition on a JIRA, even though AFAIK I have the highest 
permissions possible.

The reason we do this, I think, is that we wanted to restrict assignment of 
JIRAs to the committership for now, and the Start Progress button 
automatically assigns the issue to whoever clicks it.

In my ideal world, typical users cannot modify this state transition and it is 
only possible to put an issue in progress via a GitHub pull request. If there 
is a permission scheme that allows that, then we should ask ASF to enable it 
for our JIRA.

In terms of assignment, I'd say for now just leave the assignment as it was 
before.

 Set In Progress when a PR is opened for an issue
 --

 Key: SPARK-6481
 URL: https://issues.apache.org/jira/browse/SPARK-6481
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Michael Armbrust
Assignee: Nicholas Chammas

 [~pwendell] and I are not sure if this is possible, but it would be really 
 helpful if the JIRA status was updated to In Progress when we do the 
 linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: hadoop input/output format advanced control

2015-03-25 Thread Patrick Wendell
Yeah I agree that might have been nicer, but I think for consistency
with the input API's maybe we should do the same thing. We can also
give an example of how to clone sc.hadoopConfiguration and then set
some new values:

val conf = sc.hadoopConfiguration.clone()
  .set("k1", "v1")
  .set("k2", "v2")

val rdd = sc.objectFile(..., conf)

I have no idea if that's the correct syntax, but something like that
seems almost as easy as passing a hashmap with deltas.
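
For the record, a sketch of that idea that should work against the existing
APIs (Hadoop's Configuration is copied via its copy constructor, as Sandy
points out elsewhere in this thread, and the per-RDD conf is passed to one of
the input methods that accept a Configuration):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  // Copy the SparkContext-wide Hadoop conf and override settings for this input only.
  val conf = new Configuration(sc.hadoopConfiguration)
  conf.set("mapred.min.split.size", "12345")

  val rdd = sc.newAPIHadoopFile("/some/path",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)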

- Patrick

On Wed, Mar 25, 2015 at 6:34 AM, Koert Kuipers ko...@tresata.com wrote:
 my personal preference would be something like a Map[String, String] that
 only reflects the changes you want to make to the Configuration for the given
 input/output format (so system-wide defaults continue to come from
 sc.hadoopConfiguration), similarly to what cascading/scalding did, but an
 arbitrary Configuration will work too.

 i will make a jira and pullreq when i have some time.



 On Wed, Mar 25, 2015 at 1:23 AM, Patrick Wendell pwend...@gmail.com wrote:

 I see - if you look, in the saving functions we have the option for
 the user to pass an arbitrary Configuration.


 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L894

 It seems fine to have the same option for the loading functions, if
 it's easy to just pass this config into the input format.



 On Tue, Mar 24, 2015 at 3:46 PM, Koert Kuipers ko...@tresata.com wrote:
  the (compression) codec parameter that is now part of many saveAs...
  methods
  came from a similar need. see SPARK-763
  hadoop has many options like this. you're either going to have to allow
  many
  more of these optional arguments to all the methods that read from
  hadoop
  inputformats and write to hadoop outputformats, or you force people to
   re-create these methods using HadoopRDD, i think (if that's even
  possible).
 
  On Tue, Mar 24, 2015 at 6:40 PM, Koert Kuipers ko...@tresata.com
  wrote:
 
  i would like to use objectFile with some tweaks to the hadoop conf.
  currently there is no way to do that, except recreating objectFile
  myself.
  and some of the code objectFile uses i have no access to, since its
  private
  to spark.
 
 
  On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  Yeah - to Nick's point, I think the way to do this is to pass in a
  custom conf when you create a Hadoop RDD (that's AFAIK why the conf
  field is there). Is there anything you can't do with that feature?
 
  On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
  nick.pentre...@gmail.com wrote:
   Imran, on your point to read multiple files together in a partition,
   is
   it
   not simpler to use the approach of copy Hadoop conf and set per-RDD
   settings for min split to control the input size per partition,
   together
   with something like CombineFileInputFormat?
  
   On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com
   wrote:
  
   I think this would be a great addition, I totally agree that you
   need
   to be
   able to set these at a finer context than just the SparkContext.
  
   Just to play devil's advocate, though -- the alternative is for you
   just
   subclass HadoopRDD yourself, or make a totally new RDD, and then
   you
   could
   expose whatever you need.  Why is this solution better?  IMO the
   criteria
   are:
   (a) common operations
   (b) error-prone / difficult to implement
   (c) non-obvious, but important for performance
  
    I think this case fits (a) & (c), so I think it's still worthwhile.
    But it's
    also worth asking whether or not it's too difficult for a user to
   extend
   HadoopRDD right now.  There have been several cases in the past
   week
   where
   we've suggested that a user should read from hdfs themselves (eg.,
   to
   read
   multiple files together in one partition) -- with*out* reusing the
   code in
    HadoopRDD, though they would lose things like the metric tracking &
    preferred locations you get from HadoopRDD.  Does HadoopRDD need some
   refactoring to make that easier to do?  Or do we just need a good
   example?
  
   Imran
  
   (sorry for hijacking your thread, Koert)
  
  
  
   On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com
   wrote:
  
see email below. reynold suggested i send it to dev instead of
user
   
-- Forwarded message --
From: Koert Kuipers ko...@tresata.com
Date: Mon, Mar 23, 2015 at 4:36 PM
Subject: hadoop input/output format advanced control
To: u...@spark.apache.org u...@spark.apache.org
   
   
currently its pretty hard to control the Hadoop Input/Output
formats
used
in Spark. The conventions seems to be to add extra parameters to
all
methods and then somewhere deep inside the code (for example in
PairRDDFunctions.saveAsHadoopFile) all these parameters get
translated
   into
settings on the Hadoop Configuration object.
   
for example for compression i see codec

Re: hadoop input/output format advanced control

2015-03-25 Thread Patrick Wendell
Great - that's even easier. Maybe we could have a simple example in the doc.

On Wed, Mar 25, 2015 at 7:06 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
 Regarding Patrick's question, you can just do new Configuration(oldConf)
 to get a cloned Configuration object and add any new properties to it.

 -Sandy

 On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid iras...@cloudera.com wrote:

 Hi Nick,

 I don't remember the exact details of these scenarios, but I think the user
 wanted a lot more control over how the files got grouped into partitions,
 to group the files together by some arbitrary function.  I didn't think
 that was possible w/ CombineFileInputFormat, but maybe there is a way?

 thanks

 On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath nick.pentre...@gmail.com
 wrote:

  Imran, on your point to read multiple files together in a partition, is
 it
  not simpler to use the approach of copy Hadoop conf and set per-RDD
  settings for min split to control the input size per partition, together
  with something like CombineFileInputFormat?
 
  On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com
  wrote:
 
   I think this would be a great addition, I totally agree that you need
 to
  be
   able to set these at a finer context than just the SparkContext.
  
   Just to play devil's advocate, though -- the alternative is for you
 just
   subclass HadoopRDD yourself, or make a totally new RDD, and then you
  could
   expose whatever you need.  Why is this solution better?  IMO the
 criteria
   are:
   (a) common operations
   (b) error-prone / difficult to implement
   (c) non-obvious, but important for performance
  
   I think this case fits (a) & (c), so I think it's still worthwhile.  But it's
   also worth asking whether or not it's too difficult for a user to extend
   HadoopRDD right now.  There have been several cases in the past week
  where
   we've suggested that a user should read from hdfs themselves (eg., to
  read
   multiple files together in one partition) -- with*out* reusing the code
  in
   HadoopRDD, though they would lose things like the metric tracking &
   preferred locations you get from HadoopRDD.  Does HadoopRDD need some
   refactoring to make that easier to do?  Or do we just need a good
  example?
  
   Imran
  
   (sorry for hijacking your thread, Koert)
  
  
  
   On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com
  wrote:
  
see email below. reynold suggested i send it to dev instead of user
   
-- Forwarded message --
From: Koert Kuipers ko...@tresata.com
Date: Mon, Mar 23, 2015 at 4:36 PM
Subject: hadoop input/output format advanced control
To: u...@spark.apache.org u...@spark.apache.org
   
   
currently its pretty hard to control the Hadoop Input/Output formats
  used
in Spark. The conventions seems to be to add extra parameters to all
methods and then somewhere deep inside the code (for example in
PairRDDFunctions.saveAsHadoopFile) all these parameters get
 translated
   into
settings on the Hadoop Configuration object.
   
    for example for compression i see codec: Option[Class[_ <:
CompressionCodec]] = None added to a bunch of methods.
   
how scalable is this solution really?
   
for example i need to read from a hadoop dataset and i dont want the
   input
(part) files to get split up. the way to do this is to set
mapred.min.split.size. now i dont want to set this at the level of
  the
SparkContext (which can be done), since i dont want it to apply to
  input
formats in general. i want it to apply to just this one specific
 input
dataset i need to read. which leaves me with no options currently. i
   could
go add yet another input parameter to all the methods
(SparkContext.textFile, SparkContext.hadoopFile,
  SparkContext.objectFile,
etc.). but that seems ineffective.
   
why can we not expose a Map[String, String] or some other generic way
  to
manipulate settings for hadoop input/output formats? it would require
adding one more parameter to all methods to deal with hadoop
  input/output
formats, but after that its done. one parameter to rule them all
   
then i could do:
    val x = sc.textFile("/some/path", formatSettings =
    Map("mapred.min.split.size" -> "12345"))
   
or
    rdd.saveAsTextFile("/some/path", formatSettings =
    Map("mapred.output.compress" -> "true",
        "mapred.output.compression.codec" -> somecodec))
   
  
 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: 1.3 Hadoop File System problem

2015-03-24 Thread Patrick Wendell
Hey Jim,

Thanks for reporting this. Can you give a small end-to-end code
example that reproduces it? If so, we can definitely fix it.

- Patrick

On Tue, Mar 24, 2015 at 4:55 PM, Jim Carroll jimfcarr...@gmail.com wrote:

 I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails to
 find the s3 hadoop file system.

 I get java.lang.IllegalArgumentException: Wrong FS: s3://[path to my
 file], expected: file:/// when I try to save a parquet file. This worked in
 1.2.1.

 Has anyone else seen this?

 I'm running spark using local[8] so it's all internal. These are actually
 unit tests in our app that are failing now.

 Thanks.
 Jim




 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/1-3-Hadoop-File-System-problem-tp22207.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Any guidance on when to back port and how far?

2015-03-24 Thread Patrick Wendell
My philosophy has been basically what you suggested, Sean. One thing
you didn't mention though is if a bug fix seems complicated, I will
think very hard before back-porting it. This is because fixes can
introduce their own new bugs, in some cases worse than the original
issue. It's really bad to have someone upgrade to a patch release and see
a regression - with our current approach this almost never happens.

I will usually try to backport up to N-2, if it can be back-ported
reasonably easily (for instance, with minor or no code changes). The
reason I do this is that vendors do end up supporting older versions,
and it's nice for them if some committer has backported a fix that
they can then pull in, even if we never ship it.

In terms of doing older maintenance releases, this one I think we
should do according to severity of issues (for instance, if there is a
security issue) or based on general command from the community. I
haven't initiated many 1.X.2 releases recently because I didn't see
huge demand. However, personally I don't mind doing these if there is
a lot of demand, at least for releases where .0 has gone out in the
last six months.

On Tue, Mar 24, 2015 at 11:23 AM, Michael Armbrust
mich...@databricks.com wrote:
 Two other criteria that I use when deciding what to backport:
  - Is it a regression from a previous minor release?  I'm much more likely
 to backport fixes in this case, as I'd love for most people to stay up to
 date.
  - How scary is the change?  I think the primary goal is stability of the
 maintenance branches.  When I am confident that something is isolated and
 unlikely to break things (i.e. I'm fixing a confusing error message), then
 i'm much more likely to backport it.

 Regarding the length of time to continue backporting, I mostly don't
 backport to N-1, but this is partially because SQL is changing too fast for
 that to generally be useful.  These old branches usually only get attention
 from me when there is an explicit request.

 I'd love to hear more feedback from others.

 Michael

 On Tue, Mar 24, 2015 at 6:13 AM, Sean Owen so...@cloudera.com wrote:

 So far, my rule of thumb has been:

 - Don't back-port new features or improvements in general, only bug fixes
 - Don't back-port minor bug fixes
 - Back-port bug fixes that seem important enough to not wait for the
 next minor release
 - Back-port site doc changes to the release most likely to go out
 next, to make it a part of the next site publish

 But, how far should back-ports go, in general? If the last minor
 release was 1.N, then to branch 1.N surely. Farther back is a question
 of expectation for support of past minor releases. Given the pace of
 change and time available, I assume there's not much support for
 continuing to use release 1.(N-1) and very little for 1.(N-2).

 Concretely: does anyone expect a 1.1.2 release ever? a 1.2.2 release?
 It'd be good to hear the received wisdom explicitly.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: hadoop input/output format advanced control

2015-03-24 Thread Patrick Wendell
Yeah - to Nick's point, I think the way to do this is to pass in a
custom conf when you create a Hadoop RDD (that's AFAIK why the conf
field is there). Is there anything you can't do with that feature?

On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
nick.pentre...@gmail.com wrote:
 Imran, on your point to read multiple files together in a partition, is it
 not simpler to use the approach of copy Hadoop conf and set per-RDD
 settings for min split to control the input size per partition, together
 with something like CombineFileInputFormat?

 On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com wrote:

 I think this would be a great addition, I totally agree that you need to be
 able to set these at a finer context than just the SparkContext.

 Just to play devil's advocate, though -- the alternative is for you just
 subclass HadoopRDD yourself, or make a totally new RDD, and then you could
 expose whatever you need.  Why is this solution better?  IMO the criteria
 are:
 (a) common operations
 (b) error-prone / difficult to implement
 (c) non-obvious, but important for performance

 I think this case fits (a) & (c), so I think it's still worthwhile.  But it's
 also worth asking whether or not it's too difficult for a user to extend
 HadoopRDD right now.  There have been several cases in the past week where
 we've suggested that a user should read from hdfs themselves (eg., to read
 multiple files together in one partition) -- with*out* reusing the code in
 HadoopRDD, though they would lose things like the metric tracking &
 preferred locations you get from HadoopRDD.  Does HadoopRDD need some
 refactoring to make that easier to do?  Or do we just need a good example?

 Imran

 (sorry for hijacking your thread, Koert)



 On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com wrote:

  see email below. reynold suggested i send it to dev instead of user
 
  -- Forwarded message --
  From: Koert Kuipers ko...@tresata.com
  Date: Mon, Mar 23, 2015 at 4:36 PM
  Subject: hadoop input/output format advanced control
  To: u...@spark.apache.org u...@spark.apache.org
 
 
  currently its pretty hard to control the Hadoop Input/Output formats used
  in Spark. The conventions seems to be to add extra parameters to all
  methods and then somewhere deep inside the code (for example in
  PairRDDFunctions.saveAsHadoopFile) all these parameters get translated
 into
  settings on the Hadoop Configuration object.
 
  for example for compression i see codec: Option[Class[_ <:
  CompressionCodec]] = None added to a bunch of methods.
 
  how scalable is this solution really?
 
  for example i need to read from a hadoop dataset and i dont want the
 input
  (part) files to get split up. the way to do this is to set
  mapred.min.split.size. now i dont want to set this at the level of the
  SparkContext (which can be done), since i dont want it to apply to input
  formats in general. i want it to apply to just this one specific input
  dataset i need to read. which leaves me with no options currently. i
 could
  go add yet another input parameter to all the methods
  (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
  etc.). but that seems ineffective.
 
  why can we not expose a Map[String, String] or some other generic way to
  manipulate settings for hadoop input/output formats? it would require
  adding one more parameter to all methods to deal with hadoop input/output
  formats, but after that its done. one parameter to rule them all
 
  then i could do:
  val x = sc.textFile("/some/path", formatSettings =
  Map("mapred.min.split.size" -> "12345"))
 
  or
  rdd.saveAsTextFile("/some/path", formatSettings =
  Map("mapred.output.compress" -> "true",
      "mapred.output.compression.codec" -> somecodec))
 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Experience using binary packages on various Hadoop distros

2015-03-24 Thread Patrick Wendell
Hey All,

For a while we've published binary packages with different Hadoop
client's pre-bundled. We currently have three interfaces to a Hadoop
cluster (a) the HDFS client (b) the YARN client (c) the Hive client.

Because (a) and (b) are supposed to be backwards-compatible
interfaces, my working assumption was that for the most part (modulo
Hive) our packages work with *newer* Hadoop versions. For instance,
our Hadoop 2.4 package should work with HDFS 2.6 and YARN 2.6.
However, I have heard murmurings that these are not compatible in
practice.

So I have three questions I'd like to put out to the community:

1. Have people had difficulty using 2.4 packages with newer Hadoop
versions? If so, what specific incompatibilities have you hit?
2. Have people had issues using our binary Hadoop packages in general
with commercial or Apache Hadoop distro's, such that you have to build
from source?
3. How would people feel about publishing a bring-your-own-Hadoop
binary, where you are required to point Spark at a local Hadoop
distribution by setting HADOOP_HOME? This might be better for ensuring
full compatibility:
https://issues.apache.org/jira/browse/SPARK-6511

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[jira] [Commented] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]

2015-03-23 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376495#comment-14376495
 ] 

Patrick Wendell commented on SPARK-2331:


By the way - [~rxin] recently pointed out to me that EmptyRDD is private[spark].

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/EmptyRDD.scala#L27

Given that, I'm sort of confused how people were using it before. I'm not 
totally sure how making a class private[spark] affects its use in a return type.
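
My rough understanding (an assumption on my part, not verified against the compiler): the 
value is still usable, you just cannot name its precise type outside the spark package, which 
is why callers end up widening explicitly:

{code}
import org.apache.spark.rdd.RDD

val e = sc.emptyRDD[Int]        // inferred type is the inaccessible EmptyRDD[Int]
val ok: RDD[Int] = e            // fine: widen to the public supertype
// val bad: EmptyRDD[Int] = e   // does not compile outside org.apache.spark
{code}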

 SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
 --

 Key: SPARK-2331
 URL: https://issues.apache.org/jira/browse/SPARK-2331
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Ian Hummel
Priority: Minor

 The return type for SparkContext.emptyRDD is EmptyRDD[T].
 It should be RDD[T].  That means you have to add extra type annotations on 
 code like the below (which creates a union of RDDs over some subset of paths 
 in a folder)
 {code}
 val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { 
 (rdd, path) ⇒
   rdd.union(sc.textFile(path))
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-6122) Upgrade Tachyon dependency to 0.6.0

2015-03-23 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-6122:


I reverted this because it looks like it was responsible for some testing 
failures due to the dependency changes.

 Upgrade Tachyon dependency to 0.6.0
 ---

 Key: SPARK-6122
 URL: https://issues.apache.org/jira/browse/SPARK-6122
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.1
Reporter: Haoyuan Li
Assignee: Calvin Jia





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: enum-like types in Spark

2015-03-23 Thread Patrick Wendell
If the official solution from the Scala community is to use Java
enums, then it seems strange they aren't generated in scaladoc? Maybe
we can just fix that w/ Typesafe's help and then we can use them.
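
For anyone skimming the thread, a stripped-down sketch (not the actual Spark
code; the values are illustrative) of the enum-like Scala pattern being weighed
against plain Java enums:

  // Sealed trait + case objects: type-safe, but values() and parsing are maintained by hand.
  sealed trait StageStatus
  object StageStatus {
    case object Active   extends StageStatus
    case object Complete extends StageStatus
    case object Failed   extends StageStatus

    val values: Seq[StageStatus] = Seq(Active, Complete, Failed)  // must be kept in sync manually

    def fromString(s: String): StageStatus =
      values.find(_.toString.equalsIgnoreCase(s))
        .getOrElse(throw new IllegalArgumentException(s"Unknown status: $s"))
  }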

On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen so...@cloudera.com wrote:
 Yeah the fully realized #4, which gets back the ability to use it in
 switch statements (? in Scala but not Java?) does end up being kind of
 huge.

 I confess I'm swayed a bit back to Java enums, seeing what it
 involves. The hashCode() issue can be 'solved' with the hash of the
 String representation.

 On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid iras...@cloudera.com wrote:
 I've just switched some of my code over to the new format, and I just want
 to make sure everyone realizes what we are getting into.  I went from 10
 lines as java enums

 https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20

 to 30 lines with the new format:

 https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250

 It's not just that it's verbose: each name has to be repeated 4 times, with
 potential typos in some locations that won't be caught by the compiler.
 Also, you have to manually maintain the values as you update the set of
 enums, the compiler won't do it for you.

 The only downside I've heard for java enums is enum.hashcode().  OTOH, the
 downsides for this version are: maintainability / verbosity, no values(),
 more cumbersome to use from java, no enum map / enumset.

 I did put together a little util to at least get back the equivalent of
 enum.valueOf() with this format

 https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala

 I'm not trying to prevent us from moving forward on this, its fine if this
 is still what everyone wants, but I feel pretty strongly java enums make
 more sense.

 thanks,
 Imran

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-23 Thread Patrick Wendell
Hey Yiannis,

If you just perform a count on each (name, date) pair... can it succeed?
If so, can you do a count and then order by it to find the largest one?

I'm wondering if there is a single pathologically large group here that is
somehow causing OOM.
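
Something along these lines (untested sketch against the 1.3 DataFrame API,
using the `people` DataFrame from your original message) would show whether a
single group dominates:

  // Count rows per (name, date) group and look at the largest groups first.
  val counts = people.groupBy("name", "date").count()
  counts.orderBy(counts("count").desc).show(20)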

Also, to be clear, you are getting GC limit warnings on the executors, not
the driver. Correct?

- Patrick

On Mon, Mar 23, 2015 at 10:21 AM, Martin Goodson mar...@skimlinks.com
wrote:

 Have you tried to repartition() your original data to make more partitions
 before you aggregate?


 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240
 [image: Inline image 1]

 On Mon, Mar 23, 2015 at 4:12 PM, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Hi Yin,

 Yes, I have set spark.executor.memory to 8g and the worker memory to 16g
 without any success.
 I cannot figure out how to increase the number of mapPartitions tasks.

 Thanks a lot

 On 20 March 2015 at 18:44, Yin Huai yh...@databricks.com wrote:

 spark.sql.shuffle.partitions only control the number of tasks in the
 second stage (the number of reducers). For your case, I'd say that the
 number of tasks in the first state (number of mappers) will be the number
 of files you have.

 Actually, have you changed spark.executor.memory (it controls the
 memory for an executor of your application)? I did not see it in your
 original email. The difference between worker memory and executor memory
 can be found at (
 http://spark.apache.org/docs/1.3.0/spark-standalone.html),

 SPARK_WORKER_MEMORY
 Total amount of memory to allow Spark applications to use on the
 machine, e.g. 1000m, 2g (default: total memory minus 1 GB); note that
 each application's individual memory is configured using its
 spark.executor.memory property.


 On Fri, Mar 20, 2015 at 9:25 AM, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Actually I realized that the correct way is:

 sqlContext.sql("set spark.sql.shuffle.partitions=1000")

 but I am still experiencing the same behavior/error.

 On 20 March 2015 at 16:04, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Hi Yin,

 the way I set the configuration is:

 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 sqlContext.setConf("spark.sql.shuffle.partitions", "1000");

 it is the correct way right?
 In the mapPartitions task (the first task which is launched), I get
 again the same number of tasks and again the same error. :(

 Thanks a lot!

 On 19 March 2015 at 17:40, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Hi Yin,

 thanks a lot for that! Will give it a shot and let you know.

 On 19 March 2015 at 16:30, Yin Huai yh...@databricks.com wrote:

 Was the OOM thrown during the execution of first stage (map) or the
 second stage (reduce)? If it was the second stage, can you increase the
 value of spark.sql.shuffle.partitions and see if the OOM disappears?

 This setting controls the number of reduces Spark SQL will use and
 the default is 200. Maybe there are too many distinct values and the 
 memory
 pressure on every task (of those 200 reducers) is pretty high. You can
 start with 400 and increase it until the OOM disappears. Hopefully this
 will help.

 Thanks,

 Yin


 On Wed, Mar 18, 2015 at 4:46 PM, Yiannis Gkoufas 
 johngou...@gmail.com wrote:

 Hi Yin,

 Thanks for your feedback. I have 1700 parquet files, sized 100MB
 each. The number of tasks launched is equal to the number of parquet 
 files.
 Do you have any idea on how to deal with this situation?

 Thanks a lot
 On 18 Mar 2015 17:35, Yin Huai yh...@databricks.com wrote:

 Seems there are too many distinct groups processed in a task,
 which trigger the problem.

 How many files do your dataset have and how large is a file? Seems
 your query will be executed with two stages, table scan and map-side
 aggregation in the first stage and the final round of reduce-side
 aggregation in the second stage. Can you take a look at the numbers of
 tasks launched in these two stages?

 Thanks,

 Yin

 On Wed, Mar 18, 2015 at 11:42 AM, Yiannis Gkoufas 
 johngou...@gmail.com wrote:

 Hi there, I set the executor memory to 8g but it didn't help

 On 18 March 2015 at 13:59, Cheng Lian lian.cs@gmail.com
 wrote:

 You should probably increase executor memory by setting
 spark.executor.memory.

 Full list of available configurations can be found here
 http://spark.apache.org/docs/latest/configuration.html

 Cheng


 On 3/18/15 9:15 PM, Yiannis Gkoufas wrote:

 Hi there,

 I was trying the new DataFrame API with some basic operations
 on a parquet dataset.
 I have 7 nodes, each with 12 cores and 8GB of RAM allocated to the worker,
 in standalone cluster mode.
 The code is the following:

 val people = sqlContext.parquetFile("/data.parquet");
 val res = people.groupBy("name", "date").
   agg(sum("power"), sum("supply")).take(10);
 System.out.println(res);

 The dataset consists of 16 billion entries.
 The error I get is java.lang.OutOfMemoryError: GC overhead
 limit exceeded

 My configuration is:

 spark.serializer org.apache.spark.serializer.KryoSerializer
 

[jira] [Updated] (SPARK-6449) Driver OOM results in reported application result SUCCESS

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6449:
---
Component/s: (was: Spark Core)
 YARN

 Driver OOM results in reported application result SUCCESS
 -

 Key: SPARK-6449
 URL: https://issues.apache.org/jira/browse/SPARK-6449
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0
Reporter: Ryan Williams

 I ran a job yesterday that according to the History Server and YARN RM 
 finished with status {{SUCCESS}}.
 Clicking around the History Server UI, I saw that too few stages had run, and 
 I couldn't figure out why that would have been.
 Finally, inspecting the end of the driver's logs, I saw:
 {code}
 15/03/20 15:08:13 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
 Shutting down remote daemon.
 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
 Remote daemon shut down; proceeding with flushing remote transports.
 15/03/20 15:08:13 INFO spark.SparkContext: Successfully stopped SparkContext
 Exception in thread "Driver" scala.MatchError: java.lang.OutOfMemoryError: GC 
 overhead limit exceeded (of class java.lang.OutOfMemoryError)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485)
 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
 exitCode: 0, (reason: Shutdown hook called before final status was reported.)
 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Unregistering 
 ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before 
 final status was reported.)
 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
 Remoting shut down.
 15/03/20 15:08:13 INFO impl.AMRMClientImpl: Waiting for application to be 
 successfully unregistered.
 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Deleting staging directory 
 .sparkStaging/application_1426705269584_0055
 {code}
 The driver OOM'd, [the {{catch}} block that presumably should have caught 
 it|https://github.com/apache/spark/blob/b6090f902e6ec24923b4dde4aabc9076956521c1/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L484]
  threw a {{MatchError}}, and then {{SUCCESS}} was returned to YARN and 
 written to the event log.
 This should be logged as a failed job and reported as such to YARN.
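 For readers wondering how an OOM becomes a {{MatchError}}: a stripped-down, hypothetical sketch of the pattern (not the actual ApplicationMaster source) in which a non-exhaustive {{match}} on the failure cause throws {{scala.MatchError}} when handed a {{java.lang.Error}}:
 {code}
 // Hypothetical sketch, not the actual ApplicationMaster code.
 object MatchErrorSketch {
   def classify(cause: Throwable): String = cause match {
     case _: InterruptedException => "interrupted"
     case e: Exception            => "exception: " + e.getMessage
     // No case covers java.lang.Error, so an OutOfMemoryError makes the
     // match itself throw scala.MatchError(cause).
   }
 
   def main(args: Array[String]): Unit = {
     classify(new OutOfMemoryError("GC overhead limit exceeded"))
   }
 }
 {code}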



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6456) Spark Sql throwing exception on large partitioned data

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6456:
---
Component/s: (was: Spark Core)

 Spark Sql throwing exception on large partitioned data
 --

 Key: SPARK-6456
 URL: https://issues.apache.org/jira/browse/SPARK-6456
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: pankaj
 Fix For: 1.2.1


 Observation:
 Spark connects to the Hive metastore. I am able to run simple queries like 
 SHOW TABLES and SELECT,
 but it throws the exception below when running a query on a Hive table that has a large 
 number of partitions.
 {code}
 Exception in thread main java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
 at 
 org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
 org.apache.thrift.transport.TTransportException: 
 java.net.SocketTimeoutException: Read timed out
 at 
 org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785)
 at 
 org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316)
 at 
 org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86)
 at 
 org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
 at scala.Option.getOrElse(Option.scala:120)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
 at 
 org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2858) Default log4j configuration no longer seems to work

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2858.

Resolution: Invalid

This is really old and I don't think it's still an issue. I'm just closing this 
as invalid.

 Default log4j configuration no longer seems to work
 ---

 Key: SPARK-2858
 URL: https://issues.apache.org/jira/browse/SPARK-2858
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell

 For reasons unknown this doesn't seem to be working anymore. I deleted my 
 log4j.properties file, did a fresh build, and noticed it still gave me a 
 verbose stack trace when port 4040 was contended (which is a log we silence 
 in the conf). I actually think this was an issue even before [~sowen]'s 
 changes, so not sure what's up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5863) Improve performance of convertToScala codepath.

2015-03-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375229#comment-14375229
 ] 

Patrick Wendell commented on SPARK-5863:


This seems worth potentially fixing in 1.3.1, so I added that. I think it will 
depend on how surgical the fix is.

 Improve performance of convertToScala codepath.
 ---

 Key: SPARK-5863
 URL: https://issues.apache.org/jira/browse/SPARK-5863
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.2.1
Reporter: Cristian
Priority: Critical

 I was doing some perf testing on reading Parquet files and noticed that, moving 
 from Spark 1.1 to 1.2, the performance is 3x worse. In the profiler the 
 culprit showed up as ScalaReflection.convertRowToScala.
 In particular, this zip is the issue:
 {code}
 r.toSeq.zip(schema.fields.map(_.dataType))
 {code}
 I see there's currently a comment noting that this is slow, but it wasn't 
 fixed. This actually produces a 3x degradation in Parquet read performance, 
 at least in my test case.
 Edit: the map is part of the issue as well. This whole code block is in a 
 tight loop and allocates a new ListBuffer that needs to grow for each 
 transformation. A possible solution is to switch to seq.view, which 
 would allocate iterators instead.
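 To make the allocation concern concrete, a hypothetical sketch of the view-based alternative (simplified stand-in types, not the actual Spark patch):
 {code}
 // Hypothetical illustration with stand-in types, not the real Catalyst code.
 case class Field(name: String, dataType: String)
 case class Schema(fields: Array[Field])
 
 def convertAll(rows: Seq[Array[Any]], schema: Schema): Seq[Seq[(Any, String)]] = {
   // Hoisted out of the per-row loop; the original re-evaluated
   // schema.fields.map(_.dataType) for every row.
   val dataTypes: Seq[String] = schema.fields.map(_.dataType).toSeq
   rows.map { r =>
     // The lazy view avoids growing an intermediate buffer for the zip;
     // only the final toSeq materializes a collection.
     r.toSeq.view.zip(dataTypes).toSeq
   }
 }
 {code}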



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5863) Improve performance of convertToScala codepath.

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5863:
---
Target Version/s: 1.3.1, 1.4.0  (was: 1.4.0)

 Improve performance of convertToScala codepath.
 ---

 Key: SPARK-5863
 URL: https://issues.apache.org/jira/browse/SPARK-5863
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.2.1
Reporter: Cristian
Priority: Critical

 I was doing some perf testing on reading Parquet files and noticed that, moving 
 from Spark 1.1 to 1.2, the performance is 3x worse. In the profiler the 
 culprit showed up as ScalaReflection.convertRowToScala.
 In particular, this zip is the issue:
 {code}
 r.toSeq.zip(schema.fields.map(_.dataType))
 {code}
 I see there's currently a comment noting that this is slow, but it wasn't 
 fixed. This actually produces a 3x degradation in Parquet read performance, 
 at least in my test case.
 Edit: the map is part of the issue as well. This whole code block is in a 
 tight loop and allocates a new ListBuffer that needs to grow for each 
 transformation. A possible solution is to switch to seq.view, which 
 would allocate iterators instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4227) Document external shuffle service

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4227:
---
Priority: Critical  (was: Major)

 Document external shuffle service
 -

 Key: SPARK-4227
 URL: https://issues.apache.org/jira/browse/SPARK-4227
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Priority: Critical

 We should add spark.shuffle.service.enabled to the Configuration page and 
 give instructions for launching the shuffle service as an auxiliary service 
 on YARN.
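 For reference, a minimal sketch of the application-side setting in question (hypothetical example, not wording proposed for the documentation):
 {code}
 // Hypothetical illustration; values are placeholders, not from this ticket.
 val conf = new org.apache.spark.SparkConf()
   .set("spark.shuffle.service.enabled", "true")    // use the external shuffle service
   .set("spark.dynamicAllocation.enabled", "true")  // a common reason to enable it
 {code}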



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4227) Document external shuffle service

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4227:
---
Target Version/s: 1.3.1, 1.4.0  (was: 1.3.0, 1.4.0)

 Document external shuffle service
 -

 Key: SPARK-4227
 URL: https://issues.apache.org/jira/browse/SPARK-4227
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza

 We should add spark.shuffle.service.enabled to the Configuration page and 
 give instructions for launching the shuffle service as an auxiliary service 
 on YARN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5863) Improve performance of convertToScala codepath.

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5863:
---
Target Version/s: 1.4.0  (was: 1.3.1, 1.4.0)

 Improve performance of convertToScala codepath.
 ---

 Key: SPARK-5863
 URL: https://issues.apache.org/jira/browse/SPARK-5863
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.2.1
Reporter: Cristian
Priority: Critical

 I was doing some perf testing on reading Parquet files and noticed that, moving 
 from Spark 1.1 to 1.2, the performance is 3x worse. In the profiler the 
 culprit showed up as ScalaReflection.convertRowToScala.
 In particular, this zip is the issue:
 {code}
 r.toSeq.zip(schema.fields.map(_.dataType))
 {code}
 I see there's currently a comment noting that this is slow, but it wasn't 
 fixed. This actually produces a 3x degradation in Parquet read performance, 
 at least in my test case.
 Edit: the map is part of the issue as well. This whole code block is in a 
 tight loop and allocates a new ListBuffer that needs to grow for each 
 transformation. A possible solution is to switch to seq.view, which 
 would allocate iterators instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6012) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6012:
---
Target Version/s: 1.4.0  (was: 1.3.1, 1.4.0)

 Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered 
 operator
 --

 Key: SPARK-6012
 URL: https://issues.apache.org/jira/browse/SPARK-6012
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Max Seiden
Priority: Critical

 h3. Summary
 I've found that a deadlock occurs when asking for the partitions from a 
 SchemaRDD that has a TakeOrdered as its terminal operator. The problem occurs 
 when a child RDD asks the DAGScheduler for preferred partition locations 
 (which locks the scheduler) and eventually hits the #execute() of the 
 TakeOrdered operator, which submits tasks but is blocked when it also tries 
 to get preferred locations (in a separate thread). It seems like the 
 TakeOrdered op's #execute() method should not actually submit a task (it is 
 calling #executeCollect() and creating a new RDD) and should instead stay 
 truer to the comment "logically apply a Limit on top of a Sort". 
 In my particular case, I am forcing a repartition of a SchemaRDD with a 
 terminal Limit(..., Sort(...)), which is where the CoalescedRDD comes into 
 play.
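 As a stripped-down illustration of the locking pattern described above (hypothetical code, not the real DAGScheduler/TakeOrdered implementation):
 {code}
 // One thread takes the scheduler lock, then blocks on work that a second
 // thread can only finish by acquiring that same lock.
 object DeadlockSketch {
   private val schedulerLock = new Object
 
   def getPreferredLocs(): Unit = schedulerLock.synchronized {
     val submitter = new Thread(new Runnable {
       def run(): Unit = submitJob()
     })
     submitter.start()
     submitter.join() // waits forever: submitJob() needs the lock we hold
   }
 
   def submitJob(): Unit = schedulerLock.synchronized {
     println("never reached while getPreferredLocs() holds the lock")
   }
 
   def main(args: Array[String]): Unit = getPreferredLocs()
 }
 {code}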
 h3. Stack Traces
 h4. Task Submission
 {noformat}
 main prio=5 tid=0x7f8e7280 nid=0x1303 in Object.wait() 
 [0x00010ed5e000]
java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 - waiting on 0x0007c4c239b8 (a 
 org.apache.spark.scheduler.JobWaiter)
 at java.lang.Object.wait(Object.java:503)
 at 
 org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
 - locked 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter)
 at 
 org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1390)
 at org.apache.spark.rdd.RDD.reduce(RDD.scala:884)
 at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1161)
 at 
 org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:183)
 at 
 org.apache.spark.sql.execution.TakeOrdered.execute(basicOperators.scala:188)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
 - locked 0x0007c36ce038 (a 
 org.apache.spark.sql.hive.HiveContext$$anon$7)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
 at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:127)
 at 
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209)
 at 
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207)
 at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1278)
 at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:122)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
 at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:79)
 at 
 org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
 at 
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209)
 at 
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1333)
 at 
 org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1304)
 - locked 0x0007f55c2238 (a 
 org.apache.spark.scheduler.DAGScheduler)
 at 
 org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1148)
 at 
 org.apache.spark.rdd.PartitionCoalescer.currPrefLocs

[jira] [Updated] (SPARK-6012) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator

2015-03-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6012:
---
Target Version/s: 1.3.1, 1.4.0  (was: 1.4.0)

 Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered 
 operator
 --

 Key: SPARK-6012
 URL: https://issues.apache.org/jira/browse/SPARK-6012
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Max Seiden
Priority: Critical

 h3. Summary
 I've found that a deadlock occurs when asking for the partitions from a 
 SchemaRDD that has a TakeOrdered as its terminal operator. The problem occurs 
 when a child RDD asks the DAGScheduler for preferred partition locations 
 (which locks the scheduler) and eventually hits the #execute() of the 
 TakeOrdered operator, which submits tasks but is blocked when it also tries 
 to get preferred locations (in a separate thread). It seems like the 
 TakeOrdered op's #execute() method should not actually submit a task (it is 
 calling #executeCollect() and creating a new RDD) and should instead stay 
 truer to the comment "logically apply a Limit on top of a Sort". 
 In my particular case, I am forcing a repartition of a SchemaRDD with a 
 terminal Limit(..., Sort(...)), which is where the CoalescedRDD comes into 
 play.
 h3. Stack Traces
 h4. Task Submission
 {noformat}
 main prio=5 tid=0x7f8e7280 nid=0x1303 in Object.wait() 
 [0x00010ed5e000]
java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 - waiting on 0x0007c4c239b8 (a 
 org.apache.spark.scheduler.JobWaiter)
 at java.lang.Object.wait(Object.java:503)
 at 
 org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
 - locked 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter)
 at 
 org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1390)
 at org.apache.spark.rdd.RDD.reduce(RDD.scala:884)
 at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1161)
 at 
 org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:183)
 at 
 org.apache.spark.sql.execution.TakeOrdered.execute(basicOperators.scala:188)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
 - locked 0x0007c36ce038 (a 
 org.apache.spark.sql.hive.HiveContext$$anon$7)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
 at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:127)
 at 
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209)
 at 
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207)
 at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1278)
 at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:122)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
 at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:79)
 at 
 org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
 at 
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209)
 at 
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1333)
 at 
 org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1304)
 - locked 0x0007f55c2238 (a 
 org.apache.spark.scheduler.DAGScheduler)
 at 
 org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1148)
 at 
 org.apache.spark.rdd.PartitionCoalescer.currPrefLocs

[jira] [Commented] (SPARK-5863) Improve performance of convertToScala codepath.

2015-03-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375230#comment-14375230
 ] 

Patrick Wendell commented on SPARK-5863:


Ah, actually I see [~marmbrus] was the one who set the target to 1.4.0, so I'm 
going to remove 1.3.1.

 Improve performance of convertToScala codepath.
 ---

 Key: SPARK-5863
 URL: https://issues.apache.org/jira/browse/SPARK-5863
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0, 1.2.1
Reporter: Cristian
Priority: Critical

 I was doing some perf testing on reading Parquet files and noticed that, moving 
 from Spark 1.1 to 1.2, the performance is 3x worse. In the profiler the 
 culprit showed up as ScalaReflection.convertRowToScala.
 In particular, this zip is the issue:
 {code}
 r.toSeq.zip(schema.fields.map(_.dataType))
 {code}
 I see there's currently a comment noting that this is slow, but it wasn't 
 fixed. This actually produces a 3x degradation in Parquet read performance, 
 at least in my test case.
 Edit: the map is part of the issue as well. This whole code block is in a 
 tight loop and allocates a new ListBuffer that needs to grow for each 
 transformation. A possible solution is to switch to seq.view, which 
 would allocate iterators instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4925:
---
Fix Version/s: (was: 1.2.1)
   (was: 1.3.0)

 Publish Spark SQL hive-thriftserver maven artifact 
 ---

 Key: SPARK-4925
 URL: https://issues.apache.org/jira/browse/SPARK-4925
 Project: Spark
  Issue Type: Improvement
  Components: Build, SQL
Affects Versions: 1.3.0, 1.2.1
Reporter: Alex Liu

 The hive-thriftserver maven artifact is needed for integrating Spark SQL with 
 Cassandra.
 Can we publish it to maven?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4925:
---
Priority: Critical  (was: Major)

 Publish Spark SQL hive-thriftserver maven artifact 
 ---

 Key: SPARK-4925
 URL: https://issues.apache.org/jira/browse/SPARK-4925
 Project: Spark
  Issue Type: Improvement
  Components: Build, SQL
Affects Versions: 1.3.0, 1.2.1
Reporter: Alex Liu
Priority: Critical

 The hive-thriftserver maven artifact is needed for integrating Spark SQL with 
 Cassandra.
 Can we publish it to maven?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4925:
---
Affects Version/s: (was: 1.2.0)
   1.3.0
   1.2.1

 Publish Spark SQL hive-thriftserver maven artifact 
 ---

 Key: SPARK-4925
 URL: https://issues.apache.org/jira/browse/SPARK-4925
 Project: Spark
  Issue Type: Improvement
  Components: Build, SQL
Affects Versions: 1.3.0, 1.2.1
Reporter: Alex Liu
Priority: Critical

 The hive-thriftserver maven artifact is needed for integrating Spark SQL with 
 Cassandra.
 Can we publish it to maven?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4123) Show dependency changes in pull requests

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4123:
---
Summary: Show dependency changes in pull requests  (was: Show new 
dependencies added in pull requests)

 Show dependency changes in pull requests
 

 Key: SPARK-4123
 URL: https://issues.apache.org/jira/browse/SPARK-4123
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Brennon York
Priority: Critical

 We should inspect the classpath of Spark's assembly jar for every pull 
 request. This only takes a few seconds in Maven and it will help weed out 
 dependency changes from the master branch. Ideally we'd post any dependency 
 changes in the pull request message.
 {code}
 $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v 
 "INFO" | tr ":" "\n" | awk -F/ '{print $NF}' | sort > my-classpath
 $ git checkout apache/master
 $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v 
 "INFO" | tr ":" "\n" | awk -F/ '{print $NF}' | sort > master-classpath
 $ diff my-classpath master-classpath
 < chill-java-0.3.6.jar
 < chill_2.10-0.3.6.jar
 ---
 > chill-java-0.5.0.jar
 > chill_2.10-0.5.0.jar
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-4925:


Thanks for bringing this up. Actually, I realized this wasn't fixed by some of 
the other work we did. The issue is that we never published hive-thriftserver 
before (so simply undoing the changes I made didn't make this work for 
hive-thriftserver). We just need to add the -Phive-thriftserver profile here:

https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L122

If someone wants to send a patch I can merge it, and we can fix it for 1.3.1.

 Publish Spark SQL hive-thriftserver maven artifact 
 ---

 Key: SPARK-4925
 URL: https://issues.apache.org/jira/browse/SPARK-4925
 Project: Spark
  Issue Type: Improvement
  Components: Build, SQL
Affects Versions: 1.3.0, 1.2.1
Reporter: Alex Liu

 The hive-thriftserver maven artifact is needed for integrating Spark SQL with 
 Cassandra.
 Can we publish it to maven?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact

2015-03-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4925:
---
Target Version/s: 1.3.1

 Publish Spark SQL hive-thriftserver maven artifact 
 ---

 Key: SPARK-4925
 URL: https://issues.apache.org/jira/browse/SPARK-4925
 Project: Spark
  Issue Type: Improvement
  Components: Build, SQL
Affects Versions: 1.3.0, 1.2.1
Reporter: Alex Liu
Priority: Critical

 The hive-thriftserver maven artifact is needed for integrating Spark SQL with 
 Cassandra.
 Can we publish it to maven?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


