[jira] [Created] (SPARK-1380) Add sort-merge based cogroup/joins.

2014-04-01 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-1380:


 Summary: Add sort-merge based cogroup/joins.
 Key: SPARK-1380
 URL: https://issues.apache.org/jira/browse/SPARK-1380
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Takuya Ueshin


I've written cogroup/joins based on the 'Sort-Merge' algorithm.
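For illustration, here is a minimal, standalone sketch of the sort-merge join
idea over two key-sorted sequences (this is not the code in the pull request;
all names are illustrative):

{code}
import scala.collection.mutable.ArrayBuffer

object SortMergeJoinSketch {
  // Inner join of two sequences that are already sorted by key.
  def sortMergeJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)])
                            (implicit ord: Ordering[K]): Seq[(K, (V, W))] = {
    val out = ArrayBuffer.empty[(K, (V, W))]
    var i = 0
    var j = 0
    while (i < left.length && j < right.length) {
      val cmp = ord.compare(left(i)._1, right(j)._1)
      if (cmp < 0) i += 1        // left key is smaller: advance the left side
      else if (cmp > 0) j += 1   // right key is smaller: advance the right side
      else {
        // Keys match: emit the cross product of the two equal-key runs.
        val key = left(i)._1
        val iEnd = left.indexWhere(p => ord.compare(p._1, key) != 0, i) match {
          case -1 => left.length
          case n  => n
        }
        val jEnd = right.indexWhere(p => ord.compare(p._1, key) != 0, j) match {
          case -1 => right.length
          case n  => n
        }
        for (a <- i until iEnd; b <- j until jEnd)
          out += ((key, (left(a)._2, right(b)._2)))
        i = iEnd
        j = jEnd
      }
    }
    out.toSeq
  }

  def main(args: Array[String]): Unit = {
    val l = Seq(1 -> "a", 1 -> "b", 2 -> "c").sortBy(_._1)
    val r = Seq(1 -> "x", 3 -> "y").sortBy(_._1)
    println(sortMergeJoin(l, r)) // key 1 matches: (1,(a,x)), (1,(b,x))
  }
}
{code}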



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1380) Add sort-merge based cogroup/joins.

2014-04-01 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956313#comment-13956313
 ] 

Takuya Ueshin commented on SPARK-1380:
--

Pull-requested: https://github.com/apache/spark/pull/283

> Add sort-merge based cogroup/joins.
> ---
>
> Key: SPARK-1380
> URL: https://issues.apache.org/jira/browse/SPARK-1380
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Takuya Ueshin
>
> I've written cogroup/joins based on the 'Sort-Merge' algorithm.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1381) Spark to Shark direct streaming

2014-04-01 Thread Abhishek Tripathi (JIRA)
Abhishek Tripathi created SPARK-1381:


 Summary: Spark to Shark direct streaming
 Key: SPARK-1381
 URL: https://issues.apache.org/jira/browse/SPARK-1381
 Project: Spark
  Issue Type: Question
  Components: Documentation, Examples, Input/Output, Java API, Spark 
Core
Affects Versions: 0.8.1
Reporter: Abhishek Tripathi
Priority: Blocker


Hi,
I'm trying to push data coming from Spark Streaming into a Shark cached table.
I thought of using the JDBC API, but Shark (0.8.1) does not support a direct 
insert statement, i.e. "insert into emp values(2, "Apia")".
I don't want to store the Spark Streaming output in HDFS and then copy that 
data into the Shark table.
Can somebody please help with:
1. How can I point Spark Streaming data directly at a Shark table/cached table? 
Alternatively, how can Shark pick up data directly from Spark Streaming?
2. Does Shark 0.8.1 have a direct insert statement that doesn't reference 
another table?

This is really blocking our further use of Spark. We need your assistance 
urgently.

Thanks in advance.
Abhishek




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1382) NullPointerException when calling DStream.slice() before StreamingContext.start()

2014-04-01 Thread Alessandro Chacón (JIRA)
Alessandro Chacón created SPARK-1382:


 Summary: NullPointerException when calling DStream.slice() before 
StreamingContext.start()
 Key: SPARK-1382
 URL: https://issues.apache.org/jira/browse/SPARK-1382
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9.0
Reporter: Alessandro Chacón
Priority: Minor


If DStream.slice() is called before StreamingContext.start(), zeroTime is still 
null and the call throws a NullPointerException. 
Ideally, it should throw something more descriptive, like a 
"ContextNotInitialized" exception.
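For illustration, a minimal sketch of the kind of guard the report suggests; 
the class and field names below are made up and are not the actual DStream 
internals:

{code}
// Illustrative guard only; not the actual DStream implementation.
class StreamingContextNotStartedException(msg: String)
  extends IllegalStateException(msg)

class SliceableStream[T](private var zeroTime: Option[Long]) {
  def slice(fromTime: Long, toTime: Long): Seq[T] = {
    // Fail fast with a descriptive error instead of a NullPointerException
    // when the owning StreamingContext has not been started yet.
    val zero = zeroTime.getOrElse(
      throw new StreamingContextNotStartedException(
        "slice() called before StreamingContext.start(): zeroTime is not set"))
    // ... look up the batches falling in [fromTime, toTime] relative to zero ...
    Seq.empty
  }
}
{code}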



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1378) Build error: org.eclipse.paho:mqtt-client

2014-04-01 Thread Ken Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956536#comment-13956536
 ] 

Ken Williams commented on SPARK-1378:
-

I just checked out master and attempted a build again, and got the same failure.

I'm behind a corporate firewall, but we don't use a proxy so I don't think that 
should be an issue.

I'll try the PR25 patch, but I couldn't help noticing that @srowen didn't seem 
to like the approach.

> Build error: org.eclipse.paho:mqtt-client
> -
>
> Key: SPARK-1378
> URL: https://issues.apache.org/jira/browse/SPARK-1378
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Ken Williams
>
> Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. I 
> attempt like so:
> {code}
> mvn -U -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests package
> {code}
> The Maven error is:
> {code}
> [ERROR] Failed to execute goal on project spark-examples_2.10: Could not 
> resolve dependencies for project 
> org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find 
> artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus
> {code}
> My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4.
> Is there an additional Maven repository I should add or something?
> If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and 
> {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, 
> but I would really like to get the examples working because I haven't played 
> with Spark before.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1378) Build error: org.eclipse.paho:mqtt-client

2014-04-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956551#comment-13956551
 ] 

Sean Owen commented on SPARK-1378:
--

In that case, it seemed that it was clearly a proxy issue since adding proxy 
settings resolved the problem. That much is fine but not something to put in 
the general build. It also wouldn't help to add another repo... er, well, it's 
better to simply configure for access to the standard repos.

Try building the examples module (that's where it fails, right?) with "-X" to 
output a lot more debug info. It is probably saying somewhere in there why it 
can't access the repos, and there may be a clue there. For example, do you have 
anything that would proxy an HTTPS connection? That could break access.

Also use "-U" to make sure it is not caching lookup failures from previous runs.

If you have a home machine or can try from home that might also rule out 
network issues.

> Build error: org.eclipse.paho:mqtt-client
> -
>
> Key: SPARK-1378
> URL: https://issues.apache.org/jira/browse/SPARK-1378
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Ken Williams
>
> Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. I 
> attempt like so:
> {code}
> mvn -U -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests package
> {code}
> The Maven error is:
> {code}
> [ERROR] Failed to execute goal on project spark-examples_2.10: Could not 
> resolve dependencies for project 
> org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find 
> artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus
> {code}
> My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4.
> Is there an additional Maven repository I should add or something?
> If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and 
> {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, 
> but I would really like to get the examples working because I haven't played 
> with Spark before.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1383) Spark-SQL: ParquetRelation improvements

2014-04-01 Thread Andre Schumacher (JIRA)
Andre Schumacher created SPARK-1383:
---

 Summary: Spark-SQL: ParquetRelation improvements
 Key: SPARK-1383
 URL: https://issues.apache.org/jira/browse/SPARK-1383
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Andre Schumacher


Improve Spark-SQL's ParquetRelation as follows:
- Instead of files, a ParquetRelation should be backed by a directory, which 
simplifies importing data from other sources
- The InsertIntoParquetTable operation should support switching between 
overwriting and appending (at least in HiveQL)
- Tests should use the new API
- Parquet logging should be forwarded to Log4J
- It should be possible to enable compression (default compression for Parquet 
files: GZIP, as in parquet-mr)
- OverwriteCatalog should support dropping of tables





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1383) Spark-SQL: ParquetRelation improvements

2014-04-01 Thread Andre Schumacher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andre Schumacher reassigned SPARK-1383:
---

Assignee: Andre Schumacher

> Spark-SQL: ParquetRelation improvements
> ---
>
> Key: SPARK-1383
> URL: https://issues.apache.org/jira/browse/SPARK-1383
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Andre Schumacher
>Assignee: Andre Schumacher
>
> Improve Spark-SQL's ParquetRelation as follows:
> - Instead of files, a ParquetRelation should be backed by a directory, 
> which simplifies importing data from other sources
> - The InsertIntoParquetTable operation should support switching between 
> overwriting and appending (at least in HiveQL)
> - Tests should use the new API
> - Parquet logging should be forwarded to Log4J
> - It should be possible to enable compression (default compression for 
> Parquet files: GZIP, as in parquet-mr)
> - OverwriteCatalog should support dropping of tables



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1378) Build error: org.eclipse.paho:mqtt-client

2014-04-01 Thread Ken Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956576#comment-13956576
 ] 

Ken Williams edited comment on SPARK-1378 at 4/1/14 2:47 PM:
-

I commented over on that PR - the patch no longer applies, and it looks like 
the repo is now defined in the main {{pom.xml}} anyway.

For now I have a private branch against {{v0.9.0-incubating}} where I'm 
commenting out the 2 mentions of MQTT in the pom files, and deleting the 
{{examples/src/main/scala/org/apache/spark/streaming/examples/MQTTWordCount.scala}}
 file.  The build goes forward fine with these changes.


was (Author: kenahoo):
I commented over on that PR - the patch no longer applies, and it looks like 
the repo is now defined in the main `pom.xml` anyway.

For now I have a private branch against `v0.9.0-incubating` where I'm 
commenting out the 2 mentions of MQTT in the pom files, and deleting the 
`examples/src/main/scala/org/apache/spark/streaming/examples/MQTTWordCount.scala`
 file.  The build goes forward fine with these changes.

> Build error: org.eclipse.paho:mqtt-client
> -
>
> Key: SPARK-1378
> URL: https://issues.apache.org/jira/browse/SPARK-1378
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Ken Williams
>
> Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. I 
> attempt like so:
> {code}
> mvn -U -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests package
> {code}
> The Maven error is:
> {code}
> [ERROR] Failed to execute goal on project spark-examples_2.10: Could not 
> resolve dependencies for project 
> org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find 
> artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus
> {code}
> My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4.
> Is there an additional Maven repository I should add or something?
> If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and 
> {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, 
> but I would really like to get the examples working because I haven't played 
> with Spark before.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1378) Build error: org.eclipse.paho:mqtt-client

2014-04-01 Thread Ken Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956576#comment-13956576
 ] 

Ken Williams commented on SPARK-1378:
-

I commented over on that PR - the patch no longer applies, and it looks like 
the repo is now defined in the main `pom.xml` anyway.

For now I have a private branch against `v0.9.0-incubating` where I'm 
commenting out the 2 mentions of MQTT in the pom files, and deleting the 
`examples/src/main/scala/org/apache/spark/streaming/examples/MQTTWordCount.scala`
 file.  The build goes forward fine with these changes.

> Build error: org.eclipse.paho:mqtt-client
> -
>
> Key: SPARK-1378
> URL: https://issues.apache.org/jira/browse/SPARK-1378
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Ken Williams
>
> Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. I 
> attempt like so:
> {code}
> mvn -U -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests package
> {code}
> The Maven error is:
> {code}
> [ERROR] Failed to execute goal on project spark-examples_2.10: Could not 
> resolve dependencies for project 
> org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find 
> artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus
> {code}
> My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4.
> Is there an additional Maven repository I should add or something?
> If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and 
> {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, 
> but I would really like to get the examples working because I haven't played 
> with Spark before.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1378) Build error: org.eclipse.paho:mqtt-client

2014-04-01 Thread Ken Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956589#comment-13956589
 ] 

Ken Williams commented on SPARK-1378:
-

Thanks Sean - for me the failure is actually in the "Spark Project External 
MQTT" phase, and then it skips the "Spark Project Examples" phase because of 
the failure.

I'm indeed using the {{-U}} flag with my build, and just to be sure I nuked 
{{~/.m2/repository/org/eclipse/paho/mqtt-client}} before running.

I ran {{mvn -X -U -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 
-DskipTests --projects external/mqtt package}} and got reams of output - let me 
parse through that to see if I can spot the problem; if not, I'll stick it in a 
Gist and ask for help.

> Build error: org.eclipse.paho:mqtt-client
> -
>
> Key: SPARK-1378
> URL: https://issues.apache.org/jira/browse/SPARK-1378
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Ken Williams
>
> Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. I 
> attempt like so:
> {code}
> mvn -U -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests package
> {code}
> The Maven error is:
> {code}
> [ERROR] Failed to execute goal on project spark-examples_2.10: Could not 
> resolve dependencies for project 
> org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find 
> artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus
> {code}
> My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4.
> Is there an additional Maven repository I should add or something?
> If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and 
> {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, 
> but I would really like to get the examples working because I haven't played 
> with Spark before.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1367) NPE when joining Parquet Relations

2014-04-01 Thread Andre Schumacher (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956604#comment-13956604
 ] 

Andre Schumacher commented on SPARK-1367:
-

I believe this issue can be closed due to
https://github.com/apache/spark/commit/2861b07bb030f72769f5b757b4a7d4a635807140
?

> NPE when joining Parquet Relations
> --
>
> Key: SPARK-1367
> URL: https://issues.apache.org/jira/browse/SPARK-1367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Andre Schumacher
>Priority: Blocker
> Fix For: 1.0.0
>
>
> {code}
>   test("self-join parquet files") {
> val x = ParquetTestData.testData.subquery('x)
> val y = ParquetTestData.testData.newInstance.subquery('y)
> val query = x.join(y).where("x.myint".attr === "y.myint".attr)
> query.collect()
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Joe Schaefer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956607#comment-13956607
 ] 

Joe Schaefer commented on SPARK-1355:
-

It just looks funny that a cutting edge project like Spark should rely on a 
vanilla cookie-cutter blog-site generator like jekyll to manage its website 
assets.  Go for broke and grasp the brass ring- bring your website technology 
to new levels with the Apache CMS!

> Switch website to the Apache CMS
> 
>
> Key: SPARK-1355
> URL: https://issues.apache.org/jira/browse/SPARK-1355
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Joe Schaefer
>
> Jekyll is ancient history useful for small blogger sites and little else.  
> Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
> .md files and interfaces with pygments for code highlighting.  Thrift 
> recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956642#comment-13956642
 ] 

Mark Hamstra commented on SPARK-1355:
-

Resources are limited as we progress toward our 1.0 release.  I can't see 
reallocating those commitments just to avoid looking funny in the estimation of 
some observers.  If someone not otherwise occupied wants to contribute the work 
to convert to Apache CMS, that's another thing.

> Switch website to the Apache CMS
> 
>
> Key: SPARK-1355
> URL: https://issues.apache.org/jira/browse/SPARK-1355
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Joe Schaefer
>
> Jekyll is ancient history useful for small blogger sites and little else.  
> Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
> .md files and interfaces with pygments for code highlighting.  Thrift 
> recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1378) Build error: org.eclipse.paho:mqtt-client

2014-04-01 Thread Ken Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Williams resolved SPARK-1378.
-

Resolution: Not a Problem

> Build error: org.eclipse.paho:mqtt-client
> -
>
> Key: SPARK-1378
> URL: https://issues.apache.org/jira/browse/SPARK-1378
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Ken Williams
>
> Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. I 
> attempt like so:
> {code}
> mvn -U -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests package
> {code}
> The Maven error is:
> {code}
> [ERROR] Failed to execute goal on project spark-examples_2.10: Could not 
> resolve dependencies for project 
> org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find 
> artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus
> {code}
> My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4.
> Is there an additional Maven repository I should add or something?
> If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and 
> {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, 
> but I would really like to get the examples working because I haven't played 
> with Spark before.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1378) Build error: org.eclipse.paho:mqtt-client

2014-04-01 Thread Ken Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956656#comment-13956656
 ] 

Ken Williams commented on SPARK-1378:
-

Significant update - I tried moving my {{~/.m2/settings.xml}} aside to take it 
out of the equation, and everything worked.  So it looks like the problem is 
local for me too; chalk it up to my inexperience with Maven, I guess.

Thanks.

> Build error: org.eclipse.paho:mqtt-client
> -
>
> Key: SPARK-1378
> URL: https://issues.apache.org/jira/browse/SPARK-1378
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Ken Williams
>
> Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. I 
> attempt like so:
> {code}
> mvn -U -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests package
> {code}
> The Maven error is:
> {code}
> [ERROR] Failed to execute goal on project spark-examples_2.10: Could not 
> resolve dependencies for project 
> org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find 
> artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus
> {code}
> My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4.
> Is there an additional Maven repository I should add or something?
> If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and 
> {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, 
> but I would really like to get the examples working because I haven't played 
> with Spark before.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Joe Schaefer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956683#comment-13956683
 ] 

Joe Schaefer commented on SPARK-1355:
-

Nonsense - you have plenty of time; you just lack the appropriate 
prioritization for this task, which should be marked "Critical", as we are 
trying to help you help yourselves.  Do yourselves a solid and get it done this 
week to avoid further embarrassment, mkay?

> Switch website to the Apache CMS
> 
>
> Key: SPARK-1355
> URL: https://issues.apache.org/jira/browse/SPARK-1355
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Joe Schaefer
>
> Jekyll is ancient history useful for small blogger sites and little else.  
> Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
> .md files and interfaces with pygments for code highlighting.  Thrift 
> recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1378) Build error: org.eclipse.paho:mqtt-client

2014-04-01 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956685#comment-13956685
 ] 

Mark Hamstra commented on SPARK-1378:
-

A tip: It can be useful to create a file called something like 
~/.m2/empty-settings.xml that contains nothing but an empty {{<settings/>}} 
element.  Then you can test a build with no local settings interference via 
'mvn -s ~/.m2/empty-settings.xml ...'

> Build error: org.eclipse.paho:mqtt-client
> -
>
> Key: SPARK-1378
> URL: https://issues.apache.org/jira/browse/SPARK-1378
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Ken Williams
>
> Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. I 
> attempt like so:
> {code}
> mvn -U -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests package
> {code}
> The Maven error is:
> {code}
> [ERROR] Failed to execute goal on project spark-examples_2.10: Could not 
> resolve dependencies for project 
> org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find 
> artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus
> {code}
> My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4.
> Is there an additional Maven repository I should add or something?
> If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and 
> {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, 
> but I would really like to get the examples working because I haven't played 
> with Spark before.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956692#comment-13956692
 ] 

Mark Hamstra commented on SPARK-1355:
-

That looked more like an insult than a contribution.

> Switch website to the Apache CMS
> 
>
> Key: SPARK-1355
> URL: https://issues.apache.org/jira/browse/SPARK-1355
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Joe Schaefer
>
> Jekyll is ancient history useful for small blogger sites and little else.  
> Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
> .md files and interfaces with pygments for code highlighting.  Thrift 
> recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Joe Schaefer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956697#comment-13956697
 ] 

Joe Schaefer commented on SPARK-1355:
-

Look again- it's free advice!

> Switch website to the Apache CMS
> 
>
> Key: SPARK-1355
> URL: https://issues.apache.org/jira/browse/SPARK-1355
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Joe Schaefer
>
> Jekyll is ancient history useful for small blogger sites and little else.  
> Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
> .md files and interfaces with pygments for code highlighting.  Thrift 
> recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956713#comment-13956713
 ] 

Sean Owen commented on SPARK-1355:
--

Joe, I also can't imagine how this is "Critical". The Apache CMS is a fine 
home-grown tool, but not everyone needs/wants to use it. Your comments sound 
odd from someone I'd imagine has been around a while. I'd suggest being 
constructive, as in any JIRA, by putting out the work -- patch, steps needed, 
details, particular arguments, etc.

> Switch website to the Apache CMS
> 
>
> Key: SPARK-1355
> URL: https://issues.apache.org/jira/browse/SPARK-1355
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Joe Schaefer
>
> Jekyll is ancient history useful for small blogger sites and little else.  
> Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
> .md files and interfaces with pygments for code highlighting.  Thrift 
> recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1378) Build error: org.eclipse.paho:mqtt-client

2014-04-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956716#comment-13956716
 ] 

Sean Owen commented on SPARK-1378:
--

I imagine it is something to do with a proxy setting or repo defined in that 
file. Maybe you can't share them, but if you suspect one in particular was the 
cause, that could be a helpful point of reference for anyone that might 
encounter this later. 

> Build error: org.eclipse.paho:mqtt-client
> -
>
> Key: SPARK-1378
> URL: https://issues.apache.org/jira/browse/SPARK-1378
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Ken Williams
>
> Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. I 
> attempt like so:
> {code}
> mvn -U -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests package
> {code}
> The Maven error is:
> {code}
> [ERROR] Failed to execute goal on project spark-examples_2.10: Could not 
> resolve dependencies for project 
> org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find 
> artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus
> {code}
> My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4.
> Is there an additional Maven repository I should add or something?
> If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and 
> {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, 
> but I would really like to get the examples working because I haven't played 
> with Spark before.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1368) HiveTableScan is slow

2014-04-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-1368:
-

Assignee: Cheng Lian

> HiveTableScan is slow
> -
>
> Key: SPARK-1368
> URL: https://issues.apache.org/jira/browse/SPARK-1368
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
> Fix For: 1.1.0
>
>
> The major issues here are the use of functional programming (.map, .foreach) 
> and the creation of a new Row object for each output tuple. We should switch 
> to while loops in the critical path and a single MutableRow per partition.
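For illustration, a standalone sketch of the pattern described above - a while 
loop over the partition that reuses one mutable row instead of allocating a new 
Row per output tuple.  The MutableRowSketch class below is a stand-in, not 
Spark SQL's actual Row/MutableRow:

{code}
// Stand-in type for illustration; not Spark SQL's actual Row/MutableRow.
class MutableRowSketch(arity: Int) {
  private val values = new Array[Any](arity)
  def update(i: Int, v: Any): Unit = values(i) = v
  def apply(i: Int): Any = values(i)
}

object HiveTableScanSketch {
  // Allocation-heavy version: builds a new row object for every input tuple.
  def projectWithMap(rows: Iterator[Array[Any]], cols: Array[Int]): Iterator[Array[Any]] =
    rows.map(r => cols.map(i => r(i)))

  // Suggested pattern: a while loop in the critical path and a single
  // mutable row reused for every output tuple of the partition.
  def projectWithWhile(rows: Iterator[Array[Any]], cols: Array[Int]): Iterator[MutableRowSketch] =
    new Iterator[MutableRowSketch] {
      private val reused = new MutableRowSketch(cols.length)
      def hasNext: Boolean = rows.hasNext
      def next(): MutableRowSketch = {
        val r = rows.next()
        var i = 0
        while (i < cols.length) {
          reused.update(i, r(cols(i)))
          i += 1
        }
        reused // same object each time; consume it before the next call
      }
    }
}
{code}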



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1374) Python API for running SQL queries

2014-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1374:


Priority: Blocker  (was: Major)

> Python API for running SQL queries
> --
>
> Key: SPARK-1374
> URL: https://issues.apache.org/jira/browse/SPARK-1374
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Ahir Reddy
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1364) DataTypes missing from ScalaReflection

2014-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1364:


Priority: Blocker  (was: Major)

> DataTypes missing from ScalaReflection
> --
>
> Key: SPARK-1364
> URL: https://issues.apache.org/jira/browse/SPARK-1364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Blocker
> Fix For: 1.0.0
>
>
> BigDecimal, possibly others.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1360) Add Timestamp Support

2014-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1360:


Priority: Blocker  (was: Major)

> Add Timestamp Support
> -
>
> Key: SPARK-1360
> URL: https://issues.apache.org/jira/browse/SPARK-1360
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Add Timestamp Support for Catalyst/SQLParser/HiveQl



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1371) HashAggregate should stream tuples and avoid doing an extra count

2014-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1371:


Priority: Blocker  (was: Major)

> HashAggregate should stream tuples and avoid doing an extra count
> -
>
> Key: SPARK-1371
> URL: https://issues.apache.org/jira/browse/SPARK-1371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1236) Update Jetty to 9

2014-04-01 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1236.


Resolution: Fixed

> Update Jetty to 9
> -
>
> Key: SPARK-1236
> URL: https://issues.apache.org/jira/browse/SPARK-1236
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Web UI
>Reporter: Reynold Xin
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.0.0
>
>
> See https://github.com/apache/spark/pull/113



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1373) Compression for In-Memory Columnar storage

2014-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1373:


Fix Version/s: 1.0.0
                   (was: 1.1.0)

> Compression for In-Memory Columnar storage
> --
>
> Key: SPARK-1373
> URL: https://issues.apache.org/jira/browse/SPARK-1373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1367) NPE when joining Parquet Relations

2014-04-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956781#comment-13956781
 ] 

Michael Armbrust commented on SPARK-1367:
-

No, in that commit there is a TODO as the testcase still NPEs.  We still need 
to remove the @transient from ParquetTableScan.  If you don't have time to do 
this I can.

> NPE when joining Parquet Relations
> --
>
> Key: SPARK-1367
> URL: https://issues.apache.org/jira/browse/SPARK-1367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Andre Schumacher
>Priority: Blocker
> Fix For: 1.0.0
>
>
> {code}
>   test("self-join parquet files") {
> val x = ParquetTestData.testData.subquery('x)
> val y = ParquetTestData.testData.newInstance.subquery('y)
> val query = x.join(y).where("x.myint".attr === "y.myint".attr)
> query.collect()
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1373) Compression for In-Memory Columnar storage

2014-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1373:


Priority: Blocker  (was: Major)

> Compression for In-Memory Columnar storage
> --
>
> Key: SPARK-1373
> URL: https://issues.apache.org/jira/browse/SPARK-1373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1364) DataTypes missing from ScalaReflection

2014-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-1364:
---

Assignee: Michael Armbrust

> DataTypes missing from ScalaReflection
> --
>
> Key: SPARK-1364
> URL: https://issues.apache.org/jira/browse/SPARK-1364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.0.0
>
>
> BigDecimal, possibly others.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1384) spark-shell on yarn doesn't always work with secure hdfs

2014-04-01 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-1384:


 Summary: spark-shell on yarn doesn't always work with secure hdfs
 Key: SPARK-1384
 URL: https://issues.apache.org/jira/browse/SPARK-1384
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 0.9.0, 0.9.1
Reporter: Thomas Graves


 I've found an issue with the spark-shell in yarn-client mode in the 0.9.1 rc3 
release.  It doesn't work with secure HDFS unless you 
export SPARK_YARN_MODE=true before starting the shell, or if you happen to do 
something immediately with HDFS.  If you wait for the connection to the 
namenode to time out, it will fail. 

I think it was actually this way in the 0.9 release also, so I thought I would 
send this and get people's feedback to see if you want it fixed. 

Another option would be to document that you have to export 
SPARK_YARN_MODE=true for the shell.   The fix actually went in with the 
authentication changes I made in master, but I never realized that change 
needed to apply to 0.9. 

https://github.com/apache/spark/commit/7edbea41b43e0dc11a2de156be220db8b7952d01#diff-0ae5b834ce90ec37c19af35aa7a5e1a0
See the SparkILoop diff.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1384) spark-shell on yarn on spark 0.9 branch doesn't always work with secure hdfs

2014-04-01 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-1384:
-

Description: 
 I've found an issue with the spark-shell in yarn-client mode in the 0.9.1 rc3 
release.  It doesn't work with secure HDFS unless you 
export SPARK_YARN_MODE=true before starting the shell, or if you happen to do 
something immediately with HDFS.  If you wait for the connection to the 
namenode to timeout it will fail. 
 
The fix actually went in to master branch  with the authentication changes I 
made in master but I never realized that change needed to apply to 0.9. 

https://github.com/apache/spark/commit/7edbea41b43e0dc11a2de156be220db8b7952d01#diff-0ae5b834ce90ec37c19af35aa7a5e1a0
See the SparkILoop diff.


  was:
 I've found an issue with the spark-shell in yarn-client mode in the 0.9.1 rc3 
release.  It doesn't work with secure HDFS unless you 
export SPARK_YARN_MODE=true before starting the shell, or if you happen to do 
something immediately with HDFS.  If you wait for the connection to the 
namenode to timeout it will fail. 

I think it was actually this way in the 0.9 release also so I thought I would 
send this and get peoples feedback to see if you want it fixed? 

Another option would be to document that you have to export 
SPARK_YARN_MODE=true for the shell.   The fix actually went in with the 
authentication changes I made in master but I never realized that change needed 
to apply to 0.9. 

https://github.com/apache/spark/commit/7edbea41b43e0dc11a2de156be220db8b7952d01#diff-0ae5b834ce90ec37c19af35aa7a5e1a0
See the SparkILoop diff.



> spark-shell on yarn on spark 0.9 branch doesn't always work with secure hdfs
> 
>
> Key: SPARK-1384
> URL: https://issues.apache.org/jira/browse/SPARK-1384
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 0.9.0, 0.9.1
>Reporter: Thomas Graves
>
>  I've found an issue with the spark-shell in yarn-client mode in the 0.9.1 
> rc3 release.  It doesn't work with secure HDFS unless you 
> export SPARK_YARN_MODE=true before starting the shell, or if you happen to do 
> something immediately with HDFS.  If you wait for the connection to the 
> namenode to timeout it will fail. 
>  
> The fix actually went in to master branch  with the authentication changes I 
> made in master but I never realized that change needed to apply to 0.9. 
> https://github.com/apache/spark/commit/7edbea41b43e0dc11a2de156be220db8b7952d01#diff-0ae5b834ce90ec37c19af35aa7a5e1a0
> See the SparkILoop diff.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1384) spark-shell on yarn on spark 0.9 branch doesn't always work with secure hdfs

2014-04-01 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-1384:
-

Summary: spark-shell on yarn on spark 0.9 branch doesn't always work with 
secure hdfs  (was: spark-shell on yarn doesn't always work with secure hdfs)

> spark-shell on yarn on spark 0.9 branch doesn't always work with secure hdfs
> 
>
> Key: SPARK-1384
> URL: https://issues.apache.org/jira/browse/SPARK-1384
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 0.9.0, 0.9.1
>Reporter: Thomas Graves
>
>  I've found an issue with the spark-shell in yarn-client mode in the 0.9.1 
> rc3 release.  It doesn't work with secure HDFS unless you 
> export SPARK_YARN_MODE=true before starting the shell, or if you happen to do 
> something immediately with HDFS.  If you wait for the connection to the 
> namenode to timeout it will fail. 
> I think it was actually this way in the 0.9 release also so I thought I would 
> send this and get peoples feedback to see if you want it fixed? 
> Another option would be to document that you have to export 
> SPARK_YARN_MODE=true for the shell.   The fix actually went in with the 
> authentication changes I made in master but I never realized that change 
> needed to apply to 0.9. 
> https://github.com/apache/spark/commit/7edbea41b43e0dc11a2de156be220db8b7952d01#diff-0ae5b834ce90ec37c19af35aa7a5e1a0
> See the SparkILoop diff.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1385) Use existing code-path for JSON de/serialization of BlockId

2014-04-01 Thread Andrew Or (JIRA)
Andrew Or created SPARK-1385:


 Summary: Use existing code-path for JSON de/serialization of 
BlockId
 Key: SPARK-1385
 URL: https://issues.apache.org/jira/browse/SPARK-1385
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0, 0.9.1
Reporter: Andrew Or
Priority: Minor
 Fix For: 1.0.0


BlockId.scala already takes care of JSON de/serialization by converting a 
BlockId to and from its string name with a regex. This functionality is 
currently duplicated in util/JsonProtocol.scala.
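For illustration, a sketch of the single code path this suggests - the JSON 
layer stores just the BlockId's name and rebuilds it through one regex-based 
factory.  These are stand-in types, not Spark's actual BlockId or JsonProtocol:

{code}
// Stand-in types for illustration only.
sealed trait BlockIdSketch { def name: String }

case class RDDBlockIdSketch(rddId: Int, splitIndex: Int) extends BlockIdSketch {
  def name: String = s"rdd_${rddId}_${splitIndex}"
}

case class ShuffleBlockIdSketch(shuffleId: Int, mapId: Int, reduceId: Int)
    extends BlockIdSketch {
  def name: String = s"shuffle_${shuffleId}_${mapId}_${reduceId}"
}

object BlockIdSketch {
  private val Rdd     = """rdd_(\d+)_(\d+)""".r
  private val Shuffle = """shuffle_(\d+)_(\d+)_(\d+)""".r

  // One regex-based parser; JSON de/serialization can store just `name`
  // and call this instead of duplicating per-subtype handling.
  def apply(name: String): BlockIdSketch = name match {
    case Rdd(rddId, split) => RDDBlockIdSketch(rddId.toInt, split.toInt)
    case Shuffle(s, m, r)  => ShuffleBlockIdSketch(s.toInt, m.toInt, r.toInt)
    case other             => sys.error(s"Unrecognized BlockId: $other")
  }
}
{code}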



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1386) Spark Streaming UI

2014-04-01 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1386:
-

Priority: Blocker  (was: Major)

> Spark Streaming UI
> --
>
> Key: SPARK-1386
> URL: https://issues.apache.org/jira/browse/SPARK-1386
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Tathagata Das
>Priority: Blocker
>
> When debugging Spark Streaming applications it is necessary to monitor 
> certain metrics that are not shown in the Spark application UI. For example, 
> what is average processing time of batches? What is the scheduling delay? Is 
> the system able to process as fast as it is receiving data? How many records 
> I am receiving through my receivers? 
> While the StreamingListener interface introduced in the 0.9 provided some of 
> this information, it could only be accessed programmatically. A UI that shows 
> information specific to the streaming applications is necessary for easier 
> debugging.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1386) Spark Streaming UI

2014-04-01 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-1386:


 Summary: Spark Streaming UI
 Key: SPARK-1386
 URL: https://issues.apache.org/jira/browse/SPARK-1386
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Tathagata Das


When debugging Spark Streaming applications it is necessary to monitor certain 
metrics that are not shown in the Spark application UI. For example, what is 
the average processing time of batches? What is the scheduling delay? Is the 
system able to process as fast as it is receiving data? How many records am I 
receiving through my receivers? 

While the StreamingListener interface introduced in 0.9 provided some of 
this information, it could only be accessed programmatically. A UI that shows 
information specific to streaming applications is necessary for easier 
debugging.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1386) Spark Streaming UI

2014-04-01 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1386:
-

Affects Version/s: 0.9.0

> Spark Streaming UI
> --
>
> Key: SPARK-1386
> URL: https://issues.apache.org/jira/browse/SPARK-1386
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Affects Versions: 0.9.0
>Reporter: Tathagata Das
>Priority: Blocker
>
> When debugging Spark Streaming applications it is necessary to monitor 
> certain metrics that are not shown in the Spark application UI. For example, 
> what is average processing time of batches? What is the scheduling delay? Is 
> the system able to process as fast as it is receiving data? How many records 
> I am receiving through my receivers? 
> While the StreamingListener interface introduced in the 0.9 provided some of 
> this information, it could only be accessed programmatically. A UI that shows 
> information specific to the streaming applications is necessary for easier 
> debugging.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1332) Improve Spark Streaming's Network Receiver and InputDStream API for future stability

2014-04-01 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1332:
-

Priority: Blocker  (was: Critical)

> Improve Spark Streaming's Network Receiver and InputDStream API for future 
> stability
> 
>
> Key: SPARK-1332
> URL: https://issues.apache.org/jira/browse/SPARK-1332
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 0.9.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>
> The current Network Receiver API makes it slightly complicated to write a new 
> receiver, as one needs to create an instance of BlockGenerator as shown in 
> SocketReceiver 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/SocketInputDStream.scala#L51
> Exposing the BlockGenerator interface has made it harder to improve the 
> receiving process. The API of NetworkReceiver (which was not a very stable 
> API anyway) needs to be changed if we are to ensure future stability. 
> Additionally, functions like streamingContext.socketStream that create 
> input streams return DStream objects. That makes it hard to expose 
> functionality (say, rate limits) unique to input dstreams. They should return 
> InputDStream or NetworkInputDStream.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1331) Graceful shutdown of Spark Streaming computation

2014-04-01 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1331:
-

Priority: Blocker  (was: Critical)

> Graceful shutdown of Spark Streaming computation
> 
>
> Key: SPARK-1331
> URL: https://issues.apache.org/jira/browse/SPARK-1331
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 0.9.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>
> The current version of StreamingContext.stop() directly kills all the data 
> receivers (NetworkReceiver) without waiting for the data already received to 
> be persisted and processed. Fixing this requires the following.
> 1. Each receiver, when it gets a stop signal from the driver, should stop 
> receiving, and then wait for all the received data to have been persisted and 
> reported to the driver.
> 2. The driver, after stopping all the receivers, should wait for all the 
> received data to be processed. 
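For illustration, a minimal sketch of that two-phase shutdown (stop and drain 
each receiver, then wait for processing to finish); the trait and class names 
below are made up and are not the actual Spark Streaming types:

{code}
import java.util.concurrent.CountDownLatch

// Illustrative two-phase graceful shutdown; not the actual Spark Streaming code.
trait ReceiverSketch {
  def stopReceiving(): Unit                 // step 1a: stop accepting new data
  def awaitDataPersistedAndReported(): Unit // step 1b: drain already-received data
}

class DriverSketch(receivers: Seq[ReceiverSketch]) {
  private val allBatchesProcessed = new CountDownLatch(1)

  // Called by the batch scheduler once every reported block has been processed.
  def markAllBatchesProcessed(): Unit = allBatchesProcessed.countDown()

  def gracefulStop(): Unit = {
    // Phase 1: each receiver stops and drains the data it already received.
    receivers.foreach { r =>
      r.stopReceiving()
      r.awaitDataPersistedAndReported()
    }
    // Phase 2: the driver waits until all received data has been processed.
    allBatchesProcessed.await()
  }
}
{code}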



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1387) Update build plugins, avoid plugin version warning, centralize versions

2014-04-01 Thread Sean Owen (JIRA)
Sean Owen created SPARK-1387:


 Summary: Update build plugins,  avoid plugin version warning, 
centralize versions
 Key: SPARK-1387
 URL: https://issues.apache.org/jira/browse/SPARK-1387
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 0.9.0
Reporter: Sean Owen
Priority: Minor


Another handful of small build changes to organize and standardize a bit, and 
avoid warnings:

- Update Maven plugin versions for good measure
- Since plugins need maven 3.0.4 already, require it explicitly (<3.0.4 had 
some bugs anyway)
- Use variables to define versions across dependencies where they should move 
in lock step
- ... and make this consistent between Maven/SBT




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1367) NPE when joining Parquet Relations

2014-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-1367:
---

Assignee: Michael Armbrust  (was: Andre Schumacher)

> NPE when joining Parquet Relations
> --
>
> Key: SPARK-1367
> URL: https://issues.apache.org/jira/browse/SPARK-1367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.0.0
>
>
> {code}
>   test("self-join parquet files") {
> val x = ParquetTestData.testData.subquery('x)
> val y = ParquetTestData.testData.newInstance.subquery('y)
> val query = x.join(y).where("x.myint".attr === "y.myint".attr)
> query.collect()
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1347) SHARK error when running in server mode: java.net.BindException: Address already in use

2014-04-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957043#comment-13957043
 ] 

Michael Armbrust commented on SPARK-1347:
-

It looks like the Spark web UI is failing to bind to its port.  Are you 
running another copy of Spark on the same machine, or is something else 
listening on 4040?  If so, you should change the port for the web UI by setting 
"spark.ui.port" in SparkConf.  More details on configuration can be found here: 
http://spark.apache.org/docs/0.9.0/configuration.html
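For illustration, one way to set that property when the SparkContext is created 
(the master, app name, and port 4041 below are arbitrary examples, not values 
from this report):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Move this application's web UI off the default port 4040 so it does not
// collide with another Spark app already listening there.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("shark-server-example")
  .set("spark.ui.port", "4041")
val sc = new SparkContext(conf)
{code}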

> SHARK error when running in server mode: java.net.BindException: Address 
> already in use
> ---
>
> Key: SPARK-1347
> URL: https://issues.apache.org/jira/browse/SPARK-1347
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 0.9.0
>Reporter: Test
>
> Start spark on cluster machine
> then start the shark in server mode using:
> ./bin/shark --service sharkserver 
> Now connect to shark using client as :
> ./bin/shark -h  -p 
> Check the hive.log:
> 2014-03-28 10:24:05,391 WARN  component.AbstractLifeCycle 
> (AbstractLifeCycle.java:setFailed(204)) - FAILED 
> org.eclipse.jetty.server.Server@5eb98063: java.net.BindException: Address 
> already in use
> java.net.BindException: Address already in use
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:444)
>   at sun.nio.ch.Net.bind(Net.java:436)
>   at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>   at 
> org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
>   at 
> org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
>   at 
> org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
>   at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>   at org.eclipse.jetty.server.Server.doStart(Server.java:286)
>   at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>   at 
> org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:118)
>   at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:118)
>   at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:118)
>   at scala.util.Try$.apply(Try.scala:161)
>   at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:118)
>   at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:129)
>   at org.apache.spark.ui.SparkUI.bind(SparkUI.scala:57)
>   at org.apache.spark.SparkContext.(SparkContext.scala:159)
>   at shark.SharkContext.(SharkContext.scala:42)
>   at shark.SharkContext.(SharkContext.scala:61)
>   at shark.SharkEnv$.initWithSharkContext(SharkEnv.scala:81)
>   at shark.SharkEnv$.init(SharkEnv.scala:41)
>   at shark.SharkEnv$.fixUncompatibleConf(SharkEnv.scala:48)
>   at shark.SharkCliDriver$.main(SharkCliDriver.scala:165)
>   at shark.SharkCliDriver.main(SharkCliDriver.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1347) SHARK error when running in server mode: java.net.BindException: Address already in use

2014-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust closed SPARK-1347.
---

Resolution: Won't Fix

> SHARK error when running in server mode: java.net.BindException: Address 
> already in use
> ---
>
> Key: SPARK-1347
> URL: https://issues.apache.org/jira/browse/SPARK-1347
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 0.9.0
>Reporter: Test
>
> Start spark on cluster machine
> then start the shark in server mode using:
> ./bin/shark --service sharkserver 
> Now connect to shark using client as :
> ./bin/shark -h  -p 
> Check the hive.log:
> 2014-03-28 10:24:05,391 WARN  component.AbstractLifeCycle 
> (AbstractLifeCycle.java:setFailed(204)) - FAILED 
> org.eclipse.jetty.server.Server@5eb98063: java.net.BindException: Address 
> already in use
> java.net.BindException: Address already in use
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:444)
>   at sun.nio.ch.Net.bind(Net.java:436)
>   at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>   at 
> org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
>   at 
> org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
>   at 
> org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
>   at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>   at org.eclipse.jetty.server.Server.doStart(Server.java:286)
>   at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>   at 
> org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:118)
>   at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:118)
>   at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:118)
>   at scala.util.Try$.apply(Try.scala:161)
>   at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:118)
>   at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:129)
>   at org.apache.spark.ui.SparkUI.bind(SparkUI.scala:57)
>   at org.apache.spark.SparkContext.(SparkContext.scala:159)
>   at shark.SharkContext.(SharkContext.scala:42)
>   at shark.SharkContext.(SharkContext.scala:61)
>   at shark.SharkEnv$.initWithSharkContext(SharkEnv.scala:81)
>   at shark.SharkEnv$.init(SharkEnv.scala:41)
>   at shark.SharkEnv$.fixUncompatibleConf(SharkEnv.scala:48)
>   at shark.SharkCliDriver$.main(SharkCliDriver.scala:165)
>   at shark.SharkCliDriver.main(SharkCliDriver.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Joe Schaefer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957063#comment-13957063
 ] 

Joe Schaefer commented on SPARK-1355:
-

Arguments abound and patches are pointless; I'm not doing the migration, you are. 
I'm giving this until COB today for someone to bump this to Critical and mean it 
this time.

> Switch website to the Apache CMS
> 
>
> Key: SPARK-1355
> URL: https://issues.apache.org/jira/browse/SPARK-1355
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Joe Schaefer
>
> Jekyll is ancient history useful for small blogger sites and little else.  
> Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
> .md files and interfaces with pygments for code highlighting.  Thrift 
> recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957113#comment-13957113
 ] 

Sean Owen commented on SPARK-1355:
--

April Fools, apparently. Though this was opened on 30 March? 

> Switch website to the Apache CMS
> 
>
> Key: SPARK-1355
> URL: https://issues.apache.org/jira/browse/SPARK-1355
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Joe Schaefer
>
> Jekyll is ancient history useful for small blogger sites and little else.  
> Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
> .md files and interfaces with pygments for code highlighting.  Thrift 
> recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Joe Schaefer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957117#comment-13957117
 ] 

Joe Schaefer commented on SPARK-1355:
-

Meow.  Rome wasn't built in a day...

> Switch website to the Apache CMS
> 
>
> Key: SPARK-1355
> URL: https://issues.apache.org/jira/browse/SPARK-1355
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Joe Schaefer
>
> Jekyll is ancient history useful for small blogger sites and little else.  
> Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
> .md files and interfaces with pygments for code highlighting.  Thrift 
> recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1378) Build error: org.eclipse.paho:mqtt-client

2014-04-01 Thread Ken Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957124#comment-13957124
 ] 

Ken Williams commented on SPARK-1378:
-

What resolved it on our end was either running without a local MVN repo (moving 
the {{~/.m2/settings.xml}} out of the way) or adding the mqtt-repo 
(https://repo.eclipse.org/content/repositories/paho-releases) to our set of 
mirrors.

> Build error: org.eclipse.paho:mqtt-client
> -
>
> Key: SPARK-1378
> URL: https://issues.apache.org/jira/browse/SPARK-1378
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Ken Williams
>
> Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. I 
> attempt the build like so:
> {code}
> mvn -U -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests package
> {code}
> The Maven error is:
> {code}
> [ERROR] Failed to execute goal on project spark-examples_2.10: Could not 
> resolve dependencies for project 
> org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find 
> artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus
> {code}
> My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4.
> Is there an additional Maven repository I should add or something?
> If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and 
> {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, 
> but I would really like to get the examples working because I haven't played 
> with Spark before.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1355) Switch website to the Apache CMS

2014-04-01 Thread Joe Schaefer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe Schaefer closed SPARK-1355.
---

Resolution: Invalid

> Switch website to the Apache CMS
> 
>
> Key: SPARK-1355
> URL: https://issues.apache.org/jira/browse/SPARK-1355
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Joe Schaefer
>
> Jekyll is ancient history useful for small blogger sites and little else.  
> Why not upgrade to the Apache CMS?  It supports the same on-disk format for 
> .md files and interfaces with pygments for code highlighting.  Thrift 
> recently switched from nanoc to the CMS and loves it!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1388) ConcurrentModificationException in hadoop_common exposed by Spark

2014-04-01 Thread Nishkam Ravi (JIRA)
Nishkam Ravi created SPARK-1388:
---

 Summary: ConcurrentModificationException in hadoop_common exposed 
by Spark
 Key: SPARK-1388
 URL: https://issues.apache.org/jira/browse/SPARK-1388
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Nishkam Ravi


The following exception occurs non-deterministically:

java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
at java.util.HashMap$KeyIterator.next(HashMap.java:960)
at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
at java.util.HashSet.(HashSet.java:117)
at org.apache.hadoop.conf.Configuration.(Configuration.java:671)
at org.apache.hadoop.mapred.JobConf.(JobConf.java:439)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:154)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1388) ConcurrentModificationException in hadoop_common exposed by Spark

2014-04-01 Thread Nishkam Ravi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishkam Ravi updated SPARK-1388:


Attachment: Conf_Spark.patch

> ConcurrentModificationException in hadoop_common exposed by Spark
> -
>
> Key: SPARK-1388
> URL: https://issues.apache.org/jira/browse/SPARK-1388
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Nishkam Ravi
> Attachments: Conf_Spark.patch
>
>
> The following exception occurs non-deterministically:
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
> at java.util.HashMap$KeyIterator.next(HashMap.java:960)
> at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
> at java.util.HashSet.(HashSet.java:117)
> at org.apache.hadoop.conf.Configuration.(Configuration.java:671)
> at org.apache.hadoop.mapred.JobConf.(JobConf.java:439)
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:154)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
> at org.apache.spark.scheduler.Task.run(Task.scala:53)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1388) ConcurrentModificationException in hadoop_common exposed by Spark

2014-04-01 Thread Nishkam Ravi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishkam Ravi updated SPARK-1388:


Attachment: nravi_Conf_Spark-1388.patch

> ConcurrentModificationException in hadoop_common exposed by Spark
> -
>
> Key: SPARK-1388
> URL: https://issues.apache.org/jira/browse/SPARK-1388
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Nishkam Ravi
> Attachments: nravi_Conf_Spark-1388.patch
>
>
> The following exception occurs non-deterministically:
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
> at java.util.HashMap$KeyIterator.next(HashMap.java:960)
> at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
> at java.util.HashSet.(HashSet.java:117)
> at org.apache.hadoop.conf.Configuration.(Configuration.java:671)
> at org.apache.hadoop.mapred.JobConf.(JobConf.java:439)
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:154)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
> at org.apache.spark.scheduler.Task.run(Task.scala:53)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1388) ConcurrentModificationException in hadoop_common exposed by Spark

2014-04-01 Thread Nishkam Ravi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishkam Ravi updated SPARK-1388:


Attachment: (was: Conf_Spark.patch)

> ConcurrentModificationException in hadoop_common exposed by Spark
> -
>
> Key: SPARK-1388
> URL: https://issues.apache.org/jira/browse/SPARK-1388
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Nishkam Ravi
> Attachments: nravi_Conf_Spark-1388.patch
>
>
> The following exception occurs non-deterministically:
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
> at java.util.HashMap$KeyIterator.next(HashMap.java:960)
> at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
> at java.util.HashSet.(HashSet.java:117)
> at org.apache.hadoop.conf.Configuration.(Configuration.java:671)
> at org.apache.hadoop.mapred.JobConf.(JobConf.java:439)
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:154)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
> at org.apache.spark.scheduler.Task.run(Task.scala:53)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1388) ConcurrentModificationException in hadoop_common exposed by Spark

2014-04-01 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957231#comment-13957231
 ] 

Nishkam Ravi commented on SPARK-1388:
-

Here is a simple fix for this issue (patch attached). Verified with mvn 
compile, mvn test and mvn install. 
This issue may be identical to SPARK-1097. 

> ConcurrentModificationException in hadoop_common exposed by Spark
> -
>
> Key: SPARK-1388
> URL: https://issues.apache.org/jira/browse/SPARK-1388
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Nishkam Ravi
> Attachments: nravi_Conf_Spark-1388.patch
>
>
> The following exception occurs non-deterministically:
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
> at java.util.HashMap$KeyIterator.next(HashMap.java:960)
> at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
> at java.util.HashSet.(HashSet.java:117)
> at org.apache.hadoop.conf.Configuration.(Configuration.java:671)
> at org.apache.hadoop.mapred.JobConf.(JobConf.java:439)
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:154)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
> at org.apache.spark.scheduler.Task.run(Task.scala:53)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1097) ConcurrentModificationException

2014-04-01 Thread Nishkam Ravi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishkam Ravi updated SPARK-1097:


Attachment: nravi_Conf_Spark-1388.patch

> ConcurrentModificationException
> ---
>
> Key: SPARK-1097
> URL: https://issues.apache.org/jira/browse/SPARK-1097
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Fabrizio Milo
> Attachments: nravi_Conf_Spark-1388.patch
>
>
> {noformat}
> 14/02/16 08:18:45 WARN TaskSetManager: Loss was due to 
> java.util.ConcurrentModificationException
> java.util.ConcurrentModificationException
>   at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
>   at java.util.HashMap$KeyIterator.next(HashMap.java:960)
>   at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
>   at java.util.HashSet.(HashSet.java:117)
>   at org.apache.hadoop.conf.Configuration.(Configuration.java:554)
>   at org.apache.hadoop.mapred.JobConf.(JobConf.java:439)
>   at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:154)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:32)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:72)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>   at org.apache.spark.scheduler.Task.run(Task.scala:53)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1097) ConcurrentModificationException

2014-04-01 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957233#comment-13957233
 ] 

Nishkam Ravi commented on SPARK-1097:
-

Attached is a patch for this issue. Verified with mvn test/compile/install. 

> ConcurrentModificationException
> ---
>
> Key: SPARK-1097
> URL: https://issues.apache.org/jira/browse/SPARK-1097
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Fabrizio Milo
> Attachments: nravi_Conf_Spark-1388.patch
>
>
> {noformat}
> 14/02/16 08:18:45 WARN TaskSetManager: Loss was due to 
> java.util.ConcurrentModificationException
> java.util.ConcurrentModificationException
>   at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
>   at java.util.HashMap$KeyIterator.next(HashMap.java:960)
>   at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
>   at java.util.HashSet.(HashSet.java:117)
>   at org.apache.hadoop.conf.Configuration.(Configuration.java:554)
>   at org.apache.hadoop.mapred.JobConf.(JobConf.java:439)
>   at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:154)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:32)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:72)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>   at org.apache.spark.scheduler.Task.run(Task.scala:53)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1097) ConcurrentModificationException

2014-04-01 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957233#comment-13957233
 ] 

Nishkam Ravi edited comment on SPARK-1097 at 4/2/14 1:30 AM:
-

Attached is a patch for this issue. Verified with mvn test/compile/install. The 
fix is to move the HashSet initialization into the synchronized block right above it.


was (Author: nravi):
Attached is a patch for this issue. Verified with mvn test/compile/install. 

> ConcurrentModificationException
> ---
>
> Key: SPARK-1097
> URL: https://issues.apache.org/jira/browse/SPARK-1097
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Fabrizio Milo
> Attachments: nravi_Conf_Spark-1388.patch
>
>
> {noformat}
> 14/02/16 08:18:45 WARN TaskSetManager: Loss was due to 
> java.util.ConcurrentModificationException
> java.util.ConcurrentModificationException
>   at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
>   at java.util.HashMap$KeyIterator.next(HashMap.java:960)
>   at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
>   at java.util.HashSet.(HashSet.java:117)
>   at org.apache.hadoop.conf.Configuration.(Configuration.java:554)
>   at org.apache.hadoop.mapred.JobConf.(JobConf.java:439)
>   at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:154)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:32)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:72)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>   at org.apache.spark.scheduler.Task.run(Task.scala:53)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1271) Use Iterator[X] in co-group and group-by signatures

2014-04-01 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957293#comment-13957293
 ] 

holdenk commented on SPARK-1271:


https://github.com/apache/spark/pull/242

> Use Iterator[X] in co-group and group-by signatures
> ---
>
> Key: SPARK-1271
> URL: https://issues.apache.org/jira/browse/SPARK-1271
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>
> This API change will allow us to externalize these things down the road.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-939) Allow user jars to take precedence over Spark jars, if desired

2014-04-01 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957294#comment-13957294
 ] 

holdenk commented on SPARK-939:
---

https://github.com/apache/spark/pull/217

> Allow user jars to take precedence over Spark jars, if desired
> --
>
> Key: SPARK-939
> URL: https://issues.apache.org/jira/browse/SPARK-939
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: holdenk
>Priority: Blocker
>  Labels: starter
> Fix For: 1.0.0
>
>
> Sometimes a user may want to include their own version of a jar that Spark 
> itself uses, for example if their code requires a newer version of that jar 
> than Spark offers. It would be good to have an option to give the user's 
> dependencies precedence over Spark's. This option should be disabled by 
> default, since it could lead to some odd behavior (e.g. parts of Spark not 
> working), but I think we should have it.
> From an implementation perspective, this would require modifying the way we 
> do class loading inside an Executor. The default behavior of the 
> URLClassLoader is to delegate to its parent first and, if that fails, to 
> find the class locally. We want the opposite behavior. This is 
> sometimes referred to as "parent-last" (as opposed to "parent-first") class 
> loading precedence. There is an example of how to do this here:
> http://stackoverflow.com/questions/5445511/how-do-i-create-a-parent-last-child-first-classloader-in-java-or-how-to-overr
> We should write a similar class which can encapsulate a URL classloader and 
> change the delegation order. Or if possible, maybe we could find a more 
> elegant way to do this. See relevant discussion on the user list here:
> https://groups.google.com/forum/#!topic/spark-users/b278DW3e38g
> Also see the corresponding option in Hadoop:
> https://issues.apache.org/jira/browse/MAPREDUCE-4521
> Some other relevant Hadoop JIRAs:
> https://issues.apache.org/jira/browse/MAPREDUCE-1700
> https://issues.apache.org/jira/browse/MAPREDUCE-1938
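
For illustration only (this is not the code in the linked pull request, and the
class name is made up), a parent-last classloader along the lines described above
could look roughly like this:

{code}
import java.net.{URL, URLClassLoader}

// Illustrative "parent-last" classloader: look in the given URLs first and
// only delegate to the parent when the class is not found there. A real
// implementation would likely still delegate java.* / scala.* classes to the
// parent unconditionally.
class ChildFirstURLClassLoader(urls: Array[URL], parent: ClassLoader)
  extends URLClassLoader(urls, parent) {

  override def loadClass(name: String, resolve: Boolean): Class[_] = {
    var c = findLoadedClass(name)
    if (c == null) {
      c = try {
        findClass(name)                  // user jars win ("parent-last")
      } catch {
        case _: ClassNotFoundException =>
          super.loadClass(name, resolve) // fall back to normal delegation
      }
    }
    if (resolve) resolveClass(c)
    c
  }
}
{code}

When the option is enabled, an executor could then wrap the user-supplied jar
URLs with something like new ChildFirstURLClassLoader(userJars,
getClass.getClassLoader), where userJars is whatever list of URLs the executor
already tracks.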



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1388) ConcurrentModificationException in hadoop_common exposed by Spark

2014-04-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957330#comment-13957330
 ] 

Sean Owen commented on SPARK-1388:
--

Yes this should be resolved as a duplicate instead.

> ConcurrentModificationException in hadoop_common exposed by Spark
> -
>
> Key: SPARK-1388
> URL: https://issues.apache.org/jira/browse/SPARK-1388
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Nishkam Ravi
> Attachments: nravi_Conf_Spark-1388.patch
>
>
> The following exception occurs non-deterministically:
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
> at java.util.HashMap$KeyIterator.next(HashMap.java:960)
> at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
> at java.util.HashSet.(HashSet.java:117)
> at org.apache.hadoop.conf.Configuration.(Configuration.java:671)
> at org.apache.hadoop.mapred.JobConf.(JobConf.java:439)
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:154)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
> at org.apache.spark.scheduler.Task.run(Task.scala:53)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1097) ConcurrentModificationException

2014-04-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957335#comment-13957335
 ] 

Sean Owen commented on SPARK-1097:
--

Standard procedure is to provide a pull request. But you're suggesting a fix 
to Hadoop code, which belongs in your Hadoop JIRA, yes. This can't fix the 
problem from the Spark end.

(Eyeballing the Hadoop 2.2.0 code, I tend to agree with your patch. Mutation of 
finalParameters appears consistently synchronized, which means the constructor 
reading it to copy has to lock on the other Configuration or else exactly this 
can happen.)

Would a workaround in Spark be to synchronize on the Configuration object when 
calling this constructor? (I smell a deadlock risk.)
Or something crazy like retrying the constructor until it doesn't fail this way?
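
For what it's worth, a minimal sketch of that Spark-side workaround (the object
and method names below are made up for illustration, and the deadlock caveat
above still applies) would be:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

object ConfLocking {
  // Hold the lock on the shared Configuration while the copying JobConf
  // constructor iterates over it, so a concurrent writer cannot mutate it
  // mid-copy. Whether this is deadlock-safe depends on Hadoop's own locking.
  def newJobConfLocked(conf: Configuration): JobConf =
    conf.synchronized {
      new JobConf(conf)
    }
}
{code}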

> ConcurrentModificationException
> ---
>
> Key: SPARK-1097
> URL: https://issues.apache.org/jira/browse/SPARK-1097
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Fabrizio Milo
> Attachments: nravi_Conf_Spark-1388.patch
>
>
> {noformat}
> 14/02/16 08:18:45 WARN TaskSetManager: Loss was due to 
> java.util.ConcurrentModificationException
> java.util.ConcurrentModificationException
>   at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
>   at java.util.HashMap$KeyIterator.next(HashMap.java:960)
>   at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
>   at java.util.HashSet.(HashSet.java:117)
>   at org.apache.hadoop.conf.Configuration.(Configuration.java:554)
>   at org.apache.hadoop.mapred.JobConf.(JobConf.java:439)
>   at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:154)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:32)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:72)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>   at org.apache.spark.scheduler.Task.run(Task.scala:53)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1097) ConcurrentModificationException

2014-04-01 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957354#comment-13957354
 ] 

Nishkam Ravi commented on SPARK-1097:
-

The problem should be solved at the root. This issue can be exposed by other 
systems as well, in addition to Spark. The fix is straightforward and harmless. 
I can initiate a pull request as well.

> ConcurrentModificationException
> ---
>
> Key: SPARK-1097
> URL: https://issues.apache.org/jira/browse/SPARK-1097
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Fabrizio Milo
> Attachments: nravi_Conf_Spark-1388.patch
>
>
> {noformat}
> 14/02/16 08:18:45 WARN TaskSetManager: Loss was due to 
> java.util.ConcurrentModificationException
> java.util.ConcurrentModificationException
>   at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
>   at java.util.HashMap$KeyIterator.next(HashMap.java:960)
>   at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
>   at java.util.HashSet.(HashSet.java:117)
>   at org.apache.hadoop.conf.Configuration.(Configuration.java:554)
>   at org.apache.hadoop.mapred.JobConf.(JobConf.java:439)
>   at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:154)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:32)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:72)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>   at org.apache.spark.scheduler.Task.run(Task.scala:53)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1097) ConcurrentModificationException

2014-04-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957355#comment-13957355
 ] 

Sean Owen commented on SPARK-1097:
--

I agree, but we can't patch Hadoop from here. I'm just saying that for the 
purposes of a SPARK-* issue, in anything like the short term, one would have to 
propose a workaround within Spark code, if anything, while also trying to fix 
it at the root, separately, in a HADOOP-* issue. (Spark does not have a copy of 
Hadoop, yesterday's April Fools joke aside.)

> ConcurrentModificationException
> ---
>
> Key: SPARK-1097
> URL: https://issues.apache.org/jira/browse/SPARK-1097
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Fabrizio Milo
> Attachments: nravi_Conf_Spark-1388.patch
>
>
> {noformat}
> 14/02/16 08:18:45 WARN TaskSetManager: Loss was due to 
> java.util.ConcurrentModificationException
> java.util.ConcurrentModificationException
>   at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
>   at java.util.HashMap$KeyIterator.next(HashMap.java:960)
>   at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
>   at java.util.HashSet.(HashSet.java:117)
>   at org.apache.hadoop.conf.Configuration.(Configuration.java:554)
>   at org.apache.hadoop.mapred.JobConf.(JobConf.java:439)
>   at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:154)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:32)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:72)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>   at org.apache.spark.scheduler.Task.run(Task.scala:53)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1389) Make numPartitions in Exchange configurable

2014-04-01 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-1389:
---

 Summary: Make numPartitions in Exchange configurable
 Key: SPARK-1389
 URL: https://issues.apache.org/jira/browse/SPARK-1389
 Project: Spark
  Issue Type: Improvement
Reporter: Michael Armbrust
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1390) Refactor RDD backed matrices

2014-04-01 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1390:


 Summary: Refactor RDD backed matrices
 Key: SPARK-1390
 URL: https://issues.apache.org/jira/browse/SPARK-1390
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Blocker


The current interfaces of RDD-backed matrices need refactoring for the v1.0 
release. It would be better to have a clear separation between local matrices and 
those backed by RDDs. Right now, we have 

1. org.apache.spark.mllib.linalg.SparseMatrix, which is a wrapper over an RDD 
of matrix entries, i.e., coordinate list format.
2. org.apache.spark.mllib.linalg.TallSkinnyDenseMatrix, which is a wrapper over 
RDD[Array[Double]], i.e. row-oriented format.

We will see a naming collision when we introduce a local SparseMatrix, and the name 
TallSkinnyDenseMatrix is no longer accurate if we switch to RDD[Vector] instead of 
RDD[Array[Double]]. It would be better to have "RDD" in the type names to 
signal that operations will trigger a job.

The proposed names (all under org.apache.spark.mllib.linalg.rdd):

1. RDDMatrix: trait for matrices backed by one or more RDDs
2. CoordinateRDDMatrix: wrapper of RDD[RDDMatrixEntry]
3. RowRDDMatrix: wrapper of RDD[Vector] whose rows do not have special ordering
4. IndexedRowRDDMatrix: wrapper of RDD[(Long, Vector)] whose rows are 
associated with indices

The proposal is subject to change, but it would be nice to make the changes 
before v1.0.
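
A rough sketch of what the proposed hierarchy might look like (the type names
follow the list above, but the members and signatures are placeholders for
illustration, not part of the proposal):

{code}
// Proposed to live under org.apache.spark.mllib.linalg.rdd; illustrative only.
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector

case class RDDMatrixEntry(i: Long, j: Long, value: Double)

// 1. Trait for matrices backed by one or more RDDs.
trait RDDMatrix {
  def numRows(): Long
  def numCols(): Long
}

// 2. Coordinate list (COO) format.
class CoordinateRDDMatrix(val entries: RDD[RDDMatrixEntry]) extends RDDMatrix {
  def numRows(): Long = entries.map(_.i).reduce(math.max) + 1  // triggers a job
  def numCols(): Long = entries.map(_.j).reduce(math.max) + 1  // triggers a job
}

// 3. Rows without any special ordering.
class RowRDDMatrix(val rows: RDD[Vector]) extends RDDMatrix {
  def numRows(): Long = rows.count()                           // triggers a job
  def numCols(): Long = rows.first().size.toLong
}

// 4. Rows associated with explicit indices.
class IndexedRowRDDMatrix(val rows: RDD[(Long, Vector)]) extends RDDMatrix {
  def numRows(): Long = rows.map(_._1).reduce(math.max) + 1    // triggers a job
  def numCols(): Long = rows.first()._2.size.toLong
}
{code}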



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size

2014-04-01 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-1391:


 Summary: BlockManager cannot transfer blocks larger than 2G in size
 Key: SPARK-1391
 URL: https://issues.apache.org/jira/browse/SPARK-1391
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Shuffle
Affects Versions: 1.0.0
Reporter: Shivaram Venkataraman


If a task tries to remotely access a cached RDD block, I get an exception when 
the block size is > 2G. The exception is pasted below.

Memory capacities are huge these days (> 60G), and many workflows depend on 
having large blocks in memory, so it would be good to fix this bug.

I don't know if the same thing happens on shuffles if one transfer (from mapper 
to reducer) is > 2G.
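
For context, the 2G ceiling is presumably the JVM itself: arrays and ByteBuffers
are indexed by Int, so any block copied into a single byte array tops out just
under 2 GiB, which is what the trace below runs into. A quick illustration
(values are examples only):

{code}
// JVM arrays and ByteBuffers hold at most Int.MaxValue bytes (just under
// 2 GiB), however large the heap is.
val maxSingleArrayBytes: Long = Int.MaxValue.toLong      // 2147483647
val cachedBlockBytes: Long = 3L * 1024 * 1024 * 1024     // e.g. a 3 GB block
assert(cachedBlockBytes > maxSingleArrayBytes)           // one array can't hold it
{code}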

14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer 
message
java.lang.ArrayIndexOutOfBoundsException
at 
it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
at 
it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
at 
it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
at 
org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
at 
org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
at 
org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
at 
org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
at 
org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
at 
org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
at 
org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at 
org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
at 
org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
at 
org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661)
at 
org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1371) HashAggregate should stream tuples and avoid doing an extra count

2014-04-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957365#comment-13957365
 ] 

Michael Armbrust commented on SPARK-1371:
-

https://github.com/apache/spark/pull/295

> HashAggregate should stream tuples and avoid doing an extra count
> -
>
> Key: SPARK-1371
> URL: https://issues.apache.org/jira/browse/SPARK-1371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1372) Expose in-memory columnar caching for tables.

2014-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-1372.
-

Resolution: Fixed

> Expose in-memory columnar caching for tables.
> -
>
> Key: SPARK-1372
> URL: https://issues.apache.org/jira/browse/SPARK-1372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1373) Compression for In-Memory Columnar storage

2014-04-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957370#comment-13957370
 ] 

Michael Armbrust commented on SPARK-1373:
-

https://github.com/apache/spark/pull/285

> Compression for In-Memory Columnar storage
> --
>
> Key: SPARK-1373
> URL: https://issues.apache.org/jira/browse/SPARK-1373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1372) Expose in-memory columnar caching for tables.

2014-04-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957367#comment-13957367
 ] 

Michael Armbrust commented on SPARK-1372:
-

https://github.com/apache/spark/pull/282

> Expose in-memory columnar caching for tables.
> -
>
> Key: SPARK-1372
> URL: https://issues.apache.org/jira/browse/SPARK-1372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1364) DataTypes missing from ScalaReflection

2014-04-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957366#comment-13957366
 ] 

Michael Armbrust commented on SPARK-1364:
-

https://github.com/apache/spark/pull/293

> DataTypes missing from ScalaReflection
> --
>
> Key: SPARK-1364
> URL: https://issues.apache.org/jira/browse/SPARK-1364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.0.0
>
>
> BigDecimal, possibly others.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings

2014-04-01 Thread Pat McDonough (JIRA)
Pat McDonough created SPARK-1392:


 Summary: Local spark-shell Runs Out of Memory With Default Settings
 Key: SPARK-1392
 URL: https://issues.apache.org/jira/browse/SPARK-1392
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
 Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3
Reporter: Pat McDonough


When running the spark-shell locally in out of the box configuration, and 
attempting to cache all the attached data, spark OOMs with: 
{{java.lang.OutOfMemoryError: GC overhead limit exceeded}}

{code}
val explore = 
sc.textFile("/Users/pat/Projects/training-materials/Data/wiki_links")
explore.cache
explore.count
{code}

You can work around the issue by either decreasing 
{{spark.storage.memoryFraction}} or increasing {{SPARK_MEM}}
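
For illustration only, a minimal sketch of the two workarounds mentioned above, assuming
Spark 0.9 conventions (the {{4g}} and {{0.4}} values are arbitrary examples, not tuned
recommendations):

{code}
// Workaround 1 (hypothetical value): give the shell a larger heap before
// launching; Spark 0.9's launch scripts read the SPARK_MEM environment variable.
//   SPARK_MEM=4g ./bin/spark-shell

// Workaround 2: shrink the fraction of the heap reserved for cached blocks.
// In a standalone program the property must be set before the SparkContext is
// created, since SparkConf picks up spark.* system properties at that point.
System.setProperty("spark.storage.memoryFraction", "0.4")
val sc = new org.apache.spark.SparkContext("local", "oom-workaround-sketch")
{code}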





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings

2014-04-01 Thread Pat McDonough (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat McDonough updated SPARK-1392:
-

Description: 
Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
the spark-shell locally in out of the box configuration, and attempting to 
cache all the attached data, spark OOMs with: {{java.lang.OutOfMemoryError: GC 
overhead limit exceeded}}

{code}
val explore = 
sc.textFile("/Users/pat/Projects/training-materials/Data/wiki_links")
explore.cache
explore.count
{code}

You can work around the issue by either decreasing 
{{spark.storage.memoryFraction}} or increasing {{SPARK_MEM}}



  was:
When running the spark-shell locally in out of the box configuration, and 
attempting to cache all the attached data, spark OOMs with: 
{{java.lang.OutOfMemoryError: GC overhead limit exceeded}}

{code}
val explore = 
sc.textFile("/Users/pat/Projects/training-materials/Data/wiki_links")
explore.cache
explore.count
{code}

You can work around the issue by either decreasing 
{{spark.storage.memoryFraction}} or increasing {{SPARK_MEM}}




> Local spark-shell Runs Out of Memory With Default Settings
> --
>
> Key: SPARK-1392
> URL: https://issues.apache.org/jira/browse/SPARK-1392
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
> Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3
>Reporter: Pat McDonough
>
> Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
> the spark-shell locally in out of the box configuration, and attempting to 
> cache all the attached data, spark OOMs with: {{java.lang.OutOfMemoryError: 
> GC overhead limit exceeded}}
> {code}
> val explore = 
> sc.textFile("/Users/pat/Projects/training-materials/Data/wiki_links")
> explore.cache
> explore.count
> {code}
> You can work around the issue by either decreasing 
> {{spark.storage.memoryFraction}} or increasing {{SPARK_MEM}}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings

2014-04-01 Thread Pat McDonough (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat McDonough updated SPARK-1392:
-

Description: 
Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
the spark-shell locally in out of the box configuration, and attempting to 
cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC 
overhead limit exceeded

You can work around the issue by either decreasing 
{{spark.storage.memoryFraction}} or increasing {{SPARK_MEM}}

  was:
Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
the spark-shell locally in out of the box configuration, and attempting to 
cache all the attached data, spark OOMs with: {{java.lang.OutOfMemoryError: GC 
overhead limit exceeded}}

{code}
val explore = 
sc.textFile("/Users/pat/Projects/training-materials/Data/wiki_links")
explore.cache
explore.count
{code}

You can work around the issue by either decreasing 
{{spark.storage.memoryFraction}} or increasing {{SPARK_MEM}}




> Local spark-shell Runs Out of Memory With Default Settings
> --
>
> Key: SPARK-1392
> URL: https://issues.apache.org/jira/browse/SPARK-1392
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
> Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3
>Reporter: Pat McDonough
>
> Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
> the spark-shell locally in out of the box configuration, and attempting to 
> cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC 
> overhead limit exceeded
> You can work around the issue by either decreasing 
> {{spark.storage.memoryFraction}} or increasing {{SPARK_MEM}}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings

2014-04-01 Thread Pat McDonough (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat McDonough updated SPARK-1392:
-

Description: 
Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
the spark-shell locally in out of the box configuration, and attempting to 
cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC 
overhead limit exceeded

You can work around the issue by either decreasing spark.storage.memoryFraction 
or increasing SPARK_MEM

  was:
Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
the spark-shell locally in out of the box configuration, and attempting to 
cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC 
overhead limit exceeded

You can work around the issue by either decreasing 
{{spark.storage.memoryFraction}} or increasing {{SPARK_MEM}}


> Local spark-shell Runs Out of Memory With Default Settings
> --
>
> Key: SPARK-1392
> URL: https://issues.apache.org/jira/browse/SPARK-1392
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
> Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3
>Reporter: Pat McDonough
>
> Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
> the spark-shell locally in out of the box configuration, and attempting to 
> cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC 
> overhead limit exceeded
> You can work around the issue by either decreasing 
> spark.storage.memoryFraction or increasing SPARK_MEM



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings

2014-04-01 Thread Pat McDonough (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957379#comment-13957379
 ] 

Pat McDonough commented on SPARK-1392:
--

Running the following with the attached data results in the errors below:
{code}
scala> val explore = 
sc.textFile("/Users/pat/Projects/training-materials/Data/wiki_links")
...
scala> explore.cache
res1: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
...
scala> explore.count
...
14/04/01 22:52:48 INFO HadoopRDD: Input split: file:/Users/pat/Projects/training-materials/Data/wiki_links/part-7:0+25009430
14/04/01 22:52:54 INFO MemoryStore: ensureFreeSpace(55520836) called with curMem=271402430, maxMem=309225062
14/04/01 22:52:54 INFO MemoryStore: Will not store rdd_1_7 as it would require dropping another block from the same RDD
14/04/01 22:52:54 INFO BlockManager: Dropping block rdd_1_7 from memory
14/04/01 22:52:54 WARN BlockManager: Block rdd_1_7 could not be dropped from memory as it does not exist
14/04/01 22:52:54 INFO BlockManagerMaster: Updated info of block rdd_1_7
14/04/01 22:52:54 INFO BlockManagerMaster: Updated info of block rdd_1_7
14/04/01 22:52:54 INFO Executor: Serialized size of result for 7 is 563
14/04/01 22:52:54 INFO Executor: Sending result for 7 directly to driver
14/04/01 22:52:54 INFO Executor: Finished task ID 7
14/04/01 22:52:54 INFO TaskSetManager: Starting task 0.0:8 as TID 8 on executor localhost: localhost (PROCESS_LOCAL)
14/04/01 22:52:54 INFO TaskSetManager: Serialized task 0.0:8 as 1606 bytes in 2 ms
14/04/01 22:52:54 INFO Executor: Running task ID 8
14/04/01 22:52:54 INFO TaskSetManager: Finished TID 7 in 6714 ms on localhost (progress: 7/10)
14/04/01 22:52:54 INFO DAGScheduler: Completed ResultTask(0, 7)
14/04/01 22:52:54 INFO BlockManager: Found block broadcast_0 locally
14/04/01 22:52:54 INFO CacheManager: Partition rdd_1_8 not found, computing it
14/04/01 22:52:54 INFO HadoopRDD: Input split: file:/Users/pat/Projects/training-materials/Data/wiki_links/part-8:0+25904930
14/04/01 22:52:59 INFO TaskSetManager: Starting task 0.0:9 as TID 9 on executor localhost: localhost (PROCESS_LOCAL)
14/04/01 22:52:59 ERROR Executor: Exception in task ID 8
{code}


{noformat}
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
at java.nio.CharBuffer.allocate(CharBuffer.java:331)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:777)
at org.apache.hadoop.io.Text.decode(Text.java:405)
at org.apache.hadoop.io.Text.decode(Text.java:382)
at org.apache.hadoop.io.Text.toString(Text.java:280)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:344)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:344)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:75)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{noformat}


> Local spark-shell Runs Out of Memory With Default Settings
> --
>
> Key: SPARK-1392
> URL: https://issues.apache.org/jira/browse/SPARK-1392
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
> Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3
>Reporter: Pat McDonough
>
> Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
> the spark-shell locally in out of the box configuration, and attempting to 
> cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC 
> overhead limit exceeded
> You can work around the issue by either decreasing 
> spark.storage.memoryFraction or increasing SPARK_MEM



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings

2014-04-01 Thread Pat McDonough (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957381#comment-13957381
 ] 

Pat McDonough commented on SPARK-1392:
--

Attachment was too big, so here's a link: 
https://drive.google.com/file/d/0BwrkCxCycBCyTmlWYXp0MmdEakk/edit?usp=sharing

> Local spark-shell Runs Out of Memory With Default Settings
> --
>
> Key: SPARK-1392
> URL: https://issues.apache.org/jira/browse/SPARK-1392
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
> Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3
>Reporter: Pat McDonough
>
> Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
> the spark-shell locally in out of the box configuration, and attempting to 
> cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC 
> overhead limit exceeded
> You can work around the issue by either decreasing 
> spark.storage.memoryFraction or increasing SPARK_MEM



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings

2014-04-01 Thread Pat McDonough (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957386#comment-13957386
 ] 

Pat McDonough commented on SPARK-1392:
--

Tried it with the hadoop-1.0.4 build and there was no OOM

> Local spark-shell Runs Out of Memory With Default Settings
> --
>
> Key: SPARK-1392
> URL: https://issues.apache.org/jira/browse/SPARK-1392
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
> Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3
>Reporter: Pat McDonough
>
> Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
> the spark-shell locally in out of the box configuration, and attempting to 
> cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC 
> overhead limit exceeded
> You can work around the issue by either decreasing 
> spark.storage.memoryFraction or increasing SPARK_MEM



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1363) Add streaming support for Spark SQL module

2014-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1363:


Assignee: (was: Michael Armbrust)

> Add streaming support for Spark SQL module
> --
>
> Key: SPARK-1363
> URL: https://issues.apache.org/jira/browse/SPARK-1363
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Saisai Shao
> Attachments: StreamSQLDesignDoc.pdf
>
>
> Currently there exist some projects, such as Pig on Storm and SQL on Storm (Squall, 
> SQLstream), that can query over streaming data, but for Spark Streaming this is 
> still a blank area. It would be a good feature to add streaming-capable SQL to 
> Spark SQL.
> From a semantic perspective, a DStream is quite similar to an RDD: both have join, 
> filter, groupBy and other operators, and a DStream is backed by RDDs, so the 
> existing Spark plans are transplantable and reusable.
> Also, Catalyst has a clear division between steps, so we can fully reuse its parsing 
> and logical-plan analysis steps, with only the physical plan differing.
> So here we propose to add streaming support in Catalyst.
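
To make the parallel concrete, a minimal sketch assuming only the standard Spark
Streaming API (the socket source, port and output are illustrative and not part of
the proposal):

{code}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// A DStream exposes the same transformations an RDD does...
val ssc = new StreamingContext("local[2]", "StreamSQLSketch", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

val errors = lines.filter(_.contains("ERROR"))            // DStream.filter
val counts = errors.map(w => (w, 1)).reduceByKey(_ + _)   // DStream.reduceByKey

// ...and each batch is materialized as an RDD, which is why an existing
// RDD-based physical plan could in principle be re-run per batch.
counts.foreachRDD(rdd => rdd.take(10).foreach(println))

ssc.start()
ssc.awaitTermination()
{code}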



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1363) Add streaming support for Spark SQL module

2014-04-01 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-1363:
---

Assignee: Michael Armbrust

> Add streaming support for Spark SQL module
> --
>
> Key: SPARK-1363
> URL: https://issues.apache.org/jira/browse/SPARK-1363
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Saisai Shao
>Assignee: Michael Armbrust
> Attachments: StreamSQLDesignDoc.pdf
>
>
> Currently there exist some projects, such as Pig on Storm and SQL on Storm (Squall, 
> SQLstream), that can query over streaming data, but for Spark Streaming this is 
> still a blank area. It would be a good feature to add streaming-capable SQL to 
> Spark SQL.
> From a semantic perspective, a DStream is quite similar to an RDD: both have join, 
> filter, groupBy and other operators, and a DStream is backed by RDDs, so the 
> existing Spark plans are transplantable and reusable.
> Also, Catalyst has a clear division between steps, so we can fully reuse its parsing 
> and logical-plan analysis steps, with only the physical plan differing.
> So here we propose to add streaming support in Catalyst.



--
This message was sent by Atlassian JIRA
(v6.2#6252)