[jira] [Commented] (SPARK-2065) Have spark-ec2 set EC2 instance names

2014-06-06 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020740#comment-14020740
 ] 

Reynold Xin commented on SPARK-2065:


This would be really cool. Do you mind submitting a pull request for this?

> Have spark-ec2 set EC2 instance names
> -
>
> Key: SPARK-2065
> URL: https://issues.apache.org/jira/browse/SPARK-2065
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 1.0.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> {{spark-ec2}} launches EC2 instances with no names. It would be nice if it 
> gave each instance it launched a descriptive name.
> I suggest:
> {code}
> spark-{spark-cluster-name}-{master,slave}-{instance-id}
> {code}
> For example, the instances of a Spark cluster called {{prod1}} would have the 
> following names:
> {code}
> spark-prod1-master-i-18a1f548
> spark-prod1-slave-i-01a1f551
> spark-prod1-slave-i-04a1f554
> spark-prod1-slave-i-05a1f555
> spark-prod1-slave-i-06a1f556
> {code}
> Amazon implements instance names as 
> [tags|http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html], so 
> that's what would need to be set for each launched instance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1704) Support EXPLAIN in Spark SQL

2014-06-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1704:
---

Summary: Support EXPLAIN in Spark SQL  (was: java.lang.AssertionError: 
assertion failed: No plan for ExplainCommand (Project [*]))

> Support EXPLAIN in Spark SQL
> 
>
> Key: SPARK-1704
> URL: https://issues.apache.org/jira/browse/SPARK-1704
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
> Environment: linux
>Reporter: Yangjp
>  Labels: sql
> Fix For: 1.1.0
>
>   Original Estimate: 612h
>  Remaining Estimate: 612h
>
> 14/05/03 22:08:40 INFO ParseDriver: Parsing command: explain select * from src
> 14/05/03 22:08:40 INFO ParseDriver: Parse Completed
> 14/05/03 22:08:40 WARN LoggingFilter: EXCEPTION :
> java.lang.AssertionError: assertion failed: No plan for ExplainCommand 
> (Project [*])
> at scala.Predef$.assert(Predef.scala:179)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:263)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:263)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:264)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:264)
> at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:260)
> at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:248)
> at 
> org.apache.spark.sql.hive.api.java.JavaHiveContext.hql(JavaHiveContext.scala:39)
> at 
> org.apache.spark.examples.TimeServerHandler.messageReceived(TimeServerHandler.java:72)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
> at 
> org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407)
> at 
> org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
> at 
> org.apache.mina.filter.logging.LoggingFilter.messageReceived(LoggingFilter.java:208)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
> at 
> org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:410)
> at 
> org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:710)
> at 
> org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:664)
> at 
> org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:653)
> at 
> org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67)
> at 
> org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1124)
> at 
> org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:701)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1841) update scalatest to version 2.1.5

2014-06-06 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020647#comment-14020647
 ] 

Guoqiang Li commented on SPARK-1841:


Well, I'm sorry, thank you

> update scalatest to version 2.1.5
> -
>
> Key: SPARK-1841
> URL: https://issues.apache.org/jira/browse/SPARK-1841
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
> Fix For: 1.1.0
>
>
> scalatest 1.9.* does not support Scala 2.11
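
For illustration only (this is not the contents of the pull request that eventually resolved this, and Spark's build also has a Maven side), an sbt-style bump of the test dependency would look roughly like:
{code}
// Hypothetical build.sbt fragment: move the test dependency to a scalatest
// release that is also published for Scala 2.11.
libraryDependencies += "org.scalatest" %% "scalatest" % "2.1.5" % "test"
{code}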



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2063) Creating a SchemaRDD via sql() does not correctly resolve nested types

2014-06-06 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020641#comment-14020641
 ] 

Michael Armbrust commented on SPARK-2063:
-

This seems to work for me:
{code}
scala> val popularTweets = sql("SELECT retweeted_status.text, 
MAX(retweeted_status.retweet_count) AS s FROM tweetTable WHERE retweeted_status 
is not NULL GROUP BY retweeted_status.text ORDER BY s DESC LIMIT 30") 
popularTweets: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[13] at RDD at SchemaRDD.scala:117
== Query Plan ==
TakeOrdered 30, [s#26:1 DESC]
 Aggregate false, [text#27], [text#27,MAX(PartialMax#30) AS s#26]
  Exchange (HashPartitioning [text#27:0], 150)
   Aggregate true, [retweeted_status#23.text AS text#27], 
[retweeted_status#23.text AS text#27,MAX(retweeted_status#23.retweet_count AS 
retweet_count#29) AS PartialMax#30]
Project [retweeted_status#23:23]
 Filter IS NOT NULL retweeted_status#23:23
  ExistingRdd 
[contributors#0,created_at#1,favorite_count#2,favorited#3,filter_level#4,id#5L,id_str#6,in_reply_to_screen_name#7,in_reply_to_status_id#8L,in_reply_to_status_id_str#9,in_reply_to_user_id#10,in_reply_to_user_id_str#11,lang#12,possibly_sensitive#13,retweet_count#14,retwee...
scala> popularTweets.collect()
res3: Array[org.apache.spark.sql.Row] = Array([Four more years. 
http://t.co/bAJE6Vom,793368], [388726,388726], [Thank you all for helping me 
through this time with your enormous love & support. Cory will forever be 
in my heart. http://t.co/XVlZnh9vOc,388719], [Yesss ! I'm 20 ! Wohooo ! No more 
teens!,369389], [Never been more happy in my life ! Thank you to everyone that 
has been so lovely about my engagement to my beautiful fiancé ! Big love z 
X,320358], [Note to self. Don't 'twerk'.,314666], [Harry wake up !! :D 
http://t.co/cuhD5bC5,311770], [In honor of Kim and Kanye's baby "North West" I 
will be naming my first son "Taco",311575], [ITS NIALL BITCH!! HAHA,283976], 
[what makes you so beautiful is that you dont know how beautiful you are... to 
me,279850], [279578,279578], [applied ...
{code}

> Creating a SchemaRDD via sql() does not correctly resolve nested types
> --
>
> Key: SPARK-2063
> URL: https://issues.apache.org/jira/browse/SPARK-2063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Assignee: Michael Armbrust
>
> For example, from the typical twitter dataset:
> {code}
> scala> val popularTweets = sql("SELECT retweeted_status.text, 
> MAX(retweeted_status.retweet_count) AS s FROM tweets WHERE retweeted_status 
> is not NULL GROUP BY retweeted_status.text ORDER BY s DESC LIMIT 30")
> scala> popularTweets.toString
> 14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch MultiInstanceRelations
> 14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for 
> batch CaseInsensitiveAttributeReferences
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> qualifiers on unresolved object, tree: 'retweeted_status.text
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:51)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:47)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:67)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:65)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:65)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:100)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:97)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:51)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:65)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>

[jira] [Commented] (SPARK-1841) update scalatest to version 2.1.5

2014-06-06 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020635#comment-14020635
 ] 

Patrick Wendell commented on SPARK-1841:


This was not fixed in 1.0.1; it was only fixed in 1.1.0, [~gq], so I'm 
reverting your change.

> update scalatest to version 2.1.5
> -
>
> Key: SPARK-1841
> URL: https://issues.apache.org/jira/browse/SPARK-1841
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
> Fix For: 1.1.0
>
>
> scalatest 1.9.* does not support Scala 2.11



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1841) update scalatest to version 2.1.5

2014-06-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1841:
---

Fix Version/s: (was: 1.0.1)

> update scalatest to version 2.1.5
> -
>
> Key: SPARK-1841
> URL: https://issues.apache.org/jira/browse/SPARK-1841
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
> Fix For: 1.1.0
>
>
> scalatest 1.9.* does not support Scala 2.11



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1994) Weird data corruption bug when running Spark SQL on data in HDFS

2014-06-06 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-1994:
---

Assignee: Michael Armbrust

> Weird data corruption bug when running Spark SQL on data in HDFS
> 
>
> Key: SPARK-1994
> URL: https://issues.apache.org/jira/browse/SPARK-1994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> [~adav] has a full reproduction; he has found a case where the first run 
> returns corrupted results but the second run does not. The same does not 
> occur when reading from HDFS a second time...
> {code}
> sql("SELECT lang, COUNT(*) AS cnt FROM tweetTable GROUP BY lang ORDER BY cnt 
> DESC").collect.foreach(println)
> [bg,16636]
> [16266,16266]
> [16223,16223]
> [16161,16161]
> [16047,16047]
> [lt,11405]
> [hu,11380]
> [el,10845]
> [da,10289]
> [fi,10261]
> [9897,9897]
> [9765,9765]
> [9751,9751]
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1704) java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*])

2014-06-06 Thread Zongheng Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020628#comment-14020628
 ] 

Zongheng Yang commented on SPARK-1704:
--

Github pull request: https://github.com/apache/spark/pull/1003

> java.lang.AssertionError: assertion failed: No plan for ExplainCommand 
> (Project [*])
> 
>
> Key: SPARK-1704
> URL: https://issues.apache.org/jira/browse/SPARK-1704
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
> Environment: linux
>Reporter: Yangjp
>  Labels: sql
> Fix For: 1.1.0
>
>   Original Estimate: 612h
>  Remaining Estimate: 612h
>
> 14/05/03 22:08:40 INFO ParseDriver: Parsing command: explain select * from src
> 14/05/03 22:08:40 INFO ParseDriver: Parse Completed
> 14/05/03 22:08:40 WARN LoggingFilter: EXCEPTION :
> java.lang.AssertionError: assertion failed: No plan for ExplainCommand 
> (Project [*])
> at scala.Predef$.assert(Predef.scala:179)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:263)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:263)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:264)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:264)
> at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:260)
> at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:248)
> at 
> org.apache.spark.sql.hive.api.java.JavaHiveContext.hql(JavaHiveContext.scala:39)
> at 
> org.apache.spark.examples.TimeServerHandler.messageReceived(TimeServerHandler.java:72)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
> at 
> org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407)
> at 
> org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
> at 
> org.apache.mina.filter.logging.LoggingFilter.messageReceived(LoggingFilter.java:208)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
> at 
> org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:410)
> at 
> org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:710)
> at 
> org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:664)
> at 
> org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:653)
> at 
> org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67)
> at 
> org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1124)
> at 
> org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:701)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-1841) update scalatest to version 2.1.5

2014-06-06 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-1841:
---

Fix Version/s: 1.0.1

> update scalatest to version 2.1.5
> -
>
> Key: SPARK-1841
> URL: https://issues.apache.org/jira/browse/SPARK-1841
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
> Fix For: 1.0.1, 1.1.0
>
>
> scalatest 1.9.* does not support Scala 2.11



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2065) Have spark-ec2 set EC2 instance names

2014-06-06 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-2065:
---

 Summary: Have spark-ec2 set EC2 instance names
 Key: SPARK-2065
 URL: https://issues.apache.org/jira/browse/SPARK-2065
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.0.0
Reporter: Nicholas Chammas
Priority: Trivial


{{spark-ec2}} launches EC2 instances with no names. It would be nice if it gave 
each instance it launched a descriptive name.

I suggest:
{code}
spark-{spark-cluster-name}-{master,slave}-{instance-id}
{code}

For example, the instances of a Spark cluster called {{prod1}} would have the 
following names:
{code}
spark-prod1-master-i-18a1f548
spark-prod1-slave-i-01a1f551
spark-prod1-slave-i-04a1f554
spark-prod1-slave-i-05a1f555
spark-prod1-slave-i-06a1f556
{code}

Amazon implements instance names as 
[tags|http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html], so 
that's what would need to be set for each launched instance.
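
spark-ec2 itself is a Python script built on boto, so the following is purely an illustration of the call involved, sketched with the AWS SDK for Java from Scala; the instance ID, region endpoint, and names are placeholders:
{code}
import com.amazonaws.services.ec2.AmazonEC2Client
import com.amazonaws.services.ec2.model.{CreateTagsRequest, Tag}

object TagExample {
  def main(args: Array[String]) {
    // Credentials come from the default provider chain (env vars, etc.).
    val ec2 = new AmazonEC2Client()
    ec2.setEndpoint("ec2.us-east-1.amazonaws.com")

    // EC2 has no separate "name" field; the console's Name column is just
    // the value of the "Name" tag, so that is what gets set here.
    val request = new CreateTagsRequest()
      .withResources("i-18a1f548")
      .withTags(new Tag("Name", "spark-prod1-master-i-18a1f548"))
    ec2.createTags(request)
  }
}
{code}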



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-06 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2064:
--

 Summary: web ui should not remove executors if they are dead
 Key: SPARK-2064
 URL: https://issues.apache.org/jira/browse/SPARK-2064
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


We should always show the list of executors that have ever been connected, and 
add a status column to mark them as dead if they have been disconnected.
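
A minimal sketch of the bookkeeping this implies (names are hypothetical, not the actual UI listener code):
{code}
import scala.collection.mutable

object ExecutorState extends Enumeration {
  val Alive, Dead = Value
}

class ExecutorsTracker {
  // Keep every executor that has ever connected; entries are never removed.
  private val executors = mutable.LinkedHashMap[String, ExecutorState.Value]()

  def onExecutorAdded(execId: String): Unit =
    executors(execId) = ExecutorState.Alive

  // Instead of dropping the row, flip its status so the UI can show it as dead.
  def onExecutorRemoved(execId: String): Unit =
    executors(execId) = ExecutorState.Dead

  def snapshot: Seq[(String, ExecutorState.Value)] = executors.toSeq
}
{code}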




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2015) Spark UI issues at scale

2014-06-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2015:
---

Target Version/s: 1.1.0

> Spark UI issues at scale
> 
>
> Key: SPARK-2015
> URL: https://issues.apache.org/jira/browse/SPARK-2015
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>
> This is an umbrella ticket for issues related to Spark's web ui when we run 
> Spark at scale (large datasets, large number of machines, or large number of 
> tasks).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-06-06 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020460#comment-14020460
 ] 

Xiangrui Meng commented on SPARK-1977:
--

[~smolav] and [~coderxiang]:

Thanks for testing it! Could you post the exact error message you got, along 
with the stack trace? Based on your description, it is likely caused by Kryo's 
default serialization, which may treat BitSet as a generic Java collection and 
then fail during ser/de.
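
For reference, the manual registration mentioned in the issue description below might look roughly like this (a sketch assuming Spark 1.0's KryoRegistrator API; the package, class, and app names are hypothetical):
{code}
package com.example

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Registering the collection type up front keeps Kryo from falling back to
// its generic Java-collection handling for OutLinkBlock's BitSet array.
class BitSetRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[scala.collection.mutable.BitSet])
  }
}

object AlsKryoSetup {
  def sparkConf(): SparkConf = new SparkConf()
    .setAppName("als-with-kryo")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "com.example.BitSetRegistrator")
}
{code}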

> mutable.BitSet in ALS not serializable with KryoSerializer
> --
>
> Key: SPARK-1977
> URL: https://issues.apache.org/jira/browse/SPARK-1977
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Neville Li
>Priority: Minor
>
> OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
> KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
> register mutable.BitSet.
> Right now we have to register mutable.BitSet manually. A proper fix would be 
> using immutable.BitSet in ALS or registering mutable.BitSet in upstream chill.
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
> 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
> com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
> scala.collection.mutable.HashSet
> Serialization trace:
> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> org.apache.spark.scheduler.Task.run(Task.scala:51)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> java.lang.Thread.run(Thread.java:662)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
>   at 
> org

[jira] [Created] (SPARK-2063) Creating a SchemaRDD via sql() does not correctly resolve nested types

2014-06-06 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2063:
-

 Summary: Creating a SchemaRDD via sql() does not correctly resolve 
nested types
 Key: SPARK-2063
 URL: https://issues.apache.org/jira/browse/SPARK-2063
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Michael Armbrust


For example, from the typical twitter dataset:

{code}
scala> val popularTweets = sql("SELECT retweeted_status.text, 
MAX(retweeted_status.retweet_count) AS s FROM tweets WHERE retweeted_status is 
not NULL GROUP BY retweeted_status.text ORDER BY s DESC LIMIT 30")

scala> popularTweets.toString
14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for batch 
MultiInstanceRelations
14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for batch 
CaseInsensitiveAttributeReferences
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
qualifiers on unresolved object, tree: 'retweeted_status.text
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:51)
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:47)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:67)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:65)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:65)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:100)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:97)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:51)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:65)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:64)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:69)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:40)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:97)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:94)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:217)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:94)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:93)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$

[jira] [Updated] (SPARK-2062) VertexRDD.apply does not use the mergeFunc

2014-06-06 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-2062:
--

Component/s: GraphX

> VertexRDD.apply does not use the mergeFunc
> --
>
> Key: SPARK-2062
> URL: https://issues.apache.org/jira/browse/SPARK-2062
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> Here: 
> https://github.com/apache/spark/blob/b1feb60209174433262de2a26d39616ba00edcc8/graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala#L410



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2062) VertexRDD.apply does not use the mergeFunc

2014-06-06 Thread Ankur Dave (JIRA)
Ankur Dave created SPARK-2062:
-

 Summary: VertexRDD.apply does not use the mergeFunc
 Key: SPARK-2062
 URL: https://issues.apache.org/jira/browse/SPARK-2062
 Project: Spark
  Issue Type: Bug
Reporter: Ankur Dave
Assignee: Ankur Dave


Here: 
https://github.com/apache/spark/blob/b1feb60209174433262de2a26d39616ba00edcc8/graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala#L410



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1097) ConcurrentModificationException

2014-06-06 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020353#comment-14020353
 ] 

Nishkam Ravi commented on SPARK-1097:
-

I have initiated a PR for a workaround in Spark as well (for developers using 
Hadoop < 2.4.1):
https://github.com/apache/spark/pull/1000
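
Not the contents of that PR, but the general shape of such a workaround is to funnel every copy of a shared Configuration through a single lock, roughly (names are illustrative):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

object HadoopConfUtil {
  // Configuration's copy constructor is not thread-safe before Hadoop 2.4.1,
  // so serialize all copies of a shared Configuration behind one lock.
  private val confLock = new Object

  def newJobConf(shared: Configuration): JobConf = confLock.synchronized {
    new JobConf(shared)
  }
}
{code}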

> ConcurrentModificationException
> ---
>
> Key: SPARK-1097
> URL: https://issues.apache.org/jira/browse/SPARK-1097
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Fabrizio Milo
> Attachments: nravi_Conf_Spark-1388.patch
>
>
> {noformat}
> 14/02/16 08:18:45 WARN TaskSetManager: Loss was due to 
> java.util.ConcurrentModificationException
> java.util.ConcurrentModificationException
>   at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
>   at java.util.HashMap$KeyIterator.next(HashMap.java:960)
>   at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
>   at java.util.HashSet.<init>(HashSet.java:117)
>   at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:554)
>   at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:439)
>   at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:154)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:32)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:72)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>   at org.apache.spark.scheduler.Task.run(Task.scala:53)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1704) java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*])

2014-06-06 Thread Zongheng Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020340#comment-14020340
 ] 

Zongheng Yang commented on SPARK-1704:
--

Just so I understand the desired output for EXPLAIN: sql("EXPLAIN SELECT * 
FROM src") should return a SchemaRDD which, when .collect() is called, 
returns:

== Query Plan ==
HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None

where each Row corresponds to one line of the description. Does this sound good?
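
A minimal sketch of the proposed usage (assuming a HiveContext built as in the Spark 1.0 docs; the printed plan is illustrative and matches the lines above):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ExplainExample {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("explain-example"))
    val hiveContext = new HiveContext(sc)

    // Under this proposal, EXPLAIN returns a SchemaRDD whose rows are the
    // lines of the plan description instead of tripping the planner assertion.
    val plan = hiveContext.hql("EXPLAIN SELECT * FROM src")
    plan.collect().foreach(row => println(row(0)))
    // Expected output (illustrative):
    // == Query Plan ==
    // HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None
  }
}
{code}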

> java.lang.AssertionError: assertion failed: No plan for ExplainCommand 
> (Project [*])
> 
>
> Key: SPARK-1704
> URL: https://issues.apache.org/jira/browse/SPARK-1704
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
> Environment: linux
>Reporter: Yangjp
>  Labels: sql
> Fix For: 1.1.0
>
>   Original Estimate: 612h
>  Remaining Estimate: 612h
>
> 14/05/03 22:08:40 INFO ParseDriver: Parsing command: explain select * from src
> 14/05/03 22:08:40 INFO ParseDriver: Parse Completed
> 14/05/03 22:08:40 WARN LoggingFilter: EXCEPTION :
> java.lang.AssertionError: assertion failed: No plan for ExplainCommand 
> (Project [*])
> at scala.Predef$.assert(Predef.scala:179)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:263)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:263)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:264)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:264)
> at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:260)
> at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:248)
> at 
> org.apache.spark.sql.hive.api.java.JavaHiveContext.hql(JavaHiveContext.scala:39)
> at 
> org.apache.spark.examples.TimeServerHandler.messageReceived(TimeServerHandler.java:72)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
> at 
> org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407)
> at 
> org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
> at 
> org.apache.mina.filter.logging.LoggingFilter.messageReceived(LoggingFilter.java:208)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
> at 
> org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
> at 
> org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:410)
> at 
> org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:710)
> at 
> org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:664)
> at 
> org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:653)
> at 
> org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67)
> at 
> org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1124)
> at 
> org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnabl

[jira] [Commented] (SPARK-1097) ConcurrentModificationException

2014-06-06 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020336#comment-14020336
 ] 

Tsuyoshi OZAWA commented on SPARK-1097:
---

[~jblomo], thank you for reporting this. The issue is fixed in the next minor 
Hadoop release, 2.4.1. Note that 2.4.0 does not include the fix.

> ConcurrentModificationException
> ---
>
> Key: SPARK-1097
> URL: https://issues.apache.org/jira/browse/SPARK-1097
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Fabrizio Milo
> Attachments: nravi_Conf_Spark-1388.patch
>
>
> {noformat}
> 14/02/16 08:18:45 WARN TaskSetManager: Loss was due to 
> java.util.ConcurrentModificationException
> java.util.ConcurrentModificationException
>   at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
>   at java.util.HashMap$KeyIterator.next(HashMap.java:960)
>   at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
>   at java.util.HashSet.<init>(HashSet.java:117)
>   at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:554)
>   at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:439)
>   at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:154)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:32)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:72)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>   at org.apache.spark.scheduler.Task.run(Task.scala:53)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles

2014-06-06 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020329#comment-14020329
 ] 

Matei Zaharia commented on SPARK-2044:
--

So, BTW, I think what I'll do is move over the current shuffle but without 
MapOutputTracker; then we can open another JIRA to move MapOutputTracker behind 
the hash shuffle implementation.
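
For orientation, the narrow surface described in the design-doc summary below amounts to something like the following toy sketch (all names are hypothetical, not the proposed API):
{code}
// Hypothetical shapes only; the actual proposal is in the attached design doc.
trait ShuffleWriter[K, V] {
  // Map side: consume a map task's output records.
  def write(records: Iterator[Product2[K, V]]): Unit
  def stop(): Unit
}

trait ShuffleReader[K, C] {
  // Reduce side: return an iterator over the (possibly combined) values.
  def read(): Iterator[Product2[K, C]]
}

trait ShuffleManager {
  def getWriter[K, V](shuffleId: Int, mapId: Int): ShuffleWriter[K, V]
  def getReader[K, C](shuffleId: Int,
                      startPartition: Int,
                      endPartition: Int): ShuffleReader[K, C]
}
{code}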

> Pluggable interface for shuffles
> 
>
> Key: SPARK-2044
> URL: https://issues.apache.org/jira/browse/SPARK-2044
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Attachments: Pluggableshuffleproposal.pdf
>
>
> Given that a lot of the current activity in Spark Core is in shuffles, I 
> wanted to propose factoring out shuffle implementations in a way that will 
> make experimentation easier. Ideally we will converge on one implementation, 
> but for a while, this could also be used to have several implementations 
> coexist. I'm suggesting this because I am aware of at least three efforts to 
> look at shuffle (from Yahoo!, Intel and Databricks). Some of the things 
> people are investigating are:
> * Push-based shuffle where data moves directly from mappers to reducers
> * Sorting-based instead of hash-based shuffle, to create fewer files (helps a 
> lot with file handles and memory usage on large shuffles)
> * External spilling within a key
> * Changing the level of parallelism or even algorithm for downstream stages 
> at runtime based on statistics of the map output (this is a thing we had 
> prototyped in the Shark research project but never merged in core)
> I've attached a design doc with a proposed interface. It's not too crazy 
> because the interface between shuffles and the rest of the code is already 
> pretty narrow (just some iterators for reading data and a writer interface 
> for writing it). Bigger changes will be needed in the interaction with 
> DAGScheduler and BlockManager for some of the ideas above, but we can handle 
> those separately, and this interface will allow us to experiment with some 
> short-term stuff sooner.
> If things go well I'd also like to send a sort-based shuffle implementation 
> for 1.1, but we'll see how the timing on that works out.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2061) Deprecate `splits` in JavaRDDLike and add `partitions`

2014-06-06 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-2061:
--

 Summary: Deprecate `splits` in JavaRDDLike and add `partitions`
 Key: SPARK-2061
 URL: https://issues.apache.org/jira/browse/SPARK-2061
 Project: Spark
  Issue Type: Bug
  Components: Java API
Reporter: Patrick Wendell
Priority: Minor


Most of Spark has moved over to consistently using `partitions` instead of 
`splits`. We should do likewise and add a `partitions` method to JavaRDDLike 
and have `splits` just call that. We should also go through all cases where 
other APIs (e.g. Python) call `splits` and change those to use the 
newer API.
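
An illustrative shape for the change (not the actual JavaRDDLike source; names and the deprecation version are assumptions):
{code}
import java.util.{List => JList}
import org.apache.spark.Partition

trait JavaRDDLikeSketch {
  // New, preferred name.
  def partitions: JList[Partition]

  // Kept for source compatibility; simply delegates to the new method.
  @deprecated("Use partitions() instead", "1.1.0")
  def splits: JList[Partition] = partitions
}
{code}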



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1388) ConcurrentModificationException in hadoop_common exposed by Spark

2014-06-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1388.


Resolution: Duplicate

> ConcurrentModificationException in hadoop_common exposed by Spark
> -
>
> Key: SPARK-1388
> URL: https://issues.apache.org/jira/browse/SPARK-1388
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Nishkam Ravi
> Attachments: nravi_Conf_Spark-1388.patch
>
>
> The following exception occurs non-deterministically:
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
> at java.util.HashMap$KeyIterator.next(HashMap.java:960)
> at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
> at java.util.HashSet.<init>(HashSet.java:117)
> at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:671)
> at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:439)
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:154)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
> at org.apache.spark.scheduler.Task.run(Task.scala:53)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1388) ConcurrentModificationException in hadoop_common exposed by Spark

2014-06-06 Thread Jim Blomo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020279#comment-14020279
 ] 

Jim Blomo commented on SPARK-1388:
--

+1 for resolving

> ConcurrentModificationException in hadoop_common exposed by Spark
> -
>
> Key: SPARK-1388
> URL: https://issues.apache.org/jira/browse/SPARK-1388
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Nishkam Ravi
> Attachments: nravi_Conf_Spark-1388.patch
>
>
> The following exception occurs non-deterministically:
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
> at java.util.HashMap$KeyIterator.next(HashMap.java:960)
> at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
> at java.util.HashSet.<init>(HashSet.java:117)
> at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:671)
> at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:439)
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:154)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
> at org.apache.spark.scheduler.Task.run(Task.scala:53)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1097) ConcurrentModificationException

2014-06-06 Thread Jim Blomo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020278#comment-14020278
 ] 

Jim Blomo commented on SPARK-1097:
--

FYI, still seeing this on Spark 1.0, Hadoop 2.4.

{code:java}
java.util.ConcurrentModificationException 
(java.util.ConcurrentModificationException)
java.util.HashMap$HashIterator.nextEntry(HashMap.java:922)
java.util.HashMap$KeyIterator.next(HashMap.java:956)
java.util.AbstractCollection.addAll(AbstractCollection.java:341)
java.util.HashSet.<init>(HashSet.java:117)
org.apache.hadoop.conf.Configuration.<init>(Configuration.java:671)
com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:98)
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2402)
org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2436)
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2418)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:107)
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:190)
org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:181)
org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:93)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
{code}

> ConcurrentModificationException
> ---
>
> Key: SPARK-1097
> URL: https://issues.apache.org/jira/browse/SPARK-1097
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Fabrizio Milo
> Attachments: nravi_Conf_Spark-1388.patch
>
>
> {noformat}
> 14/02/16 08:18:45 WARN TaskSetManager: Loss was due to 
> java.util.ConcurrentModificationException
> java.util.ConcurrentModificationException
>   at java.util.HashMap$HashIterator.nextEntry(HashMap.java:926)
>   at java.util.HashMap$KeyIterator.next(HashMap.java:960)
>   at java.util.AbstractCollection.addAll(AbstractCollection.java:341)
>   at java.util.HashSet.<init>(HashSet.java:117)
>   at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:554)
>   at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:439)
>   at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:110)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:154)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:32)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:72)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>   at org.apache.spark.scheduler.Task.run(Task.scala:53)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>   at 
> java.util.concurre

[jira] [Commented] (SPARK-1841) update scalatest to version 2.1.5

2014-06-06 Thread Bernardo Gomez Palacio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020211#comment-14020211
 ] 

Bernardo Gomez Palacio commented on SPARK-1841:
---

[~gq] thank you for addressing this!

> update scalatest to version 2.1.5
> -
>
> Key: SPARK-1841
> URL: https://issues.apache.org/jira/browse/SPARK-1841
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
> Fix For: 1.1.0
>
>
> scalatest 1.9.* does not support Scala 2.11
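
For context, a minimal sbt-style sketch of the kind of dependency bump involved; 
the actual change went through Spark's build files, so treat the setting below as 
illustrative:

{code}
// Illustrative sbt setting: move the test dependency off scalatest 1.9.x,
// which has no Scala 2.11 build, onto 2.1.5.
libraryDependencies += "org.scalatest" %% "scalatest" % "2.1.5" % "test"
{code}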



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1841) update scalatest to version 2.1.5

2014-06-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1841.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 713
[https://github.com/apache/spark/pull/713]

> update scalatest to version 2.1.5
> -
>
> Key: SPARK-1841
> URL: https://issues.apache.org/jira/browse/SPARK-1841
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
> Fix For: 1.1.0
>
>
> scalatest 1.9.* does not support Scala 2.11



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2060) Querying JSON Datasets with SQL and DSL in Spark SQL

2014-06-06 Thread Yin Huai (JIRA)
Yin Huai created SPARK-2060:
---

 Summary: Querying JSON Datasets with SQL and DSL in Spark SQL
 Key: SPARK-2060
 URL: https://issues.apache.org/jira/browse/SPARK-2060
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yin Huai






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2059) Unresolved Attributes should cause a failure before execution time

2014-06-06 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-2059:
---

 Summary: Unresolved Attributes should cause a failure before 
execution time
 Key: SPARK-2059
 URL: https://issues.apache.org/jira/browse/SPARK-2059
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust


Here's a partial solution: https://github.com/marmbrus/spark/tree/analysisChecks
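
A minimal sketch of the kind of pre-execution check being proposed, assuming 
Catalyst's LogicalPlan/Expression API; the method name and placement are 
illustrative, not the linked branch's actual code:

{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Illustrative sketch: walk the analyzed plan and fail fast if anything is still
// unresolved, rather than letting planning or execution throw a confusing error later.
def assertAnalyzed(plan: LogicalPlan): Unit = {
  plan.foreach { node =>
    node.expressions.filterNot(_.resolved).foreach { expr =>
      throw new IllegalStateException(s"Unresolved expression '$expr' in plan node:\n$node")
    }
  }
}
{code}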



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles

2014-06-06 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020090#comment-14020090
 ] 

Matei Zaharia commented on SPARK-2044:
--

{quote}
Hi Matei, thanks for your reply. I will carefully read your doc and follow your 
work. So where should I start?
{quote}
I'm going to spend some time today completing some of the refactoring I started 
to do and then post it to my branch so you can work from there.

> Pluggable interface for shuffles
> 
>
> Key: SPARK-2044
> URL: https://issues.apache.org/jira/browse/SPARK-2044
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Attachments: Pluggableshuffleproposal.pdf
>
>
> Given that a lot of the current activity in Spark Core is in shuffles, I 
> wanted to propose factoring out shuffle implementations in a way that will 
> make experimentation easier. Ideally we will converge on one implementation, 
> but for a while, this could also be used to have several implementations 
> coexist. I'm suggesting this because I'm aware of at least three efforts to 
> look at shuffle (from Yahoo!, Intel and Databricks). Some of the things 
> people are investigating are:
> * Push-based shuffle where data moves directly from mappers to reducers
> * Sorting-based instead of hash-based shuffle, to create fewer files (helps a 
> lot with file handles and memory usage on large shuffles)
> * External spilling within a key
> * Changing the level of parallelism or even algorithm for downstream stages 
> at runtime based on statistics of the map output (this is a thing we had 
> prototyped in the Shark research project but never merged in core)
> I've attached a design doc with a proposed interface. It's not too crazy 
> because the interface between shuffles and the rest of the code is already 
> pretty narrow (just some iterators for reading data and a writer interface 
> for writing it). Bigger changes will be needed in the interaction with 
> DAGScheduler and BlockManager for some of the ideas above, but we can handle 
> those separately, and this interface will allow us to experiment with some 
> short-term stuff sooner.
> If things go well I'd also like to send a sort-based shuffle implementation 
> for 1.1, but we'll see how the timing on that works out.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles

2014-06-06 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020087#comment-14020087
 ] 

Matei Zaharia commented on SPARK-2044:
--

{quote}
1. Is it a goal to support more kinds of shuffle, e.g. moving the sort from the 
reducer to the mapper? If yes, it seems better to add an additional flag to 
ShuffleManager. I found similar statements on page 3.
{quote}
The goal is to allow diverse shuffle implementations, so it doesn't make sense 
to add a flag for it. If we add a flag, every ShuffleManager will need to 
implement this feature. Instead we're trying to make the smallest interface 
that the code consuming this data needs, so that we can try multiple 
implementations of ShuffleManager and see which of these features work best.

{quote}
??When the shuffle has no Aggregator (i.e. null or None is passed in), keys and 
values are simply sent across the network. Optionally we might allow the 
ShuffleManager to specify whether keys read from a ShuffleReader are sorted, or 
add a flag to registerShuffle that requests this for keys that have an 
Ordering. This would simplify grouping operators downstream (e.g. cogroup).??
Does this mean that ordering is an inherent property of the input data, or that 
it asks the ShuffleManager to perform sorting on the data?
{quote}
The Ordering object means that keys are comparable. This flag here would be to 
tell the ShuffleManager to sort the data, so that downstream algorithms like 
joins can work more efficiently.

{quote}
2. Is it a goal to support prefetching of map data on the reducer side?
{quote}
Again, this might be done by some implementations of ShuffleManager.

{quote}
3. For ShuffleReader, why is only a partition range allowed? How about extending 
this API to support multiple individual partitions? For example, if the reducer 
knows that partitions 1, 3, 5 are ready while 2, 4, 6 are not, the reducer can 
fetch 1, 3, 5 first. Instead of making 3 calls to getReader, making one call can 
reduce mapper-side disk seek operations, e.g. if partitions 3 and 5 are 
contiguous on one node.
{quote}
The reducer code shouldn't have to worry about what order to fetch things in. 
Instead, when you request a range, the ShuffleManager implementation can decide 
which partitions to fetch first based on what's available. The idea was that 
some code in DAGScheduler decides on the number of reduce tasks and their 
partition ranges (by looking at the map output size for each partition) and 
then the ShuffleManager on each node fetches the right partitions. Ranges are 
simpler to deal with than arbitrary sets and more space-efficient to represent 
(e.g. imagine we had 100,000 map tasks).

{quote}
4. I am not sure whether such a partition list or range should return one reader 
instance or multiple ones.
{quote}
It returns one reader that gathers and combines key-value pairs across all the 
partitions.
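
To make the shape of the proposal concrete, here is a rough sketch of the 
interfaces as described in this thread and the attached doc; the trait and method 
names are assumptions for illustration, not the final API:

{code}
// Rough sketch only; names and signatures are inferred from the discussion above.
trait ShuffleHandle { def shuffleId: Int }

trait ShuffleWriter[K, V] {
  def write(records: Iterator[Product2[K, V]]): Unit  // map-side output
  def stop(): Unit
}

trait ShuffleReader[K, C] {
  // Returns combined records for the whole registered partition range.
  def read(): Iterator[Product2[K, C]]
}

trait ShuffleManager {
  // Called on the driver when a shuffle is created; the handle is shipped to tasks.
  def registerShuffle[K, V, C](shuffleId: Int, numMaps: Int): ShuffleHandle

  // One writer per map task.
  def getWriter[K, V](handle: ShuffleHandle, mapId: Int): ShuffleWriter[K, V]

  // One reader per contiguous range of reduce partitions [startPartition, endPartition),
  // matching the point above about ranges rather than arbitrary partition sets.
  def getReader[K, C](handle: ShuffleHandle,
                      startPartition: Int,
                      endPartition: Int): ShuffleReader[K, C]
}
{code}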


> Pluggable interface for shuffles
> 
>
> Key: SPARK-2044
> URL: https://issues.apache.org/jira/browse/SPARK-2044
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Attachments: Pluggableshuffleproposal.pdf
>
>
> Given that a lot of the current activity in Spark Core is in shuffles, I 
> wanted to propose factoring out shuffle implementations in a way that will 
> make experimentation easier. Ideally we will converge on one implementation, 
> but for a while, this could also be used to have several implementations 
> coexist. I'm suggesting this because I'm aware of at least three efforts to 
> look at shuffle (from Yahoo!, Intel and Databricks). Some of the things 
> people are investigating are:
> * Push-based shuffle where data moves directly from mappers to reducers
> * Sorting-based instead of hash-based shuffle, to create fewer files (helps a 
> lot with file handles and memory usage on large shuffles)
> * External spilling within a key
> * Changing the level of parallelism or even algorithm for downstream stages 
> at runtime based on statistics of the map output (this is a thing we had 
> prototyped in the Shark research project but never merged in core)
> I've attached a design doc with a proposed interface. It's not too crazy 
> because the interface between shuffles and the rest of the code is already 
> pretty narrow (just some iterators for reading data and a writer interface 
> for writing it). Bigger changes will be needed in the interaction with 
> DAGScheduler and BlockManager for some of the ideas above, but we can handle 
> those separately, and this interface will allow us to experiment with some 
> short-term stuff sooner.
> If things go well I'd also like to send a sort-based shuffle implementation 
> for 1.1, but we'll see how the timing on that works out.



--
This message was sent by Atlass

[jira] [Commented] (SPARK-2026) Maven "hadoop*" Profiles Should Set the expected Hadoop Version.

2014-06-06 Thread Bernardo Gomez Palacio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020080#comment-14020080
 ] 

Bernardo Gomez Palacio commented on SPARK-2026:
---

https://github.com/apache/spark/pull/998

> Maven "hadoop*" Profiles Should Set the expected Hadoop Version.
> 
>
> Key: SPARK-2026
> URL: https://issues.apache.org/jira/browse/SPARK-2026
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Bernardo Gomez Palacio
>
> The Maven Profiles that refer to _hadoopX_, e.g. hadoop2.4, should set the 
> expected _hadoop.version_.
> e.g.
> {code}
> <profile>
>   <id>hadoop-2.4</id>
>   <properties>
>     <protobuf.version>2.5.0</protobuf.version>
>     <jets3t.version>0.9.0</jets3t.version>
>   </properties>
> </profile>
> {code}
> It is suggested to change it as follows:
> {code}
> <profile>
>   <id>hadoop-2.4</id>
>   <properties>
>     <hadoop.version>2.4.0</hadoop.version>
>     <yarn.version>${hadoop.version}</yarn.version>
>     <protobuf.version>2.5.0</protobuf.version>
>     <jets3t.version>0.9.0</jets3t.version>
>   </properties>
> </profile>
> {code}
> Builds can still pass the -Dhadoop.version option, but this change correctly 
> defaults the Hadoop version to the one expected for the selected profile.
> e.g.
> {code}
> $ mvn -P hadoop-2.4,yarn clean compile
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2009) Key not found exception when slow receiver starts

2014-06-06 Thread Vadim Chekan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020057#comment-14020057
 ] 

Vadim Chekan commented on SPARK-2009:
-

Folks,

The patch is under review here:
https://github.com/apache/spark/pull/961#issuecomment-45125185

> Key not found exception when slow receiver starts
> -
>
> Key: SPARK-2009
> URL: https://issues.apache.org/jira/browse/SPARK-2009
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Vadim Chekan
>
> I got "java.util.NoSuchElementException: key not found: 1401756085000 ms" 
> exception when using kafka stream and 1 sec batchPeriod.
> Investigation showed that the reason is that ReceiverLauncher.startReceivers 
> is asynchronous (started in a thread).
> https://github.com/vchekan/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L206
> In case of slow starting receiver, such as Kafka, it easily takes more than 
> 2sec to start. In result, no single "compute" will be called on 
> ReceiverInputDStream before first batch job is executed and receivedBlockInfo 
> remains empty (obviously). Batch job will cause 
> ReceiverInputDStream.getReceivedBlockInfo call and "key not found" exception.
> The patch makes getReceivedBlockInfo more robust by tolerating missing values.
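
A self-contained sketch of the tolerance described above, with simplified types 
(Long stands in for the batch Time, String for the block info); this is an 
illustration, not the code in the linked pull request:

{code}
import scala.collection.mutable

// Illustrative sketch: if a batch time has no recorded blocks yet (e.g. a
// slow-starting receiver), report "no blocks" instead of throwing
// NoSuchElementException: key not found.
class BlockInfoTracker {
  private val receivedBlockInfo = new mutable.HashMap[Long, mutable.ArrayBuffer[String]]

  def addBlock(batchTimeMs: Long, blockId: String): Unit = synchronized {
    receivedBlockInfo.getOrElseUpdate(batchTimeMs, new mutable.ArrayBuffer[String]) += blockId
  }

  def getReceivedBlockInfo(batchTimeMs: Long): Seq[String] = synchronized {
    receivedBlockInfo.get(batchTimeMs).map(_.toSeq).getOrElse(Seq.empty)
  }
}
{code}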



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-06-06 Thread Shuo Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020030#comment-14020030
 ] 

Shuo Xiang commented on SPARK-1977:
---

Update: I can also reproduce a similar error message with a larger data set (~3 GB).

> mutable.BitSet in ALS not serializable with KryoSerializer
> --
>
> Key: SPARK-1977
> URL: https://issues.apache.org/jira/browse/SPARK-1977
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Neville Li
>Priority: Minor
>
> OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
> KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
> register mutable.BitSet.
> Right now we have to register mutable.BitSet manually. A proper fix would be 
> using immutable.BitSet in ALS or registering mutable.BitSet in upstream chill.
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
> 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
> com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
> scala.collection.mutable.HashSet
> Serialization trace:
> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> org.apache.spark.scheduler.Task.run(Task.scala:51)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> java.lang.Thread.run(Thread.java:662)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
> 
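
As the issue description above notes, registering scala.collection.mutable.BitSet 
manually works around the failure. A minimal sketch of that manual registration, 
assuming a custom registrator wired in through spark.kryo.registrator (the class 
name is illustrative):

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Illustrative workaround sketch: explicitly register mutable.BitSet with Kryo.
class BitSetRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[scala.collection.mutable.BitSet])
  }
}

// Hooking it up when building the SparkConf:
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[BitSetRegistrator].getName)
{code}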

[jira] [Commented] (SPARK-2058) SPARK_CONF_DIR should override all present configs

2014-06-06 Thread Eugen Cepoi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019851#comment-14019851
 ] 

Eugen Cepoi commented on SPARK-2058:


Proposed fix : https://github.com/apache/spark/pull/997

> SPARK_CONF_DIR should override all present configs
> --
>
> Key: SPARK-2058
> URL: https://issues.apache.org/jira/browse/SPARK-2058
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.0, 1.0.1, 1.1.0
>Reporter: Eugen Cepoi
>Priority: Trivial
> Fix For: 1.0.1, 1.1.0
>
>
> When the user defines SPARK_CONF_DIR, I think Spark should use all the configs 
> available there, not only spark-env.
> This involves changing SparkSubmitArguments to first read from 
> SPARK_CONF_DIR, and updating the scripts to add SPARK_CONF_DIR to the 
> computed classpath for configs such as log4j, metrics, etc.
> I have already prepared a PR for this. 
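
A small, self-contained sketch of the lookup order being proposed; the helper 
below is illustrative only, not the actual SparkSubmitArguments or script change. 
The idea is to prefer SPARK_CONF_DIR and fall back to SPARK_HOME/conf when 
locating files such as spark-defaults.conf or log4j.properties:

{code}
import java.io.File

// Illustrative sketch: resolve a config file against SPARK_CONF_DIR first,
// then SPARK_HOME/conf, returning the first existing file.
def resolveConfFile(name: String): Option[File] = {
  val candidates =
    sys.env.get("SPARK_CONF_DIR").toSeq ++
      sys.env.get("SPARK_HOME").map(_ + File.separator + "conf").toSeq
  candidates.map(dir => new File(dir, name)).find(_.isFile)
}

// e.g. resolveConfFile("spark-defaults.conf") or resolveConfFile("log4j.properties")
{code}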



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1308) Add partitions() method to PySpark RDDs

2014-06-06 Thread Syed A. Hashmi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019742#comment-14019742
 ] 

Syed A. Hashmi commented on SPARK-1308:
---

Created pull request https://github.com/apache/spark/pull/995 to address this 
issue.

[~matei]: Can you please assign this JIRA to me and review the PR?

> Add partitions() method to PySpark RDDs
> ---
>
> Key: SPARK-1308
> URL: https://issues.apache.org/jira/browse/SPARK-1308
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 0.9.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> In Spark, you can do this:
> {code}
> // Scala
> val a = sc.parallelize(List(1, 2, 3, 4), 4)
> a.partitions.size
> {code}
> Please make this possible in PySpark too.
> The work-around available is quite simple:
> {code}
> # Python
> a = sc.parallelize([1, 2, 3, 4], 4)
> a._jrdd.splits().size()
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2058) SPARK_CONF_DIR should override all present configs

2014-06-06 Thread Eugen Cepoi (JIRA)
Eugen Cepoi created SPARK-2058:
--

 Summary: SPARK_CONF_DIR should override all present configs
 Key: SPARK-2058
 URL: https://issues.apache.org/jira/browse/SPARK-2058
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.0.0, 1.0.1, 1.1.0
Reporter: Eugen Cepoi
Priority: Trivial
 Fix For: 1.0.1, 1.1.0


When the user defines SPARK_CONF_DIR, I think Spark should use all the configs 
available there, not only spark-env.
This involves changing SparkSubmitArguments to first read from SPARK_CONF_DIR, 
and updating the scripts to add SPARK_CONF_DIR to the computed classpath for 
configs such as log4j, metrics, etc.

I have already prepared a PR for this. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2057) run-example can only be ran under spark_home

2014-06-06 Thread maji2014 (JIRA)
maji2014 created SPARK-2057:
---

 Summary: run-example can only be ran under spark_home
 Key: SPARK-2057
 URL: https://issues.apache.org/jira/browse/SPARK-2057
 Project: Spark
  Issue Type: Bug
Reporter: maji2014
Priority: Minor
 Fix For: 1.0.0


The old code can only be run from spark_home using "bin/run-example".
The error "./run-example: line 55: ./bin/spark-submit: No such file or directory" 
appears when running from any other directory.

The script should invoke "$FWDIR"/bin/spark-submit instead of the relative 
./bin/spark-submit.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-06-06 Thread Santiago M. Mola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019655#comment-14019655
 ] 

Santiago M. Mola edited comment on SPARK-1977 at 6/6/14 7:55 AM:
-

I can reproduce this depending on the size of the dataset:

{noformat}
spark-submit mllib-movielens-evaluation-assembly-1.0.jar --master 
spark://mllib1:7077
--class com.example.MovieLensALS --rank 10 --numIterations 20 --lambda 1.0 
--kryo
hdfs:/movielens/oversampled.dat
{noformat}

The exception will not be thrown for small datasets. It will successfully run 
with MovieLens 100k and 10M. However, when I run it on a 100M dataset, the 
exception will be thrown.

My MovieLensALS is mostly the same as the one shipped with Spark. I just added 
cross-validation. Rating is registered in Kryo just as in the stock example.

{noformat}
# cat RELEASE 
Spark 1.0.0 built for Hadoop 2.2.0
{noformat}




was (Author: smolav):
I can reproduce this depending on the size of the dataset:

{noformat}
spark-submit mllib-movielens-evaluation-assembly-1.0.jar --master 
spark://mllib1:7077
--class com.example.MovieLensALS --rank 10 --numIterations 20 --lambda 1.0 
--kryo
hdfs:/movielens/oversampled.dat
{noformat}

The exception will not be triggered for small datasets. It will successfully run 
with MovieLens 100k and 10M. However, when I run it on a 100M dataset, the 
exception will be triggered.

My MovieLensALS is mostly the same as the one shipped with Spark. I just added 
cross-validation. Rating is registered in Kryo just as in the stock example.

{noformat}
# cat RELEASE 
Spark 1.0.0 built for Hadoop 2.2.0
{noformat}



> mutable.BitSet in ALS not serializable with KryoSerializer
> --
>
> Key: SPARK-1977
> URL: https://issues.apache.org/jira/browse/SPARK-1977
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Neville Li
>Priority: Minor
>
> OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
> KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
> register mutable.BitSet.
> Right now we have to register mutable.BitSet manually. A proper fix would be 
> using immutable.BitSet in ALS or registering mutable.BitSet in upstream chill.
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
> 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
> com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
> scala.collection.mutable.HashSet
> Serialization trace:
> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheMa

[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-06-06 Thread Santiago M. Mola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019655#comment-14019655
 ] 

Santiago M. Mola commented on SPARK-1977:
-

I can reproduce this depending on the size of the dataset:

{noformat}
spark-submit mllib-movielens-evaluation-assembly-1.0.jar --master 
spark://mllib1:7077
--class com.example.MovieLensALS --rank 10 --numIterations 20 --lambda 1.0 
--kryo
hdfs:/movielens/oversampled.dat
{noformat}

The exception will not be triggered for small datasets. It will successfully run 
with MovieLens 100k and 10M. However, when I run it on a 100M dataset, the 
exception will be triggered.

My MovieLensALS is mostly the same as the one shipped with Spark. I just added 
cross-validation. Rating is registered in Kryo just as in the stock example.

{noformat}
# cat RELEASE 
Spark 1.0.0 built for Hadoop 2.2.0
{noformat}



> mutable.BitSet in ALS not serializable with KryoSerializer
> --
>
> Key: SPARK-1977
> URL: https://issues.apache.org/jira/browse/SPARK-1977
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Neville Li
>Priority: Minor
>
> OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
> KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
> register mutable.BitSet.
> Right now we have to register mutable.BitSet manually. A proper fix would be 
> using immutable.BitSet in ALS or registering mutable.BitSet in upstream chill.
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
> 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
> com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
> scala.collection.mutable.HashSet
> Serialization trace:
> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> org.apache.spark.scheduler.Task.run(Task.scala:51)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> java.lang.Thread.run(Thread.java:662)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortS