[jira] [Commented] (SPARK-1374) Python API for running SQL queries

2014-04-09 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963922#comment-13963922
 ] 

Michael Armbrust commented on SPARK-1374:
-

https://github.com/apache/spark/pull/363

> Python API for running SQL queries
> --
>
> Key: SPARK-1374
> URL: https://issues.apache.org/jira/browse/SPARK-1374
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Ahir Reddy
>Priority: Blocker
> Fix For: 1.0.0
>
>






[jira] [Resolved] (SPARK-1081) Annotate developer and experimental API's [Core]

2014-04-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1081.


Resolution: Fixed

> Annotate developer and experimental API's [Core]
> 
>
> Key: SPARK-1081
> URL: https://issues.apache.org/jira/browse/SPARK-1081
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>
> We should annotate APIs that are internal despite being java/scala private. 
> An example is the internal listener interface.
> The main issue is figuring out the nicest way we can do this in Scala and 
> deciding how we document it in the Scala docs.
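For illustration, a minimal stand-alone sketch of the kind of marker being discussed. Spark 1.0 ended up with annotations along these lines in org.apache.spark.annotation; the DeveloperApi class below is a local stand-in for that idea, not Spark's code.

{code}
import scala.annotation.StaticAnnotation

// Local stand-in for illustration only; Spark's real marker annotations live in
// org.apache.spark.annotation.
class DeveloperApi extends StaticAnnotation

// A listener interface that must stay public for wiring reasons, but is not a
// stable user-facing API, gets flagged rather than hidden.
@DeveloperApi
trait InternalListener {
  def onEvent(event: String): Unit
}
{code}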





[jira] [Commented] (SPARK-542) Cache Miss when machine have multiple hostname

2014-04-09 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963926#comment-13963926
 ] 

Shixiong Zhu commented on SPARK-542:


Using an IP has a similar problem: a machine can have multiple IPs.

> Cache Miss when machine have multiple hostname
> --
>
> Key: SPARK-542
> URL: https://issues.apache.org/jira/browse/SPARK-542
> Project: Spark
>  Issue Type: Bug
>Reporter: frankvictor
>
> Hi, I encountered weird PageRank runtimes in the last few days.
> After debugging the job, I found it was caused by DNS names.
> The machines in my cluster have multiple hostnames; for example, slave 1 has 
> the names c001 and c001.cm.cluster.
> When Spark adds a cache entry in CacheTracker, it gets "c001" and registers 
> the cache under that name.
> But when scheduling tasks in SimpleJob, the Mesos offer gives Spark 
> "c001.cm.cluster",
> so it will never get a preferred location!
> I think Spark should handle the multiple-hostname case (by using the IP instead 
> of the hostname, or some other method).
> Thanks!
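As one concrete form of the "some other method" mentioned above, a hedged sketch (not Spark code) of normalizing every name to a canonical hostname before comparing cache locations with scheduler offers:

{code}
import java.net.InetAddress

// Hedged sketch only: resolve any alias (c001, c001.cm.cluster, an IP, ...)
// to one canonical hostname so cache locations and scheduler offers compare equal.
def canonicalHost(name: String): String =
  InetAddress.getByName(name).getCanonicalHostName

// canonicalHost("c001") and canonicalHost("c001.cm.cluster") should then agree,
// assuming DNS/reverse-DNS is configured consistently on the cluster.
{code}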





[jira] [Assigned] (SPARK-1215) Clustering: Index out of bounds error

2014-04-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-1215:


Assignee: Xiangrui Meng

> Clustering: Index out of bounds error
> -
>
> Key: SPARK-1215
> URL: https://issues.apache.org/jira/browse/SPARK-1215
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: dewshick
>Assignee: Xiangrui Meng
>
> code:
> import org.apache.spark.mllib.clustering._
> val test = sc.makeRDD(Array(4,4,4,4,4).map(e => Array(e.toDouble)))
> val kmeans = new KMeans().setK(4)
> kmeans.run(test) fails with java.lang.ArrayIndexOutOfBoundsException
> error:
> 14/01/17 12:35:54 INFO scheduler.DAGScheduler: Stage 25 (collectAsMap at 
> KMeans.scala:243) finished in 0.047 s
> 14/01/17 12:35:54 INFO spark.SparkContext: Job finished: collectAsMap at 
> KMeans.scala:243, took 16.389537116 s
> Exception in thread "main" java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.simontuffs.onejar.Boot.run(Boot.java:340)
>   at com.simontuffs.onejar.Boot.main(Boot.java:166)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
>   at 
> org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:47)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:247)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.foreach(Range.scala:81)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.map(Range.scala:46)
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:244)
>   at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:124)
>   at Clustering$$anonfun$1.apply$mcDI$sp(Clustering.scala:21)
>   at Clustering$$anonfun$1.apply(Clustering.scala:19)
>   at Clustering$$anonfun$1.apply(Clustering.scala:19)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.foreach(Range.scala:78)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.map(Range.scala:46)
>   at Clustering$.main(Clustering.scala:19)
>   at Clustering.main(Clustering.scala)
>   ... 6 more





[jira] [Created] (SPARK-1452) dynamic partition creation not working on cached table

2014-04-09 Thread Jai Kumar Singh (JIRA)
Jai Kumar Singh created SPARK-1452:
--

 Summary: dynamic partition creation not working on cached table
 Key: SPARK-1452
 URL: https://issues.apache.org/jira/browse/SPARK-1452
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
 Environment: Shark git 
commit dfc0e81366c0e1d0293ecf9b490eeabcc2a9c904
Merge: 517ebca 7652f0d

Reporter: Jai Kumar Singh


Dynamic partition creation via a Shark QL command is not working with a cached 
table, though it works fine with non-cached tables.

shark> desc sample;
OK
cid string  None
host    string  None
url string  None
bytes   int None
pckts   int None
app string  None
cat string  None
Time taken: 0.149 seconds
shark> 

shark> desc sample_cached;
OK
cat string  from deserializer   
host    string  from deserializer   
cid string  None
 
# Partition Information  
# col_name  data_type   comment 
 
cid string  None
Time taken: 0.15 seconds
shark> 

shark> insert into table sample_cached partition(cid) select cat,host,cid from 
sample;
FAILED: Hive Internal Error: java.lang.NullPointerException(null)
shark> 

shark> insert into table sample_cached partition(cid="my-cid") select cat,host 
from sample limit 20;
java.lang.InstantiationException: scala.Some
Continuing ...
java.lang.RuntimeException: failed to evaluate: =Class.new();
Continuing ...
Loading data to table default.sample_cached partition (cid=my-cid)
OK
Time taken: 64.268 seconds






[jira] [Updated] (SPARK-1452) dynamic partition creation not working on cached table

2014-04-09 Thread Jai Kumar Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jai Kumar Singh updated SPARK-1452:
---

Description: 
Dynamic partition creation via a Shark QL command is not working with a cached 
table, though it works fine with non-cached tables.
Static partitioning also works fine with a cached table. 


shark> desc sample;
OK
cid string  None
host    string  None
url string  None
bytes   int None
pckts   int None
app string  None
cat string  None
Time taken: 0.149 seconds
shark> 

shark> desc sample_cached;
OK
cat string  from deserializer   
host    string  from deserializer   
cid string  None
 
# Partition Information  
# col_name  data_type   comment 
 
cid string  None
Time taken: 0.15 seconds
shark> 

shark> insert into table sample_cached partition(cid) select cat,host,cid from 
sample;
FAILED: Hive Internal Error: java.lang.NullPointerException(null)
shark> 

shark> insert into table sample_cached partition(cid="my-cid") select cat,host 
from sample limit 20;
java.lang.InstantiationException: scala.Some
Continuing ...
java.lang.RuntimeException: failed to evaluate: =Class.new();
Continuing ...
Loading data to table default.sample_cached partition (cid=my-cid)
OK
Time taken: 64.268 seconds



I am logging this issue here because 
https://spark-project.atlassian.net/browse/SHARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:issues-panel
 is not allowing me to log the issue there.

  was:
Dynamic partition creation via a Shark QL command is not working with a cached 
table, though it works fine with non-cached tables.

shark> desc sample;
OK
cid string  None
host    string  None
url string  None
bytes   int None
pckts   int None
app string  None
cat string  None
Time taken: 0.149 seconds
shark> 

shark> desc sample_cached;
OK
cat string  from deserializer   
host    string  from deserializer   
cid string  None
 
# Partition Information  
# col_name  data_type   comment 
 
cid string  None
Time taken: 0.15 seconds
shark> 

shark> insert into table sample_cached partition(cid) select cat,host,cid from 
sample;
FAILED: Hive Internal Error: java.lang.NullPointerException(null)
shark> 

shark> insert into table sample_cached partition(cid="my-cid") select cat,host 
from sample limit 20;
java.lang.InstantiationException: scala.Some
Continuing ...
java.lang.RuntimeException: failed to evaluate: =Class.new();
Continuing ...
Loading data to table default.sample_cached partition (cid=my-cid)
OK
Time taken: 64.268 seconds



> dynamic partition creation not working on cached table
> --
>
> Key: SPARK-1452
> URL: https://issues.apache.org/jira/browse/SPARK-1452
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 0.9.0
> Environment: Shark git 
> commit dfc0e81366c0e1d0293ecf9b490eeabcc2a9c904
> Merge: 517ebca 7652f0d
>Reporter: Jai Kumar Singh
>  Labels: Shark
>
> Dynamic partition creation via a Shark QL command is not working with a cached 
> table, though it works fine with non-cached tables.
> Static partitioning also works fine with a cached table. 
> shark> desc sample;
> OK
> cid string  None
> host    string  None
> url string  None
> bytes   int None
> pckts   int None
> app string  None
> cat string  None
> Time taken: 0.149

[jira] [Resolved] (SPARK-1357) [MLLIB] Annotate developer and experimental API's

2014-04-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1357.


Resolution: Fixed

> [MLLIB] Annotate developer and experimental API's
> -
>
> Key: SPARK-1357
> URL: https://issues.apache.org/jira/browse/SPARK-1357
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Patrick Wendell
>Assignee: Xiangrui Meng
> Fix For: 1.0.0
>
>






[jira] [Resolved] (SPARK-1300) Clean-up and clarify private vs public fields in MLLib

2014-04-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1300.


Resolution: Fixed

> Clean-up and clarify private vs public fields in MLLib
> --
>
> Key: SPARK-1300
> URL: https://issues.apache.org/jira/browse/SPARK-1300
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Patrick Wendell
>Assignee: Xiangrui Meng
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Some of the MLLib implementations have random internal varaibles that are 
> exposed and should be made private.
> For the regression package, there are a few fields (optimizer, updater, 
> gradient) that we should just make private[spark].
> For other internal components we should annotate them as semi-private. I.e. 
> they are deeper API's for more advanced developers.





[jira] [Updated] (SPARK-1338) Create Additional Style Rules

2014-04-09 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-1338:
---

Fix Version/s: 1.0.0

> Create Additional Style Rules
> -
>
> Key: SPARK-1338
> URL: https://issues.apache.org/jira/browse/SPARK-1338
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Cogan
>Assignee: Prashant Sharma
> Fix For: 1.0.0
>
>
> There are a few other rules that would be helpful to have. Also we should add 
> tests for these rules because it's easy to get them wrong. I gave some 
> example comparisons from a javascript style checker.
> Require spaces in type declarations:
> def foo:String = X // no
> def foo: String = XXX
> def x:Int = 100 // no
> val x: Int = 100
> Require spaces after keywords:
> if(x - 3) // no
> if (x + 10)
> See: requireSpaceAfterKeywords from
> https://github.com/mdevils/node-jscs
> Disallow spaces inside of parentheses:
> val x = ( 3 + 5 ) // no
> val x = (3 + 5)
> See: disallowSpacesInsideParentheses from
> https://github.com/mdevils/node-jscs
> Require spaces before and after binary operators:
> See: requireSpaceBeforeBinaryOperators
> See: disallowSpaceAfterBinaryOperators
> from https://github.com/mdevils/node-jscs





[jira] [Assigned] (SPARK-1338) Create Additional Style Rules

2014-04-09 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma reassigned SPARK-1338:
--

Assignee: Prashant Sharma  (was: prashant)

> Create Additional Style Rules
> -
>
> Key: SPARK-1338
> URL: https://issues.apache.org/jira/browse/SPARK-1338
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Cogan
>Assignee: Prashant Sharma
> Fix For: 1.0.0
>
>
> There are a few other rules that would be helpful to have. Also we should add 
> tests for these rules because it's easy to get them wrong. I gave some 
> example comparisons from a javascript style checker.
> Require spaces in type declarations:
> def foo:String = X // no
> def foo: String = XXX
> def x:Int = 100 // no
> val x: Int = 100
> Require spaces after keywords:
> if(x - 3) // no
> if (x + 10)
> See: requireSpaceAfterKeywords from
> https://github.com/mdevils/node-jscs
> Disallow spaces inside of parentheses:
> val x = ( 3 + 5 ) // no
> val x = (3 + 5)
> See: disallowSpacesInsideParentheses from
> https://github.com/mdevils/node-jscs
> Require spaces before and after binary operators:
> See: requireSpaceBeforeBinaryOperators
> See: disallowSpaceAfterBinaryOperators
> from https://github.com/mdevils/node-jscs





[jira] [Commented] (SPARK-1357) [MLLIB] Annotate developer and experimental API's

2014-04-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963999#comment-13963999
 ] 

Sean Owen commented on SPARK-1357:
--

I know I'm late to this party, but I just had a look and wanted to throw out a 
few last minute ideas.

(Do you not want to just declare all of MLlib experimental? Is it really 1.0? 
That's a fairly significant set of shackles to put on for a long time.)

OK, that aside, I have two suggestions to mark as experimental:

1. The ALS Rating object assumes users and items are Ints. I suggest it will 
eventually be interesting to support String, or at least to switch to Long.

2. Per old MLLIB-29, I feel pretty certain that ClassificationModel can't 
return RDD[Double], and will want to support returning a distribution over 
labels at some point. Similarly the input to it and RegressionModel seems like 
it will have to change to encompass something more than Vector to properly 
allow for categorical values. DecisionTreeModel has the same issue but is 
experimental (and doesn't integrate with these APIs?)

The point is not so much whether one agrees with these, but whether there is a 
non-trivial chance of wanting to change something this year.

Other parts that I'm interested in personally look pretty strong. Humbly 
submitted.
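For concreteness, a hedged sketch of suggestion 1; MLlib's Rating at the time was a case class with Int user and product fields, and the LongRating below is purely illustrative:

{code}
// Illustrative only -- not an MLlib class. Long ids leave room for hashed or
// very large user/item id spaces that do not fit in an Int.
case class LongRating(user: Long, product: Long, rating: Double)

val r = LongRating(8589934592L, 42L, 4.5)   // user id larger than Int.MaxValue
{code}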

> [MLLIB] Annotate developer and experimental API's
> -
>
> Key: SPARK-1357
> URL: https://issues.apache.org/jira/browse/SPARK-1357
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Patrick Wendell
>Assignee: Xiangrui Meng
> Fix For: 1.0.0
>
>






[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't

2014-04-09 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964195#comment-13964195
 ] 

Andrew Ash commented on SPARK-1021:
---

I tried making that val lazy, and a cluster job was still launched, so it will 
need to be a more involved fix than just adding lazy.

The PR for that is probably somewhere on an old github repo.

> sortByKey() launches a cluster job when it shouldn't
> 
>
> Key: SPARK-1021
> URL: https://issues.apache.org/jira/browse/SPARK-1021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Andrew Ash
>  Labels: starter
>
> The sortByKey() method is listed as a transformation, not an action, in the 
> documentation.  But it launches a cluster job regardless.
> http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
> Some discussion on the mailing list suggested that this is a problem with the 
> rdd.count() call inside Partitioner.scala's rangeBounds method.
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
> Josh Rosen suggests that rangeBounds should be made into a lazy variable:
> {quote}
> I wonder whether making RangePartitioner.rangeBounds into a lazy val would 
> fix this 
> (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
>   We'd need to make sure that rangeBounds() is never called before an action 
> is performed.  This could be tricky because it's called in the 
> RangePartitioner.equals() method.  Maybe it's sufficient to just compare the 
> number of partitions, the ids of the RDDs used to create the 
> RangePartitioner, and the sort ordering.  This still supports the case where 
> I range-partition one RDD and pass the same partitioner to a different RDD.  
> It breaks support for the case where two range partitioners created on 
> different RDDs happened to have the same rangeBounds(), but it seems unlikely 
> that this would really harm performance since it's probably unlikely that the 
> range partitioners are equal by chance.
> {quote}
> Can we please make this happen?  I'll send a PR on GitHub to start the 
> discussion and testing.





[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't

2014-04-09 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964207#comment-13964207
 ] 

Andrew Ash commented on SPARK-1021:
---

https://github.com/ash211/spark/commit/a62e828234d5b69585495593730032f2877932ae





> sortByKey() launches a cluster job when it shouldn't
> 
>
> Key: SPARK-1021
> URL: https://issues.apache.org/jira/browse/SPARK-1021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Andrew Ash
>  Labels: starter
>
> The sortByKey() method is listed as a transformation, not an action, in the 
> documentation.  But it launches a cluster job regardless.
> http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
> Some discussion on the mailing list suggested that this is a problem with the 
> rdd.count() call inside Partitioner.scala's rangeBounds method.
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
> Josh Rosen suggests that rangeBounds should be made into a lazy variable:
> {quote}
> I wonder whether making RangePartitioner.rangeBounds into a lazy val would 
> fix this 
> (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
>   We'd need to make sure that rangeBounds() is never called before an action 
> is performed.  This could be tricky because it's called in the 
> RangePartitioner.equals() method.  Maybe it's sufficient to just compare the 
> number of partitions, the ids of the RDDs used to create the 
> RangePartitioner, and the sort ordering.  This still supports the case where 
> I range-partition one RDD and pass the same partitioner to a different RDD.  
> It breaks support for the case where two range partitioners created on 
> different RDDs happened to have the same rangeBounds(), but it seems unlikely 
> that this would really harm performance since it's probably unlikely that the 
> range partitioners are equal by chance.
> {quote}
> Can we please make this happen?  I'll send a PR on GitHub to start the 
> discussion and testing.





[jira] [Commented] (SPARK-1063) Add .sortBy(f) method on RDD

2014-04-09 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964221#comment-13964221
 ] 

Andrew Ash commented on SPARK-1063:
---

https://github.com/apache/spark/pull/369

> Add .sortBy(f) method on RDD
> 
>
> Key: SPARK-1063
> URL: https://issues.apache.org/jira/browse/SPARK-1063
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Andrew Ash
>Assignee: Andrew Ash
>
> Based on this GitHub PR: https://github.com/apache/incubator-spark/pull/508
> I've written the below so many times that I think it'd be broadly useful to 
> have a .sortBy(f) method on RDD:
> {code}
>   .keyBy{l =>  }
>   .sortByKey()
>   .map(_._2)
> {code}
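For comparison, a hedged usage sketch (assumes an existing SparkContext `sc`) of the pattern above next to what the proposed method would allow:

{code}
// Sketch only: sort an RDD of (word, count) pairs by count.
val counts = sc.parallelize(Seq(("a", 3), ("b", 1), ("c", 2)))

// Today's pattern, as described above:
val sortedManually = counts.keyBy(_._2).sortByKey().map(_._2)

// With the proposed API:
val sortedDirectly = counts.sortBy(_._2)
{code}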





[jira] [Comment Edited] (SPARK-1063) Add .sortBy(f) method on RDD

2014-04-09 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964221#comment-13964221
 ] 

Andrew Ash edited comment on SPARK-1063 at 4/9/14 2:40 PM:
---

The old PR is gone now (and was never merged) so I'm re-aiming at the 
apache/spark repo here:

https://github.com/apache/spark/pull/369


was (Author: aash):
https://github.com/apache/spark/pull/369

> Add .sortBy(f) method on RDD
> 
>
> Key: SPARK-1063
> URL: https://issues.apache.org/jira/browse/SPARK-1063
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Andrew Ash
>Assignee: Andrew Ash
>
> Based on this GitHub PR: https://github.com/apache/incubator-spark/pull/508
> I've written the below so many times that I think it'd be broadly useful to 
> have a .sortBy(f) method on RDD:
> {code}
>   .keyBy{l =>  }
>   .sortByKey()
>   .map(_._2)
> {code}





[jira] [Created] (SPARK-1453) Improve the way Spark on Yarn waits for executors before starting

2014-04-09 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-1453:


 Summary: Improve the way Spark on Yarn waits for executors before 
starting
 Key: SPARK-1453
 URL: https://issues.apache.org/jira/browse/SPARK-1453
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves
Assignee: Thomas Graves


Currently Spark on YARN just delays a few seconds between when the Spark 
context is initialized and when it allows the job to start.  If you are on a 
busy Hadoop cluster it might take longer to get the requested number of executors. 

At the very least we could make this timeout a configurable value.  It is 
currently hardcoded to 3 seconds.  
Better yet would be to allow the user to specify a minimum number of executors 
to wait for, but that looks much more complex. 
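A hedged sketch of the first option; the property name is invented for illustration and `sparkConf` is assumed to be an in-scope SparkConf:

{code}
// Sketch only: read the wait from SparkConf instead of hardcoding 3 seconds.
// "spark.yarn.executorLaunchWait.ms" is a hypothetical key, not an actual config.
val waitMs = sparkConf.getLong("spark.yarn.executorLaunchWait.ms", 3000L)
Thread.sleep(waitMs)
{code}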






[jira] [Assigned] (SPARK-1446) Spark examples should not do a System.exit

2014-04-09 Thread Sandeep Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Singh reassigned SPARK-1446:


Assignee: Sandeep Singh

> Spark examples should not do a System.exit
> --
>
> Key: SPARK-1446
> URL: https://issues.apache.org/jira/browse/SPARK-1446
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Assignee: Sandeep Singh
>
> The Spark examples should exit cleanly (sparkContext.stop()) rather than doing a 
> System.exit. The System.exit can cause issues like in SPARK-1407.  
> SparkHdfsLR and JavaWordCount both do a System.exit. We should look through 
> all the examples.
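A minimal sketch of the suggested pattern (the object name and the job body are placeholders):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object ExampleApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ExampleApp"))
    try {
      sc.parallelize(1 to 100).count()   // placeholder for the example's real work
    } finally {
      sc.stop()   // instead of System.exit, so shutdown and cleanup happen normally
    }
  }
}
{code}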





[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size

2014-04-09 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964361#comment-13964361
 ] 

Shivaram Venkataraman commented on SPARK-1391:
--

I tried to run with this patch yesterday, but unfortunately I don't think the 
non-local jobs were triggered during my run. I will try to synthetically force 
non-local tasks the next time around to verify this.

> BlockManager cannot transfer blocks larger than 2G in size
> --
>
> Key: SPARK-1391
> URL: https://issues.apache.org/jira/browse/SPARK-1391
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Shuffle
>Affects Versions: 1.0.0
>Reporter: Shivaram Venkataraman
>Assignee: Min Zhou
> Attachments: SPARK-1391.diff
>
>
> If a task tries to remotely access a cached RDD block, I get an exception 
> when the block size is > 2G. The exception is pasted below.
> Memory capacities are huge these days (> 60G), and many workflows depend on 
> having large blocks in memory, so it would be good to fix this bug.
> I don't know if the same thing happens on shuffles if one transfer (from 
> mapper to reducer) is > 2G.
> {noformat}
> 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer 
> message
> java.lang.ArrayIndexOutOfBoundsException
> at 
> it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
> at 
> it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
> at 
> it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
> at 
> org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
> at 
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
> at 
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
> at 
> org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
> at 
> org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
> at 
> org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
> at 
> org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
> at 
> org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at 
> org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at 
> org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
> at 
> org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
> at 
> org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
> at 
> org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661)
> at 
> org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}
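For context on where the 2G boundary comes from (a general JVM fact, not a claim about the exact Spark code path): a JVM array is indexed by Int, so any code that buffers a whole block in a single byte array tops out at Int.MaxValue bytes.

{code}
// A single JVM array holds at most Int.MaxValue elements, so a block buffered
// into one byte array cannot exceed ~2.1e9 bytes.
val maxSingleArrayBytes: Long = Int.MaxValue              // 2147483647
val twoGiB: Long              = 2L * 1024 * 1024 * 1024   // 2147483648, already over
{code}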





[jira] [Commented] (SPARK-1446) Spark examples should not do a System.exit

2014-04-09 Thread Sandeep Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964390#comment-13964390
 ] 

Sandeep Singh commented on SPARK-1446:
--

https://github.com/apache/spark/pull/370

> Spark examples should not do a System.exit
> --
>
> Key: SPARK-1446
> URL: https://issues.apache.org/jira/browse/SPARK-1446
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Assignee: Sandeep Singh
>
> The Spark examples should exit cleanly (sparkContext.stop()) rather than doing a 
> System.exit. The System.exit can cause issues like in SPARK-1407.  
> SparkHdfsLR and JavaWordCount both do a System.exit. We should look through 
> all the examples.





[jira] [Commented] (SPARK-1357) [MLLIB] Annotate developer and experimental API's

2014-04-09 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964405#comment-13964405
 ] 

Xiangrui Meng commented on SPARK-1357:
--

Hi Sean, 

Actually, you came in just in time. This was only the first pass, and we are 
still accepting API visibility/annotation patches during the QA period. MLlib 
is still a beta component of Spark, so "1.0" doesn't mean it is stable. And we 
still accept additions (JIRA submitted before April 1) to MLlib, as Patrick 
announced in the dev mailing list.

(I do want to mark all of MLlib experimental to reserve the right to change in 
the future, but we need to find a balance point here.)

I agree that it is future-proof to switch id type from Int to Long in ALS. The 
extra storage requirement is 8 bytes per rating. Inside ALS, we also 
re-partition the ratings, which needs extra storage. We need to consider 
whether we want to switch to Long completely or provide an option to use Long 
ids. Could you submit a patch, either marking ALS experimental or allowing 
using Long ids?

I don't think a String type is necessary because we can always create a map 
between String ids and Long ids. A String id usually costs more than a Long id. 
For the same reason, classification uses Double for labels.

Please submit a patch for any APIs you are not comfortable calling "stable", or 
any I marked "experimental/developer" that you think should be the other way. It 
would be great to keep the discussion going. Thanks!

Best,
Xiangrui

> [MLLIB] Annotate developer and experimental API's
> -
>
> Key: SPARK-1357
> URL: https://issues.apache.org/jira/browse/SPARK-1357
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Patrick Wendell
>Assignee: Xiangrui Meng
> Fix For: 1.0.0
>
>






[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib

2014-04-09 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964430#comment-13964430
 ] 

Xiangrui Meng commented on SPARK-1406:
--

Thanks for sharing your thoughts! Feature transformation is part of the PMML 
standard, which provides primitives to describe feature transformations. Describing 
feature transformations is very hard in practice; in most cases the result is 
ad-hoc, non-exchangeable code that is hard to reuse. I'm not a fan of XML, but as 
you mentioned, PMML is the de facto serialization format.

I feel that supporting feature transformation in PMML is as important as -- if 
not more important than -- supporting exporting models to PMML. In particular, the 
former provides an entry point into MLlib while the latter provides an exit. (I 
admit that I'm a little selfish on this point.) By the way, the Google Prediction 
API only supports PMML's feature transformations: 
https://developers.google.com/prediction/docs/pmml-schema

> PMML model evaluation support via MLib
> --
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>
> It would be useful if spark would provide support the evaluation of PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would allow to use analytical models that were created with a 
> statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which 
> would perform the actual model evaluation for a given input tuple. The PMML 
> model would then just contain the "parameterization" of an analytical model.
> Other projects like JPMML-Evaluator do a similar thing.
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator





[jira] [Created] (SPARK-1454) PySpark accumulators fail to update when runJob takes serialized/captured closures

2014-04-09 Thread William Benton (JIRA)
William Benton created SPARK-1454:
-

 Summary: PySpark accumulators fail to update when runJob takes 
serialized/captured closures
 Key: SPARK-1454
 URL: https://issues.apache.org/jira/browse/SPARK-1454
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.0.0
Reporter: William Benton
Priority: Minor


My patch for SPARK-729 optionally serializes closures when they are cleaned (in 
order to capture the values of mutable free variables at declaration time 
rather than at execution time).  This behavior is currently disabled for the 
closure argument to SparkContext.runJob, because enabling it there causes 
Python accumulators to fail to update.

The purpose of this JIRA is to note this issue and fix whatever is causing 
Python accumulators to behave this way so that closures passed to runJob can be 
captured in general.





[jira] [Commented] (SPARK-1454) PySpark accumulators fail to update when runJob takes serialized/captured closures

2014-04-09 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964524#comment-13964524
 ] 

William Benton commented on SPARK-1454:
---

Can someone please assign this to me?

> PySpark accumulators fail to update when runJob takes serialized/captured 
> closures
> --
>
> Key: SPARK-1454
> URL: https://issues.apache.org/jira/browse/SPARK-1454
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.0.0
>Reporter: William Benton
>Priority: Minor
>
> My patch for SPARK-729 optionally serializes closures when they are cleaned 
> (in order to capture the values of mutable free variables at declaration time 
> rather than at execution time).  This behavior is currently disabled for the 
> closure argument to SparkContext.runJob, because enabling it there causes 
> Python accumulators to fail to update.
> The purpose of this JIRA is to note this issue and fix whatever is causing 
> Python accumulators to behave this way so that closures passed to runJob can 
> be captured in general.





[jira] [Commented] (SPARK-1392) Local spark-shell Runs Out of Memory With Default Settings

2014-04-09 Thread Pat McDonough (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964549#comment-13964549
 ] 

Pat McDonough commented on SPARK-1392:
--

CC: [~matei] [~tdas] [~adav] [~pwendell]

Here's that OOM problem I talked to you guys about. I think TD's suggestion 
works best:

 * create a new property that sets the minimum amount of heap reserved for the 
system/application. Something similar to: {{spark.system.memoryReservedSize}} 
 * account for that value prior to calculating memory available for storage and 
shuffle
 * set the new property to 300m by default (based on what we are seeing in a 
local spark-shell running JDK7 with the spark-0.9.0-hadoop-2 binary 
distribution). For heaps larger than 3g, it won't even come into play.

We should also consider increasing the default spark.executor.memory beyond 
512m if we are going to reserve half of it for Spark itself.
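A hedged sketch of that accounting; {{spark.system.memoryReservedSize}} is only the name suggested above (not an existing property), and `conf` is assumed to be an in-scope SparkConf:

{code}
// Sketch only: reserve a fixed chunk of heap before applying the storage fraction.
val maxHeap  = Runtime.getRuntime.maxMemory        // bytes
val reserved = 300L * 1024 * 1024                  // the proposed 300m default
val usable   = math.max(maxHeap - reserved, 0L)
val storageFraction = conf.getDouble("spark.storage.memoryFraction", 0.6)
val storageBytes    = (usable * storageFraction).toLong
{code}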

> Local spark-shell Runs Out of Memory With Default Settings
> --
>
> Key: SPARK-1392
> URL: https://issues.apache.org/jira/browse/SPARK-1392
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
> Environment: OS X 10.9.2, Java 1.7.0_51, Scala 2.10.3
>Reporter: Pat McDonough
>
> Using the spark-0.9.0 Hadoop2 binary from the project download page, running 
> the spark-shell locally in out of the box configuration, and attempting to 
> cache all the attached data, spark OOMs with: java.lang.OutOfMemoryError: GC 
> overhead limit exceeded
> You can work around the issue by either decreasing 
> spark.storage.memoryFraction or increasing SPARK_MEM





[jira] [Created] (SPARK-1455) Determine which test suites to run based on code changes

2014-04-09 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1455:
--

 Summary: Determine which test suites to run based on code changes
 Key: SPARK-1455
 URL: https://issues.apache.org/jira/browse/SPARK-1455
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
 Fix For: 1.1.0


Right now we run the entire set of tests for every change. This means the tests 
take a long time. Our pull request builder checks out the merge branch from 
git, so we could do a diff and figure out what source files were changed, and 
run a more isolated set of tests. We should just run tests in a way that 
reflects the inter-dependencies of the project. E.g:

- If Spark core is modified, we should run all tests
- If just SQL is modified, we should run only the SQL tests
- If just Streaming is modified, we should run only the streaming tests
- If just Pyspark is modified, we only run the PySpark tests.

And so on. I think this would reduce the RTT of the tests a lot and it should 
be pretty easy to accomplish with some scripting foo.
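An illustrative sketch of that mapping (not an actual Spark script; the directory prefixes are assumptions about the repository layout):

{code}
// Sketch only: decide which test modules to run from a list of changed files.
def modulesToTest(changedFiles: Seq[String]): Set[String] = {
  val all = Set("core", "sql", "streaming", "pyspark")
  if (changedFiles.exists(_.startsWith("core/"))) all   // core changes: run everything
  else changedFiles.flatMap {
    case f if f.startsWith("sql/")       => Some("sql")
    case f if f.startsWith("streaming/") => Some("streaming")
    case f if f.startsWith("python/")    => Some("pyspark")
    case _                               => None
  }.toSet
}
{code}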





[jira] [Created] (SPARK-1456) Clean up use of Ordered/Ordering in OrderedRDDFunctions

2014-04-09 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-1456:
--

 Summary: Clean up use of Ordered/Ordering in OrderedRDDFunctions
 Key: SPARK-1456
 URL: https://issues.apache.org/jira/browse/SPARK-1456
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


Need to discuss more with Reynold, but we should clean this up in case we need 
to slightly change APIs.





[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib

2014-04-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964638#comment-13964638
 ] 

Sean Owen commented on SPARK-1406:
--

Yes I understand transformations can be described in PMML. Do you mean parsing 
a transformation described in PMML and implementing the transformation? Yes 
that goes hand in hand with supporting import of a model in general.

I would merely suggest this is a step that comes after several others in order 
of priority, like:
- implementing feature transformations in the abstract in the code base, 
separately from the idea of PMML
- implementing some form of model import via JPMML
- implementing more functionality in the Model classes, to give a reason to want 
to import an external model into MLlib

... and to me this is less useful at this point than export, too. I say this 
because the power of MLlib/Spark right now is perceived to be model building, 
making it more of a producer than a consumer at this stage.

> PMML model evaluation support via MLib
> --
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>
> It would be useful if spark would provide support the evaluation of PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would allow to use analytical models that were created with a 
> statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which 
> would perform the actual model evaluation for a given input tuple. The PMML 
> model would then just contain the "parameterization" of an analytical model.
> Other projects like JPMML-Evaluator do a similar thing.
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator





[jira] [Commented] (SPARK-1357) [MLLIB] Annotate developer and experimental API's

2014-04-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964649#comment-13964649
 ] 

Sean Owen commented on SPARK-1357:
--

Yeah I think it's reasonable to say that the core ALS API is only in terms of 
numeric IDs and leave a higher-level translation to the caller. Longs give that 
much more space to hash into.

The "cost" in terms of memory of something like a String is just a reference, 
so roughly the same as a Double anyway. I think the more important question is 
whether Double is too hacky API-wise as a representation of fundamentally 
non-numeric data. That's up for debate, but yeah the question here is more 
about reserving the right to change.

I'll submit a PR that marks the items I mention as experimental, for 
consideration. See if it seems reasonable.

> [MLLIB] Annotate developer and experimental API's
> -
>
> Key: SPARK-1357
> URL: https://issues.apache.org/jira/browse/SPARK-1357
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Patrick Wendell
>Assignee: Xiangrui Meng
> Fix For: 1.0.0
>
>






[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't

2014-04-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964658#comment-13964658
 ] 

Marcelo Vanzin commented on SPARK-1021:
---

I actually played with the idea and just turning the {{rangeBounds}} variable 
into a lazy one doesn't work. That causes the variable to be evaluated only when 
the transformation is executed on the worker nodes; at that point, you can't 
execute actions (which are needed to compute {{rangeBounds}}).

One way to work around this would be to have something be evaluated on the RDDs 
when the scheduler walks the graph before submitting jobs to the workers. I'm 
not aware of such functionality in the code, though. Or maybe there's something 
cleaner that can be done here?
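To make that concrete, a heavily simplified sketch (not Spark's RangePartitioner) of the change being debated: making the bounds lazy only defers the sampling job, and anything that touches the value too early, such as equals() or code already running on a worker, still runs into the problem described above.

{code}
// Simplified sketch only. computeBounds() stands in for the sampling that
// requires running a job on the RDD.
class SketchRangePartitioner(partitions: Int, computeBounds: () => Array[Double]) {
  // The naive change: defer the sampling job until rangeBounds is first read...
  lazy val rangeBounds: Array[Double] = computeBounds()

  // ...but anything that reads it too early -- equals(), or code already running
  // on a worker -- still forces (or cannot run) that job, which is the issue
  // described in the comments above.
  override def equals(other: Any): Boolean = other match {
    case o: SketchRangePartitioner => o.rangeBounds.sameElements(rangeBounds)
    case _                         => false
  }
}
{code}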

> sortByKey() launches a cluster job when it shouldn't
> 
>
> Key: SPARK-1021
> URL: https://issues.apache.org/jira/browse/SPARK-1021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Andrew Ash
>  Labels: starter
>
> The sortByKey() method is listed as a transformation, not an action, in the 
> documentation.  But it launches a cluster job regardless.
> http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
> Some discussion on the mailing list suggested that this is a problem with the 
> rdd.count() call inside Partitioner.scala's rangeBounds method.
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
> Josh Rosen suggests that rangeBounds should be made into a lazy variable:
> {quote}
> I wonder whether making RangePartitioner.rangeBounds into a lazy val would 
> fix this 
> (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
>   We'd need to make sure that rangeBounds() is never called before an action 
> is performed.  This could be tricky because it's called in the 
> RangePartitioner.equals() method.  Maybe it's sufficient to just compare the 
> number of partitions, the ids of the RDDs used to create the 
> RangePartitioner, and the sort ordering.  This still supports the case where 
> I range-partition one RDD and pass the same partitioner to a different RDD.  
> It breaks support for the case where two range partitioners created on 
> different RDDs happened to have the same rangeBounds(), but it seems unlikely 
> that this would really harm performance since it's probably unlikely that the 
> range partitioners are equal by chance.
> {quote}
> Can we please make this happen?  I'll send a PR on GitHub to start the 
> discussion and testing.





[jira] [Created] (SPARK-1457) Change APIs for training algorithms to take optimizer as parameter

2014-04-09 Thread DB Tsai (JIRA)
DB Tsai created SPARK-1457:
--

 Summary: Change APIs for training algorithms to take optimizer as 
parameter 
 Key: SPARK-1457
 URL: https://issues.apache.org/jira/browse/SPARK-1457
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: DB Tsai


Currently, the training API has signatures like LogisticRegressionWithSGD. 

If we want to use another optimizer, we have two options: either add a new API 
like LogisticRegressionWithNewOptimizer, which causes 99% code duplication, or 
refactor the API to take the optimizer as a parameter, like the following: 

class LogisticRegression private (
var optimizer: Optimizer)
  extends GeneralizedLinearAlgorithm[LogisticRegressionModel]
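A self-contained sketch of the proposed shape; every name below is an illustrative stand-in rather than an MLlib class:

{code}
// Sketch only: inject the optimizer via the constructor so a new optimizer does
// not require a new *WithNewOptimizer training class.
trait Optimizer { def optimize(data: Seq[Double], initialWeight: Double): Double }

object ToyGradientDescent extends Optimizer {
  def optimize(data: Seq[Double], initialWeight: Double): Double =
    initialWeight - 0.01 * data.sum    // placeholder update, not a real algorithm
}

class LogisticRegressionSketch(val optimizer: Optimizer) {
  def train(data: Seq[Double]): Double = optimizer.optimize(data, initialWeight = 0.0)
}

val weight = new LogisticRegressionSketch(ToyGradientDescent).train(Seq(1.0, 2.0, 3.0))
{code}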







[jira] [Created] (SPARK-1458) Add programmatic way to determine Spark version

2014-04-09 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-1458:
---

 Summary: Add programmatic way to determine Spark version
 Key: SPARK-1458
 URL: https://issues.apache.org/jira/browse/SPARK-1458
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Spark Core
Affects Versions: 0.9.0
Reporter: Nicholas Chammas
Priority: Minor


As discussed 
[here|http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html],
 I think it would be nice if there was a way to programmatically determine what 
version of Spark you are running. 

The potential use cases are not that important, but they include:
# Branching your code based on what version of Spark is running.
# Checking your version without having to quit and restart the Spark shell.

Right now in PySpark, I believe the only way to determine your version is by 
firing up the Spark shell and looking at the startup banner.
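One possible shape for such an API, sketched below; nothing like this existed in 0.9, so treat it as an illustration of the request rather than existing behavior (it assumes a SparkContext named `sc`, e.g. in the shell):

{code}
// Illustrative sketch of the requested capability: ask the running context.
println(s"Running Spark ${sc.version}")

// Branching on it, per use case 1 above:
if (sc.version.startsWith("1.")) println("use new code path") else println("fall back")
{code}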





[jira] [Updated] (SPARK-1458) Add programmatic way to determine Spark version

2014-04-09 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-1458:


Description: 
As discussed 
[here|http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html],
 I think it would be nice if there was a way to programmatically determine what 
version of Spark you are running. 

The potential use cases are not that important, but they include:
# Branching your code based on what version of Spark is running.
# Checking your version without having to quit and restart the Spark shell.

Right now in PySpark, I believe the only way to determine your version is by 
firing up the Spark shell and looking at the startup banner.

  was:
As discussed 
[here](http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html),
 I think it would be nice if there was a way to programmatically determine what 
version of Spark you are running. 

The potential use cases are not that important, but they include:
# Branching your code based on what version of Spark is running.
# Checking your version without having to quit and restart the Spark shell.

Right now in PySpark, I believe the only way to determine your version is by 
firing up the Spark shell and looking at the startup banner.


> Add programmatic way to determine Spark version
> ---
>
> Key: SPARK-1458
> URL: https://issues.apache.org/jira/browse/SPARK-1458
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 0.9.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> As discussed 
> [here|http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html],
>  I think it would be nice if there was a way to programmatically determine 
> what version of Spark you are running. 
> The potential use cases are not that important, but they include:
> # Branching your code based on what version of Spark is running.
> # Checking your version without having to quit and restart the Spark shell.
> Right now in PySpark, I believe the only way to determine your version is by 
> firing up the Spark shell and looking at the startup banner.





[jira] [Updated] (SPARK-1458) Add programmatic way to determine Spark version

2014-04-09 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-1458:


Description: 
As discussed 
[here](http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html),
 I think it would be nice if there was a way to programmatically determine what 
version of Spark you are running. 

The potential use cases are not that important, but they include:
# Branching your code based on what version of Spark is running.
# Checking your version without having to quit and restart the Spark shell.

Right now in PySpark, I believe the only way to determine your version is by 
firing up the Spark shell and looking at the startup banner.

  was:
As discussed 
[here|http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html],
 I think it would be nice if there was a way to programmatically determine what 
version of Spark you are running. 

The potential use cases are not that important, but they include:
# Branching your code based on what version of Spark is running.
# Checking your version without having to quit and restart the Spark shell.

Right now in PySpark, I believe the only way to determine your version is by 
firing up the Spark shell and looking at the startup banner.


> Add programmatic way to determine Spark version
> ---
>
> Key: SPARK-1458
> URL: https://issues.apache.org/jira/browse/SPARK-1458
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Affects Versions: 0.9.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> As discussed 
> [here](http://apache-spark-user-list.1001560.n3.nabble.com/programmatic-way-to-tell-Spark-version-td1929.html),
>  I think it would be nice if there was a way to programmatically determine 
> what version of Spark you are running. 
> The potential use cases are not that important, but they include:
> # Branching your code based on what version of Spark is running.
> # Checking your version without having to quit and restart the Spark shell.
> Right now in PySpark, I believe the only way to determine your version is by 
> firing up the Spark shell and looking at the startup banner.





[jira] [Updated] (SPARK-1424) InsertInto should work on JavaSchemaRDD as well.

2014-04-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1424:


Fix Version/s: 1.0.0

> InsertInto should work on JavaSchemaRDD as well.
> 
>
> Key: SPARK-1424
> URL: https://issues.apache.org/jira/browse/SPARK-1424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.0.0
>
>






[jira] [Created] (SPARK-1459) EventLoggingListener does not work with "file://" target dir

2014-04-09 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-1459:
-

 Summary: EventLoggingListener does not work with "file://" target 
dir
 Key: SPARK-1459
 URL: https://issues.apache.org/jira/browse/SPARK-1459
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Marcelo Vanzin


Bug is simple; FileLogger tries to pass a URL to FileOutputStream's 
constructor, and that fails. I'll upload a PR soon.
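A small illustration of the failure mode described above (paths are examples, and this is not the FileLogger code): FileOutputStream treats its String argument as a literal filesystem path, so a "file://" URL needs to go through java.net.URI first.

{code}
import java.io.{File, FileOutputStream}
import java.net.URI

val target = "file:///tmp/app.log"   // example URL; assumes /tmp exists

// new FileOutputStream(target)      // fails: "file:" is treated as part of the path
val out = new FileOutputStream(new File(new URI(target)))   // writes to /tmp/app.log
out.close()
{code}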





[jira] [Commented] (SPARK-1459) EventLoggingListener does not work with "file://" target dir

2014-04-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964783#comment-13964783
 ] 

Marcelo Vanzin commented on SPARK-1459:
---

PR #375

> EventLoggingListener does not work with "file://" target dir
> 
>
> Key: SPARK-1459
> URL: https://issues.apache.org/jira/browse/SPARK-1459
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Marcelo Vanzin
>
> Bug is simple; FileLogger tries to pass a URL to FileOutputStream's 
> constructor, and that fails. I'll upload a PR soon.





[jira] [Comment Edited] (SPARK-1459) EventLoggingListener does not work with "file://" target dir

2014-04-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964783#comment-13964783
 ] 

Marcelo Vanzin edited comment on SPARK-1459 at 4/9/14 11:06 PM:


PR #375 (https://github.com/apache/spark/pull/375)


was (Author: vanzin):
PR #375

> EventLoggingListener does not work with "file://" target dir
> 
>
> Key: SPARK-1459
> URL: https://issues.apache.org/jira/browse/SPARK-1459
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Marcelo Vanzin
>
> Bug is simple; FileLogger tries to pass a URL to FileOutputStream's 
> constructor, and that fails. I'll upload a PR soon.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1455) Determine which test suites to run based on code changes

2014-04-09 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964791#comment-13964791
 ] 

Michael Armbrust commented on SPARK-1455:
-

This is a great idea.  One possible modification: if there are changes in Spark 
Core but not in Spark SQL, it is probably safe to skip some of the more 
expensive test cases, specifically all of the Hive compatibility query tests. 
Since all of the query operators just use mapPartitions, I think the remaining 
query tests would still catch changes to Spark Core that would break Spark SQL.
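A rough sketch of the kind of mapping this could use (the path prefixes and module names are illustrative assumptions, not the actual build scripts):

{code:scala}
// Map the files touched by a pull request to the test modules that need to run.
def modulesToTest(changedFiles: Seq[String]): Set[String] = {
  val everything = Set("core", "sql", "streaming", "pyspark")
  if (changedFiles.exists(_.startsWith("core/"))) {
    everything                                    // core changes: run the full suite
  } else {
    changedFiles.flatMap {
      case f if f.startsWith("sql/")       => Some("sql")
      case f if f.startsWith("streaming/") => Some("streaming")
      case f if f.startsWith("python/")    => Some("pyspark")
      case _                               => None
    }.toSet
  }
}

// e.g. modulesToTest(Seq("sql/core/src/main/scala/Foo.scala")) == Set("sql")
{code}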

> Determine which test suites to run based on code changes
> 
>
> Key: SPARK-1455
> URL: https://issues.apache.org/jira/browse/SPARK-1455
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
> Fix For: 1.1.0
>
>
> Right now we run the entire set of tests for every change. This means the 
> tests take a long time. Our pull request builder checks out the merge branch 
> from git, so we could do a diff and figure out what source files were 
> changed, and run a more isolated set of tests. We should just run tests in a 
> way that reflects the inter-dependencies of the project. E.g:
> - If Spark core is modified, we should run all tests
> - If just SQL is modified, we should run only the SQL tests
> - If just Streaming is modified, we should run only the streaming tests
> - If just Pyspark is modified, we only run the PySpark tests.
> And so on. I think this would reduce the RTT of the tests a lot and it should 
> be pretty easy to accomplish with some scripting foo.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1460) Set operations on SchemaRDDs are needlessly destructive of schema information.

2014-04-09 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-1460:
---

 Summary: Set operations on SchemaRDDs are needlessly destructive 
of schema information.
 Key: SPARK-1460
 URL: https://issues.apache.org/jira/browse/SPARK-1460
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
 Fix For: 1.1.0


When you do a distinct or a subtract, you get back a plain RDD instead of a 
SchemaRDD, even though the schema is unchanged.
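A minimal sketch of the symptom, assuming a spark-shell session with a SQLContext named sqlContext and a registered table named "people" (both are assumptions for illustration):

{code:scala}
val people = sqlContext.sql("SELECT name, age FROM people")   // SchemaRDD
val unique = people.distinct()   // comes back as a plain RDD[Row]; the schema is lost,
                                 // so the result cannot be used in further SQL queries
{code}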



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1293) Support for reading/writing complex types in Parquet

2014-04-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1293:


Fix Version/s: (was: 1.0.0)
   1.1.0

> Support for reading/writing complex types in Parquet
> 
>
> Key: SPARK-1293
> URL: https://issues.apache.org/jira/browse/SPARK-1293
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Andre Schumacher
> Fix For: 1.1.0
>
>
> Complex types include: Arrays, Maps, and Nested rows (structs).
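For illustration, a hedged sketch of the kind of nested record this covers (the field names are made up):

{code:scala}
case class Address(city: String, zip: String)          // nested row (struct)
case class Person(name: String,
                  tags: Seq[String],                   // array
                  attrs: Map[String, String],          // map
                  home: Address)                       // struct field
{code}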



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1407) EventLogging to HDFS doesn't work properly on yarn

2014-04-09 Thread Kan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964867#comment-13964867
 ] 

Kan Zhang commented on SPARK-1407:
--

Here is one example of the exception I encountered. Note that the exact event 
being logged (onJobEnd in this case) may differ.

Exception in thread "SparkListenerBus" java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:702)
at 
org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1832)
at 
org.apache.hadoop.hdfs.DFSOutputStream.hsync(DFSOutputStream.java:1815)
at 
org.apache.hadoop.hdfs.DFSOutputStream.hsync(DFSOutputStream.java:1798)
at 
org.apache.hadoop.fs.FSDataOutputStream.hsync(FSDataOutputStream.java:123)
at 
org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:138)
at 
org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:138)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.util.FileLogger.flush(FileLogger.scala:138)
at 
org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:64)
at 
org.apache.spark.scheduler.EventLoggingListener.onJobEnd(EventLoggingListener.scala:86)
at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$4.apply(SparkListenerBus.scala:49)
at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$4.apply(SparkListenerBus.scala:49)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:49)
at 
org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:31)
at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:61)


> EventLogging to HDFS doesn't work properly on yarn
> --
>
> Key: SPARK-1407
> URL: https://issues.apache.org/jira/browse/SPARK-1407
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Blocker
>
> When running Spark on YARN and accessing an HDFS file (like in the 
> SparkHdfsLR example) while event logging is configured to write logs to 
> HDFS, an exception is thrown at the end of the application.
> SPARK_JAVA_OPTS=-Dspark.eventLog.enabled=true -Dspark.eventLog.dir=hdfs:///history/spark/
> 14/04/03 13:41:31 INFO yarn.ApplicationMaster$$anon$1: Invoking sc stop from shutdown hook
> Exception in thread "Thread-41" java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:398)
> at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1465)
> at org.apache.hadoop.hdfs.DFSOutputStream.sync(DFSOutputStream.java:1450)
> at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:116)
> at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:137)
> at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:137)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.util.FileLogger.flush(FileLogger.scala:137)
> at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:69)
> at org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:101)
> at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$13.apply(SparkListenerBus.scala:67)
> at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$13.apply(SparkListenerBus.scala:67)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:67)
> at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:31)
> at org.apache.spark.scheduler.LiveListenerBus.post(LiveListenerBus.scala:78)
> at org.apache.spark.SparkContext.postApplicationEnd(SparkContext.scala:1081)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:828)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:460)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1461) Support Short-circuit Expression Evaluation

2014-04-09 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-1461:


 Summary: Support Short-circuit Expression Evaluation
 Key: SPARK-1461
 URL: https://issues.apache.org/jira/browse/SPARK-1461
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
 Fix For: 1.1.0


Short-circuit expression evaluation can improve performance significantly.

For example, in the expression (a > b) && (c > d), if a > b evaluates to false 
or null, the value of the whole expression is already false or null, without 
needing to evaluate the sub-expression c > d.

However, if c or d contains a stateful UDF (for example, the UDF row_sequence) 
as a child expression, the stateful expression currently still has to be 
evaluated, even though its result is ignored.
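A minimal sketch (not the actual Catalyst code) of short-circuiting a SQL-style three-valued AND, with None standing in for null: a false left operand decides the result immediately, so the right operand, stateful UDF or not, is never evaluated.

{code:scala}
def evalAnd(left: () => Option[Boolean], right: () => Option[Boolean]): Option[Boolean] =
  left() match {
    case Some(false) => Some(false)      // short-circuit: right() is never called
    case l =>
      right() match {
        case Some(false) => Some(false)  // false wins regardless of the left value
        case Some(true)  => l            // true && true = true, null && true = null
        case None        => None         // (true or null) && null = null
      }
  }
{code}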



--
This message was sent by Atlassian JIRA
(v6.2#6252)