[GitHub] spark pull request: remove not used test in src/main

2014-07-14 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1397#issuecomment-48869513
  
That one already has a main method. Perhaps best to leave this here since 
it just starts a connection manager.




[GitHub] spark pull request: remove not used test in src/main

2014-07-14 Thread adrian-wang
Github user adrian-wang commented on the pull request:

https://github.com/apache/spark/pull/1397#issuecomment-48869638
  
OK, I'll close this. Thank you Reynold!




[GitHub] spark pull request: remove not used test in src/main

2014-07-14 Thread adrian-wang
Github user adrian-wang closed the pull request at:

https://github.com/apache/spark/pull/1397




[GitHub] spark pull request: [SPARK-2467] Revert SparkBuild to publish-loca...

2014-07-14 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/1398

[SPARK-2467] Revert SparkBuild to publish-local to both .m2 and .ivy2.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-2467

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1398.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1398


commit 7f01d587259221c5333880578e6658e6534c52c4
Author: Takuya UESHIN ues...@happy-camper.st
Date:   2014-07-14T06:28:44Z

Revert SparkBuild to publish-local to both .m2 and .ivy2.






[GitHub] spark pull request: [SPARK-2410][SQL] Cherry picked Hive Thrift/JD...

2014-07-14 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/1399

[SPARK-2410][SQL] Cherry picked Hive Thrift/JDBC server

JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Cherry picked the Hive Thrift/JDBC server from 
[branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc). 

(Thanks @chenghao-intel for his initial contribution of the Spark SQL CLI.)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark thriftserver

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1399.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1399


commit b108e5035a68acfe8b5914468e64ddc7f392697a
Author: Cheng Lian lian.cs@gmail.com
Date:   2014-07-14T05:22:25Z

Cherry picked the Hive Thrift server






[GitHub] spark pull request: [SPARK-1946] Submit tasks after (configured ra...

2014-07-14 Thread li-zhihui
Github user li-zhihui commented on the pull request:

https://github.com/apache/spark/pull/900#issuecomment-48869936
  
@tgravescs add a commit according to comments.




[GitHub] spark pull request: [SPARK-2410][SQL] Cherry picked Hive Thrift/JD...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1399#issuecomment-48869960
  
QA tests have started for PR 1399. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16615/consoleFull




[GitHub] spark pull request: [SPARK-2467] Revert SparkBuild to publish-loca...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1398#issuecomment-48869958
  
QA tests have started for PR 1398. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16616/consoleFull




[GitHub] spark pull request: [SPARK-2410][SQL] Cherry picked Hive Thrift/JD...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1399#issuecomment-48869966
  
QA results for PR 1399:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class HiveThriftServer2(hiveContext: HiveContext)
class SparkSQLCLIDriver extends CliDriver with Logging {
class SparkSQLCLIService(hiveContext: HiveContext)
class SparkSQLDriver(val context: HiveContext = SparkSQLEnv.hiveContext)
class SparkSQLSessionManager(hiveContext: HiveContext)
class SparkSQLOperationManager(hiveContext: HiveContext) extends OperationManager with Logging {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16615/consoleFull




[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...

2014-07-14 Thread scwf
Github user scwf commented on a diff in the pull request:

https://github.com/apache/spark/pull/1385#discussion_r14866324
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -552,17 +552,10 @@ class SparkContext(config: SparkConf) extends Logging {
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions
      ): RDD[(K, V)] = {
-    // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
-    val confBroadcast = broadcast(new SerializableWritable(hadoopConfiguration))
-    val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
--- End diff --

There is no other reason to broadcast the Configuration. I think it is because the old API here instantiated HadoopRDD directly, and HadoopRDD's constructor requires broadcastedConf: Broadcast[SerializableWritable[Configuration]], so the Configuration was broadcast here.
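
For reference, here is the constructor shape in question, inferred from the call sites quoted elsewhere in this thread (a sketch; parameter names are illustrative, not authoritative):

```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.{InputFormat, JobConf}
import org.apache.spark.{SerializableWritable, SparkContext}
import org.apache.spark.broadcast.Broadcast

// The Broadcast parameter is what forces every caller to broadcast
// the Configuration up front, whether or not that buys anything.
class HadoopRDD[K, V](
    sc: SparkContext,
    broadcastedConf: Broadcast[SerializableWritable[Configuration]],
    initLocalJobConfFuncOpt: Option[JobConf => Unit],
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int)
```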




[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...

2014-07-14 Thread scwf
Github user scwf commented on a diff in the pull request:

https://github.com/apache/spark/pull/1385#discussion_r14866367
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -128,25 +123,13 @@ class HadoopRDD[K, V](

   // Returns a JobConf that will be used on slaves to obtain input splits for Hadoop reads.
   protected def getJobConf(): JobConf = {
-    val conf: Configuration = broadcastedConf.value.value
-    if (conf.isInstanceOf[JobConf]) {
-      // A user-broadcasted JobConf was provided to the HadoopRDD, so always use it.
-      conf.asInstanceOf[JobConf]
-    } else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
+    val conf: JobConf = broadcastedConf.value.value
+    if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
--- End diff --

Yeah, there is no need to cache the JobConf if it is already broadcast.




[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...

2014-07-14 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1385#discussion_r14866389
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -552,17 +552,10 @@ class SparkContext(config: SparkConf) extends Logging {
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions
      ): RDD[(K, V)] = {
-    // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
-    val confBroadcast = broadcast(new SerializableWritable(hadoopConfiguration))
-    val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
--- End diff --

There is. See my comment below. 




[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-14 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1259#issuecomment-48871282
  
@andrewor14 the task id is not logged twice. One is the TID, which is used to correlate with the executors (because executors only know about TIDs), and the other is the task index, i.e. the index of the task within its stage.
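
For illustration, the two identifiers might appear together in a log line like this (a hypothetical format, not necessarily the patch's exact output):

```
// taskIndex: position of the task within its stage.
// tid: cluster-wide task ID that executors report back.
logInfo(s"Starting task $taskIndex in stage $stageId (TID $tid)")
```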




[GitHub] spark pull request: [SPARK-2410][SQL] Cherry picked Hive Thrift/JD...

2014-07-14 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1399#issuecomment-48871338
  
Hey @pwendell, would you mind helping review my changes to the POM and SBT files? I'm not 100% sure about their correctness. Thanks!




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-14 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/1056#discussion_r14866901
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -37,8 +36,15 @@ import org.apache.spark._
 import org.apache.spark.executor.TaskMetrics
 import org.apache.spark.partial.{ApproximateActionListener, ApproximateEvaluator, PartialResult}
 import org.apache.spark.rdd.RDD
-import org.apache.spark.storage.{BlockId, BlockManager, BlockManagerMaster, RDDBlockId}
-import org.apache.spark.util.{CallSite, SystemClock, Clock, Utils}
+import org.apache.spark.storage._
+import org.apache.spark.util.{SystemClock, Clock, Utils}
+import scala.Some
+import org.apache.spark.FetchFailed
+import org.apache.spark.storage.RDDBlockId
+import org.apache.spark.util.CallSite
+import akka.actor.OneForOneStrategy
+import org.apache.spark.ExceptionFailure
+import org.apache.spark.storage.BlockManagerMessages.BlockManagerHeartbeat
--- End diff --

Oops, my IDE is doing this for some reason.




[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...

2014-07-14 Thread scwf
Github user scwf commented on a diff in the pull request:

https://github.com/apache/spark/pull/1385#discussion_r14866988
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -206,17 +202,10 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon
     tableDesc: TableDesc,
     path: String,
     inputFormatClass: Class[InputFormat[Writable, Writable]]): RDD[Writable] = {
-
-    val initializeJobConfFunc = HadoopTableReader.initializeLocalJobConfFunc(path, tableDesc) _
-
-    val rdd = new HadoopRDD(
-      sc.sparkContext,
-      _broadcastedHiveConf.asInstanceOf[Broadcast[SerializableWritable[Configuration]]],
-      Some(initializeJobConfFunc),
-      inputFormatClass,
-      classOf[Writable],
-      classOf[Writable],
-      _minSplitsPerRDD)
+    val jobConf = new JobConf(_broadcastedHiveConf.value.value.asInstanceOf[Configuration])
--- End diff --

My implementation broadcasts the conf for each HadoopRDD. Do you mean we should broadcast it just once for all HadoopRDDs, right?




[GitHub] spark pull request: Spark REEF Module

2014-07-14 Thread deadmau
GitHub user deadmau opened a pull request:

https://github.com/apache/spark/pull/1400

Spark REEF Module

Added reef module that allows Spark to run on top of 
[REEF](http://www.reef-project.org).

Modified:
- SparkContext.scala, pom.xml
Added:
- reef Module

For REEF side module, see this 
[link](https://github.com/cmssnu/spark_on_reef/tree/reef-0.1).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cmssnu/spark reef-0.1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1400.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1400


commit 7515367e361c910ea81bae65e42e32a5a6763a5e
Author: Takuya UESHIN ues...@happy-camper.st
Date:   2014-05-15T18:20:21Z

[SPARK-1845] [SQL] Use AllScalaRegistrar for SparkSqlSerializer to register 
serializers of ...

...Scala collections.

When I execute `orderBy` or `limit` for `SchemaRDD` including `ArrayType` 
or `MapType`, `SparkSqlSerializer` throws the following exception:

```
com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.$colon$colon
```

or

```
com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.Vector
```

or

```
com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): scala.collection.immutable.HashMap$HashTrieMap
```

and so on.

This is because registrations of serializers for each concrete collection class are missing in `SparkSqlSerializer`.
I believe it should use `AllScalaRegistrar`.
`AllScalaRegistrar` covers a lot of serializers for concrete classes of 
`Seq`, `Map` for `ArrayType`, `MapType`.

Author: Takuya UESHIN ues...@happy-camper.st

Closes #790 from ueshin/issues/SPARK-1845 and squashes the following 
commits:

d1ed992 [Takuya UESHIN] Use AllScalaRegistrar for SparkSqlSerializer to 
register serializers of Scala collections.

(cherry picked from commit db8cc6f28abe4326cea6f53feb604920e4867a27)
Signed-off-by: Reynold Xin r...@apache.org

commit f9eeddccbd42064f5d1234b323ac74bb2a39e0aa
Author: Takuya UESHIN ues...@happy-camper.st
Date:   2014-05-15T18:21:33Z

[SPARK-1819] [SQL] Fix GetField.nullable.

`GetField.nullable` should be `true` not only when `field.nullable` is 
`true` but also when `child.nullable` is `true`.

Author: Takuya UESHIN ues...@happy-camper.st

Closes #757 from ueshin/issues/SPARK-1819 and squashes the following 
commits:

8781a11 [Takuya UESHIN] Modify a test to use named parameters.
5bfc77d [Takuya UESHIN] Fix GetField.nullable.

(cherry picked from commit 94c9d6f59859ebc77fae112c2c42c64b7a4d7f83)
Signed-off-by: Reynold Xin r...@apache.org

commit bc9a96e2e97d4a9b4a2075fb026be320b96bd08b
Author: Xiangrui Meng m...@databricks.com
Date:   2014-05-15T18:59:59Z

[SPARK-1741][MLLIB] add predict(JavaRDD) to RegressionModel, 
ClassificationModel, and KMeans

`model.predict` returns an RDD of a Scala primitive type (Int/Double), which is recognized as Object in Java. Adding predict(JavaRDD) could make life easier for Java users.

Added tests for KMeans, LinearRegression, and NaiveBayes.

Will update examples after https://github.com/apache/spark/pull/653 gets 
merged.

cc: @srowen

Author: Xiangrui Meng m...@databricks.com

Closes #670 from mengxr/predict-javardd and squashes the following commits:

b77ccd8 [Xiangrui Meng] Merge branch 'master' into predict-javardd
43caac9 [Xiangrui Meng] add predict(JavaRDD) to RegressionModel, 
ClassificationModel, and KMeans
(cherry picked from commit d52761d67f42ad4d2ff02d96f0675fb3ab709f38)

Signed-off-by: Patrick Wendell pwend...@gmail.com

commit 35870574a6e33a39c139139c8739a82796af5ebb
Author: Sandy Ryza sa...@cloudera.com
Date:   2014-05-15T23:35:39Z

SPARK-1851. Upgrade Avro dependency to 1.7.6 so Spark can read Avro file...

...s

Author: Sandy Ryza sa...@cloudera.com

Closes #795 from sryza/sandy-spark-1851 and squashes the following commits:

79c8227 [Sandy Ryza] SPARK-1851. Upgrade Avro dependency to 1.7.6 so Spark 
can read Avro files
(cherry picked from commit 08e7606a964e3d1ac1d565f33651ff0035c75044)

Signed-off-by: Patrick Wendell pwend...@gmail.com

commit 22f261a1a3efbd466ca0588cc77beb92fb14b6a3
Author: Stevo Slavić ssla...@gmail.com
Date:   2014-05-15T23:44:14Z

SPARK-1803 Replaced colon in filenames with a dash

This patch replaces colon in several filenames with dash to make these 
filenames Windows compatible.


[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-14 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/1056#discussion_r14867043
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala ---
@@ -52,25 +52,24 @@ class BlockManagerMasterActor(val isLocal: Boolean, conf: SparkConf, listenerBus

   private val akkaTimeout = AkkaUtils.askTimeout(conf)

-  val slaveTimeout = conf.get("spark.storage.blockManagerSlaveTimeoutMs",
-    "" + (BlockManager.getHeartBeatFrequency(conf) * 3)).toLong
+  val slaveTimeout = conf.getLong("spark.storage.blockManagerSlaveTimeoutMs",
+    (conf.getInt("spark.executor.heartbeatInterval", 2000) * 3))
--- End diff --

Any reason that would be better?




[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...

2014-07-14 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1385#discussion_r14867102
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -206,17 +202,10 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon
     tableDesc: TableDesc,
     path: String,
     inputFormatClass: Class[InputFormat[Writable, Writable]]): RDD[Writable] = {
-
-    val initializeJobConfFunc = HadoopTableReader.initializeLocalJobConfFunc(path, tableDesc) _
-
-    val rdd = new HadoopRDD(
-      sc.sparkContext,
-      _broadcastedHiveConf.asInstanceOf[Broadcast[SerializableWritable[Configuration]]],
-      Some(initializeJobConfFunc),
-      inputFormatClass,
-      classOf[Writable],
-      classOf[Writable],
-      _minSplitsPerRDD)
+    val jobConf = new JobConf(_broadcastedHiveConf.value.value.asInstanceOf[Configuration])
--- End diff --

Yes. With our current implementation, each Hive partition in Spark SQL creates one HadoopRDD. We absolutely cannot afford to broadcast the conf for each HadoopRDD/Hive partition.
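
A minimal sketch of the broadcast-once pattern being suggested, assuming the HadoopRDD constructor shape shown in the diffs above (the helper and variable names here are hypothetical):

```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Writable
import org.apache.hadoop.mapred.{FileInputFormat, InputFormat, JobConf}
import org.apache.spark.{SerializableWritable, SparkContext}
import org.apache.spark.rdd.{HadoopRDD, RDD}

// Broadcast the Hive/Hadoop conf once, then share that single broadcast
// across the HadoopRDD built for every Hive partition.
def makePartitionRDDs(
    sc: SparkContext,
    hiveConf: Configuration,
    partitionPaths: Seq[String],
    inputFormatClass: Class[InputFormat[Writable, Writable]],
    minSplits: Int): Seq[RDD[(Writable, Writable)]] = {
  val broadcastedConf = sc.broadcast(new SerializableWritable(hiveConf))
  partitionPaths.map { path =>
    val initJobConf = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
    new HadoopRDD(
      sc, broadcastedConf, Some(initJobConf),
      inputFormatClass, classOf[Writable], classOf[Writable], minSplits)
  }
}
```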




[GitHub] spark pull request: Spark REEF Module

2014-07-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1400#issuecomment-48872835
  
Can one of the admins verify this patch?




[GitHub] spark pull request: Spark REEF Module

2014-07-14 Thread deadmau
Github user deadmau closed the pull request at:

https://github.com/apache/spark/pull/1400




[GitHub] spark pull request: Made rdd.py pep8 complaint by using Autopep8 a...

2014-07-14 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1354#issuecomment-48872919
  
Thanks. Merging this in master.




[GitHub] spark pull request: Made rdd.py pep8 complaint by using Autopep8 a...

2014-07-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1354




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-14 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/1056#discussion_r14867152
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala ---
@@ -129,7 +128,7 @@ class BlockManagerMasterActor(val isLocal: Boolean, conf: SparkConf, listenerBus
     case ExpireDeadHosts =>
       expireDeadHosts()

-    case HeartBeat(blockManagerId) =>
+    case BlockManagerHeartbeat(blockManagerId) =>
--- End diff --

The origin of this is still the executors. The same two purposes remain: let the BlockManager master know that the executor is still alive, and find out whether the executor's BlockManager should re-register. It just gets passed along inside the generic executor-to-driver heartbeat instead of as its own message.




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-14 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/1056#discussion_r14867310
  
--- Diff: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import akka.actor.Actor
+import org.apache.spark.executor.TaskMetrics
+import org.apache.spark.storage.BlockManagerId
+import org.apache.spark.scheduler.TaskScheduler
+
+case class Heartbeat(executorId: String, taskMetrics: Array[(Long, TaskMetrics)],
+  blockManagerId: BlockManagerId) extends Serializable
+
+class HeartbeatReceiver(scheduler: TaskScheduler) extends Actor {
+  override def receive = {
+    case Heartbeat(executorId, taskMetrics, blockManagerId) =>
+      sender ! scheduler.executorHeartbeat(executorId, taskMetrics, blockManagerId)
--- End diff --

This is my first time dealing with Akka, so let me know if I'm missing 
something, but I went this route because the BlockManagerMasterActor needs to 
receive the event, and nobody is able to get a direct reference to it.  It's 
only accessible through an ActorRef.  My understanding of the Actor pattern is 
that Actors should only be communicated with via messages.
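
For what it's worth, a minimal sketch of talking to such an actor through its ActorRef with Akka's ask pattern (names mirror the diff above; treating the reply as a Boolean is an assumption based on what executorHeartbeat appears to return):

```
import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

// Ask the HeartbeatReceiver via its ActorRef and block for the reply
// that the actor sends back with `sender ! ...`.
def sendHeartbeat(receiverRef: ActorRef, heartbeat: Heartbeat): Boolean = {
  implicit val timeout = Timeout(30.seconds)
  Await.result(receiverRef ? heartbeat, timeout.duration).asInstanceOf[Boolean]
}
```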




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-14 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-48873857
  
Thanks for the feedback @andrewor14 .  Will upmerge and upload a patch with 
your changes, except for where I had questions above.

Do you know what the easiest way to check the size of an Akka message is?  
Any pointers on where to look for how to deal with the Akka frame size issue?
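
One rough way to check a message's size is to serialize it yourself and count the bytes (a sketch; Akka's actual on-wire size may differ, and the relevant cap is the spark.akka.frameSize setting):

```
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Java-serialize the message and measure the byte count; this only
// approximates what Akka puts on the wire.
def estimateMessageSize(msg: AnyRef): Int = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  try oos.writeObject(msg) finally oos.close()
  bos.size()
}
```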




[GitHub] spark pull request: SPARK-2277: make TaskScheduler track hosts on ...

2014-07-14 Thread lirui-intel
Github user lirui-intel commented on the pull request:

https://github.com/apache/spark/pull/1212#issuecomment-48873878
  
Hi @mridulm, I've added some test cases to capture the scheduling behavior of RACK_LOCAL tasks.
Let me know if I got anything wrong.




[GitHub] spark pull request: [SPARK-2467] Revert SparkBuild to publish-loca...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1398#issuecomment-48875578
  
QA results for PR 1398:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16616/consoleFull




[GitHub] spark pull request: [SPARK-2460] Optimize SparkContext.hadoopFile ...

2014-07-14 Thread scwf
Github user scwf commented on a diff in the pull request:

https://github.com/apache/spark/pull/1385#discussion_r14868164
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -206,17 +202,10 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon
     tableDesc: TableDesc,
     path: String,
     inputFormatClass: Class[InputFormat[Writable, Writable]]): RDD[Writable] = {
-
-    val initializeJobConfFunc = HadoopTableReader.initializeLocalJobConfFunc(path, tableDesc) _
-
-    val rdd = new HadoopRDD(
-      sc.sparkContext,
-      _broadcastedHiveConf.asInstanceOf[Broadcast[SerializableWritable[Configuration]]],
-      Some(initializeJobConfFunc),
-      inputFormatClass,
-      classOf[Writable],
-      classOf[Writable],
-      _minSplitsPerRDD)
+    val jobConf = new JobConf(_broadcastedHiveConf.value.value.asInstanceOf[Configuration])
--- End diff --

Got it.




[GitHub] spark pull request: move some test file to match src code

2014-07-14 Thread adrian-wang
GitHub user adrian-wang opened a pull request:

https://github.com/apache/spark/pull/1401

move some test file to match src code

Just move some test suite to corresponding package

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/adrian-wang/spark movetestfiles

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1401.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1401


commit d1a680307b690b8658541204d5c9ed5f9abbee07
Author: Daoyuan daoyuan.w...@intel.com
Date:   2014-07-14T08:25:41Z

move some test file to match src code






[GitHub] spark pull request: move some test file to match src code

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1401#issuecomment-48876149
  
QA tests have started for PR 1401. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16617/consoleFull




[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t

2014-07-14 Thread mengxr
GitHub user mengxr opened a pull request:

https://github.com/apache/spark/pull/1402

[SPARK-2471] remove runtime scope for jets3t

The assembly jar doesn't include jets3t if we set it to runtime only, but I 
don't know whether it was set this way for a particular reason.

CC: @srowen @ScrapCodes

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mengxr/spark jets3t

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1402.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1402


commit bfa2d175aedb5db21901d0a71bd5dcf3b9ea0903
Author: Xiangrui Meng m...@databricks.com
Date:   2014-07-14T08:42:15Z

remove runtime scope for jets3t






[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1402#issuecomment-48877612
  
QA tests have started for PR 1402. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16618/consoleFull




[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t

2014-07-14 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1402#issuecomment-48877650
  
It's set that way just because it is only used by Hadoop's FileSystem to access S3; code shouldn't call it directly. Maven should therefore include it in the runtime classpath and assembled jars, but not the compile classpath.

I seem to distantly recall that sbt doesn't quite do this. I hope there is an equivalent for sbt, as that's the right thing to do. It's not the end of the world to make jets3t appear on the compile classpath.

(So I assume the sbt assembly still has to be supported? The Maven one should be OK in this regard.)
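
For reference, the sbt counterpart of Maven's runtime scope would look something like this in build.sbt (a sketch; the version string is illustrative):

```
// Keep jets3t off the compile classpath but on the runtime classpath,
// mirroring Maven's <scope>runtime</scope>.
libraryDependencies += "net.java.dev.jets3t" % "jets3t" % "0.9.0" % "runtime"
```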




[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t

2014-07-14 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1402#issuecomment-48878585
  
I see. sbt-assembly doesn't pick up runtime-only jars by default, or maybe it doesn't read the POM correctly.

@ScrapCodes Do you know whether we can tell sbt-assembly to include runtime jars? It is weird that it doesn't do that by default; or maybe it is sbt-pom-reader's problem.
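
If it helps, sbt-assembly can apparently be pointed at the runtime classpath along these lines (an untested sketch assuming sbt-assembly's standard keys are in scope):

```
// In build.sbt: assemble the Runtime classpath instead of the default,
// so runtime-scoped jars such as jets3t land in the assembly jar.
fullClasspath in assembly := (fullClasspath in Runtime).value
```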




[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t

2014-07-14 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/1402#issuecomment-48880738
  
Hi,

I just checked: it is correctly included, per `show runtime:full-classpath`. I have to check why it is not in the assembly.




[GitHub] spark pull request: [MLLIB] [SPARK-2222] Add multiclass evaluation...

2014-07-14 Thread avulanov
Github user avulanov commented on a diff in the pull request:

https://github.com/apache/spark/pull/1155#discussion_r14870864
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala ---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Matrices, Matrix}
+import org.apache.spark.rdd.RDD
+
+import scala.collection.Map
+
+/**
+ * ::Experimental::
+ * Evaluator for multiclass classification.
+ *
+ * @param predictionAndLabels an RDD of (prediction, label) pairs.
+ */
+@Experimental
+class MulticlassMetrics(predictionAndLabels: RDD[(Double, Double)]) {
+
+  private lazy val labelCountByClass: Map[Double, Long] = predictionAndLabels.values.countByValue()
+  private lazy val labelCount: Long = labelCountByClass.values.sum
+  private lazy val tpByClass: Map[Double, Int] = predictionAndLabels
+    .map { case (prediction, label) =>
+      (label, if (label == prediction) 1 else 0)
+    }.reduceByKey(_ + _)
+    .collectAsMap()
+  private lazy val fpByClass: Map[Double, Int] = predictionAndLabels
+    .map { case (prediction, label) =>
+      (prediction, if (prediction != label) 1 else 0)
+    }.reduceByKey(_ + _)
+    .collectAsMap()
+  private lazy val confusions = predictionAndLabels.map {
+    case (prediction, label) => ((prediction, label), 1)
+  }.reduceByKey(_ + _).collectAsMap()
+
+  /**
+   * Returns confusion matrix:
+   * predicted classes are in columns,
+   * they are ordered by class label ascending,
+   * as in labels
+   */
+  lazy val confusionMatrix: Matrix = {
+    val transposedMatrix = Array.ofDim[Double](labels.size, labels.size)
+    for (i <- 0 to labels.size - 1; j <- 0 to labels.size - 1) {
+      transposedMatrix(i)(j) = confusions.getOrElse((labels(i), labels(j)), 0).toDouble
+    }
+    val flatMatrix = transposedMatrix.flatMap(arr => arr)
--- End diff --

I removed the intermediate matrix. Could you clarify what you want me to do with the for loop?
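
In case it helps, one loop-free alternative is to fill the column-major values array that Matrices.dense expects directly (a sketch; the row/column orientation may need flipping to match the intended transposition):

```
// Build the n x n confusion matrix without the intermediate 2-D array;
// Matrices.dense takes its values in column-major order.
val n = labels.size
val values = Array.tabulate(n * n) { k =>
  val i = k % n  // row index within a column
  val j = k / n  // column index
  confusions.getOrElse((labels(i), labels(j)), 0).toDouble
}
Matrices.dense(n, n, values)
```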




[GitHub] spark pull request: [MLLIB] [SPARK-2222] Add multiclass evaluation...

2014-07-14 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1155#issuecomment-48882671
  
@mengxr I addressed your comments, except the one above, which I commented on.




[GitHub] spark pull request: [MLLIB] [SPARK-2222] Add multiclass evaluation...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1155#issuecomment-48882753
  
QA tests have started for PR 1155. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16619/consoleFull




[GitHub] spark pull request: [WIP][SQL] By default does not run hive compat...

2014-07-14 Thread witgo
GitHub user witgo opened a pull request:

https://github.com/apache/spark/pull/1403

 [WIP][SQL] By default does not run hive compatibility tests



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/witgo/spark hive_compatibility

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1403.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1403


commit a6643165ce2a21cdbf5cad702bc4922308af5088
Author: witgo wi...@qq.com
Date:   2014-07-14T09:53:24Z

The default does not run hive compatibility tests






[GitHub] spark pull request: [WIP][SQL] By default does not run hive compat...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1403#issuecomment-48883582
  
QA tests have started for PR 1403. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16620/consoleFull




[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t

2014-07-14 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/1402#issuecomment-48883641
  
It is actually intended that way; the assembly is not meant to be a replacement for the runtime classpath. One way to go about it is to not have it as a runtime dependency.




[GitHub] spark pull request: move some test file to match src code

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1401#issuecomment-48883713
  
QA results for PR 1401:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class BroadcastSuite extends FunSuite with LocalSparkContext {
class ConnectionManagerSuite extends FunSuite {
class PipedRDDSuite extends FunSuite with SharedSparkContext {
class ZippedPartitionsSuite extends FunSuite with SharedSparkContext {
class AkkaUtilsSuite extends FunSuite with LocalSparkContext {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16617/consoleFull




[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t

2014-07-14 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1402#issuecomment-48883754
  
What is the assembly then? I always took this term to mean all of the 
runtime dependencies together. How would I make a runnable JAR in SBT in 
general?




[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t

2014-07-14 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/1402#issuecomment-48884485
  
The assembly should just include the compile scope. Take, for example, the case of slf4j, where the end user can bring in either log4j or logback; those bindings are supposed to be on the runtime classpath. Putting them in the assembly jar would defeat the purpose of having that pluggability at runtime.




[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t

2014-07-14 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1402#issuecomment-48885032
  
The SLF4J binding is a runtime dependency, and it may be one that a downstream consumer wants to override. But leaving it out entirely yields no runtime binding at all -- no logging. There is an argument for not including any SLF4J runtime dependency in Spark at all, which would require users to add their own to get any log messages. Debatable -- but a different question.

SLF4J is a bit of a funny example though. What about other runtime dependencies? jets3t is a good one. It has to be on the runtime classpath, so it needs to go into a runtime assembly JAR. It does not need to appear on the compile classpath, and ideally shouldn't.

This is not the same as provided scope, where the user is expected to bring the artifact from the local environment.

What's the standard SBT magic for this?




[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1402#issuecomment-48885040
  
QA results for PR 1402:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16618/consoleFull




[GitHub] spark pull request: [MLLIB] [SPARK-2222] Add multiclass evaluation...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1155#issuecomment-48890147
  
QA results for PR 1155:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class MulticlassMetrics(predictionAndLabels: RDD[(Double, Double)]) {
* (equals to precision for multiclass classifier

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16619/consoleFull




[GitHub] spark pull request: [WIP][SQL] By default does not run hive compat...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1403#issuecomment-48890757
  
QA results for PR 1403:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16620/consoleFull




[GitHub] spark pull request: [WIP][SQL] By default does not run hive compat...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1403#issuecomment-48894943
  
QA tests have started for PR 1403. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16621/consoleFull




[GitHub] spark pull request: [WIP][SQL] By default does not run hive compat...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1403#issuecomment-48896247
  
QA tests have started for PR 1403. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16622/consoleFull




[GitHub] spark pull request: Remove NOTE: SPARK_YARN is deprecated, please...

2014-07-14 Thread witgo
GitHub user witgo opened a pull request:

https://github.com/apache/spark/pull/1404

Remove NOTE: SPARK_YARN is deprecated, please use -Pyarn flag



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/witgo/spark run-tests

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1404.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1404


commit 04dd78e93be9371925424101e3a4722e0b558715
Author: witgo wi...@qq.com
Date:   2014-07-14T14:29:14Z

Remove NOTE: SPARK_YARN is deprecated, please use -Pyarn flag






[GitHub] spark pull request: [WIP][SQL] By default does not run hive compat...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1403#issuecomment-48907275
  
QA results for PR 1403:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16622/consoleFull




[GitHub] spark pull request: Remove NOTE: SPARK_YARN is deprecated, please...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1404#issuecomment-48907508
  
QA tests have started for PR 1404. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16623/consoleFull




[GitHub] spark pull request: Remove NOTE: SPARK_YARN is deprecated, please...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1404#issuecomment-48920976
  
QA results for PR 1404:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16623/consoleFull




[GitHub] spark pull request: [SPARK-2154] Schedule next Driver when one com...

2014-07-14 Thread aarondav
GitHub user aarondav opened a pull request:

https://github.com/apache/spark/pull/1405

[SPARK-2154] Schedule next Driver when one completes (standalone mode)



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aarondav/spark 2154

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1405.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1405


commit 24e9ef9c1cf0acfd62330c4e311cd47af85ea4a3
Author: Aaron Davidson aa...@databricks.com
Date:   2014-07-14T16:20:26Z

[SPARK-2154] Schedule next Driver when one completes (standalone mode)






[GitHub] spark pull request: [SPARK-2154] Schedule next Driver when one com...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1405#issuecomment-48922358
  
QA tests have started for PR 1405. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16624/consoleFull




[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...

2014-07-14 Thread yhuai
GitHub user yhuai opened a pull request:

https://github.com/apache/spark/pull/1406

[SPARK-2474][SQL] For a registered table in OverrideCatalog, the Analyzer 
failed to resolve references in the format of tableName.fieldName

Please refer to JIRA (https://issues.apache.org/jira/browse/SPARK-2474) for 
how to reproduce the problem and my understanding of the root cause.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yhuai/spark SPARK-2474

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1406.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1406


commit a5c2145a57f20b98bd34a7de16b0c697b7913ec9
Author: Yin Huai h...@cse.ohio-state.edu
Date:   2014-07-14T16:40:05Z

Support sql/console.

commit 568de1f92fcad7ab3293941106adf7fd943c5bed
Author: Yin Huai h...@cse.ohio-state.edu
Date:   2014-07-14T16:40:17Z

Wrap the relation in a Subquery named by the table name in 
OverrideCatalog.lookupRelation.






[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-14 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1259#issuecomment-48925260
  
Task ID is globally unique; index is unique within a stage.




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-14 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/1056#discussion_r14888926
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala ---
@@ -52,25 +52,24 @@ class BlockManagerMasterActor(val isLocal: Boolean, 
conf: SparkConf, listenerBus
 
   private val akkaTimeout = AkkaUtils.askTimeout(conf)
 
-  val slaveTimeout = conf.get("spark.storage.blockManagerSlaveTimeoutMs",
-    "" + (BlockManager.getHeartBeatFrequency(conf) * 3)).toLong
+  val slaveTimeout = conf.getLong("spark.storage.blockManagerSlaveTimeoutMs",
+    (conf.getInt("spark.executor.heartbeatInterval", 2000) * 3))
--- End diff --

No huge reason, just that it avoids nesting the `conf.get`s and makes it 
easier to see what the final value is (imagine if you had 5 more gets). Though 
I just noticed that here you can't do that easily because of the `* 3`, so 
scratch that.




[GitHub] spark pull request: [SPARK-2471] remove runtime scope for jets3t

2014-07-14 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1402#issuecomment-48927532
  
@ScrapCodes I agree with @srowen that the assembly jar should include 
runtime libraries except those marked "provided". If we cannot find a way to 
let sbt understand the Maven runtime scope, we should merge this change, because 
without jets3t all S3 operations are broken.




[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...

2014-07-14 Thread yhuai
Github user yhuai closed the pull request at:

https://github.com/apache/spark/pull/1406




[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...

2014-07-14 Thread yhuai
GitHub user yhuai reopened a pull request:

https://github.com/apache/spark/pull/1406

[SPARK-2474][SQL] For a registered table in OverrideCatalog, the Analyzer 
failed to resolve references in the format of tableName.fieldName

Please refer to JIRA (https://issues.apache.org/jira/browse/SPARK-2474) for 
how to reproduce the problem and my understanding of the root cause.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yhuai/spark SPARK-2474

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1406.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1406


commit a5c2145a57f20b98bd34a7de16b0c697b7913ec9
Author: Yin Huai h...@cse.ohio-state.edu
Date:   2014-07-14T16:40:05Z

Support sql/console.

commit c43ad00b8a07c168c56f88f632684c6fec5cd719
Author: Yin Huai h...@cse.ohio-state.edu
Date:   2014-07-14T17:17:29Z

Wrap the relation in a Subquery named by the table name in 
OverrideCatalog.lookupRelation.






[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...

2014-07-14 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1406#issuecomment-48929018
  
Jenkins, test this please.




[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1406#issuecomment-48929426
  
QA tests have started for PR 1406. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16626/consoleFull




[GitHub] spark pull request: [MLLIB] [SPARK-2222] Add multiclass evaluation...

2014-07-14 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1155#issuecomment-48930159
  
@avulanov In Scala, `for` is slower than `while`. See 
https://issues.scala-lang.org/browse/SI-1338 for example. So please replace the 
for loop with two while loops in your implementation.
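
For context, a runnable sketch of the rewrite being requested (a toy computation, not the PR's metrics code):

    object ForVsWhileDemo {
      // The for comprehension compiles to Range.foreach with a closure (see SI-1338);
      // the nested while form avoids that overhead on old Scala/JVM versions.
      def sumFor(n: Int): Long = {
        var s = 0L
        for (i <- 0 until n; j <- 0 until n) s += i * j
        s
      }

      def sumWhile(n: Int): Long = {
        var s = 0L
        var i = 0
        while (i < n) {
          var j = 0
          while (j < n) {
            s += i * j
            j += 1
          }
          i += 1
        }
        s
      }

      def main(args: Array[String]): Unit =
        assert(sumFor(100) == sumWhile(100)) // same result, different loop shape
    }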




[GitHub] spark pull request: [MLLIB] [SPARK-2222] Add multiclass evaluation...

2014-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1155#discussion_r14890904
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala 
---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.collection.Map
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Matrices, Matrix}
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for multiclass classification.
+ *
+ * @param predictionAndLabels an RDD of (prediction, label) pairs.
+ */
+@Experimental
+class MulticlassMetrics(predictionAndLabels: RDD[(Double, Double)]) {
+
+  private lazy val labelCountByClass: Map[Double, Long] =
+    predictionAndLabels.values.countByValue()
+  private lazy val labelCount: Long = labelCountByClass.values.sum
+  private lazy val tpByClass: Map[Double, Int] = predictionAndLabels
+    .map { case (prediction, label) =>
+      (label, if (label == prediction) 1 else 0)
+    }.reduceByKey(_ + _)
+    .collectAsMap()
+  private lazy val fpByClass: Map[Double, Int] = predictionAndLabels
+    .map { case (prediction, label) =>
+      (prediction, if (prediction != label) 1 else 0)
+    }.reduceByKey(_ + _)
+    .collectAsMap()
+  private lazy val confusions = predictionAndLabels.map {
+    case (prediction, label) => ((prediction, label), 1)
+  }.reduceByKey(_ + _).collectAsMap()
+
+  /**
+   * Returns confusion matrix:
+   * predicted classes are in columns,
+   * they are ordered by class label ascending,
+   * as in labels
+   */
+  lazy val confusionMatrix: Matrix = {
+    val transposedFlatMatrix = Array.ofDim[Double](labels.size * labels.size)
--- End diff --

Save `labels.size` to `n`? Btw, I'm not sure whether we should use `lazy 
val` here because the result matrix could be 1000x1000, different from other 
lazy vals used here.
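
Sketched form of the two suggestions, with local stand-ins rather than the PR's fields:

    object ConfusionMatrixSketch {
      def main(args: Array[String]): Unit = {
        val labels = Array(0.0, 1.0, 2.0)
        val n = labels.size                                   // hoisted once
        // Built eagerly; for a potentially 1000x1000 result, a plain val (or a def)
        // may be preferable to a lazy val.
        val transposedFlatMatrix = Array.ofDim[Double](n * n) // n x n, column-major
        println(transposedFlatMatrix.length)                  // 9
      }
    }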




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1390#issuecomment-48930414
  
Thanks for reviewing this, everyone.  I'm all for commenting and cleaning 
things up here, but if possible I'd like to merge this in today.  There are a 
couple of people blocking on this, as it's a pretty severe performance bug.  How 
about we just add some TODOs that can be taken care of in a follow-up PR?




[GitHub] spark pull request: [MLLIB] [SPARK-2222] Add multiclass evaluation...

2014-07-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1155#discussion_r14891003
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala 
---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.evaluation
+
+import scala.collection.Map
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Matrices, Matrix}
+import org.apache.spark.rdd.RDD
+
+/**
+ * ::Experimental::
+ * Evaluator for multiclass classification.
+ *
+ * @param predictionAndLabels an RDD of (prediction, label) pairs.
+ */
+@Experimental
+class MulticlassMetrics(predictionAndLabels: RDD[(Double, Double)]) {
+
+  private lazy val labelCountByClass: Map[Double, Long] =
+    predictionAndLabels.values.countByValue()
+  private lazy val labelCount: Long = labelCountByClass.values.sum
+  private lazy val tpByClass: Map[Double, Int] = predictionAndLabels
+    .map { case (prediction, label) =>
+      (label, if (label == prediction) 1 else 0)
+    }.reduceByKey(_ + _)
+    .collectAsMap()
+  private lazy val fpByClass: Map[Double, Int] = predictionAndLabels
+    .map { case (prediction, label) =>
+      (prediction, if (prediction != label) 1 else 0)
+    }.reduceByKey(_ + _)
+    .collectAsMap()
+  private lazy val confusions = predictionAndLabels.map {
+    case (prediction, label) => ((prediction, label), 1)

The code style is not consistent with the blocks above. Please move `case 
(prediction, label) =>` to the line above.
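
That is, the layout below, shown on plain collections so the snippet runs standalone (groupBy plus map stands in for reduceByKey):

    object CaseStyleDemo {
      def main(args: Array[String]): Unit = {
        val predictionAndLabels = Seq((1.0, 1.0), (0.0, 1.0), (1.0, 1.0))
        val confusions = predictionAndLabels
          .map { case (prediction, label) =>
            ((prediction, label), 1)
          }
          .groupBy(_._1)
          .map { case (key, ones) => (key, ones.size) }
        println(confusions) // Map((1.0,1.0) -> 2, (0.0,1.0) -> 1)
      }
    }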




[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1406#issuecomment-48931344
  
QA tests have started for PR 1406. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16627/consoleFull




[GitHub] spark pull request: move some test file to match src code

2014-07-14 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1401#issuecomment-48931698
  
Merging in master. Thanks!





[GitHub] spark pull request: move some test file to match src code

2014-07-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1401




[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

2014-07-14 Thread jkbradley
GitHub user jkbradley opened a pull request:

https://github.com/apache/spark/pull/1407

SPARK-1215: Clustering: Index out of bounds error

Bug fix for JIRA SPARK-1215: Clustering: Index out of bounds error

 https://issues.apache.org/jira/browse/SPARK-1215

Solution: print a warning and duplicate existing cluster centers so that exactly 
k centers are returned.
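
A minimal sketch of that idea; padToK is a hypothetical helper for illustration, not the patch itself:

    object PadCentersDemo {
      // If fewer than k distinct centers exist, warn and duplicate one so
      // callers still get exactly k centers back.
      def padToK[T](centers: Seq[T], k: Int): Seq[T] = {
        if (centers.size < k) {
          Console.err.println(
            s"Warning: only ${centers.size} distinct centers; duplicating to reach $k")
          centers ++ Seq.fill(k - centers.size)(centers.last)
        } else {
          centers.take(k)
        }
      }

      def main(args: Array[String]): Unit =
        assert(padToK(Seq("c1", "c2"), 4) == Seq("c1", "c2", "c2", "c2"))
    }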

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jkbradley/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1407.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1407


commit 97f2104bac2ab864c2a03f9a12a4b936557ae6d6
Author: Joseph Bradley joseph.kurata.brad...@gmail.com
Date:   2014-05-20T01:35:53Z

added RDD::stratifiedSample method and associated unit tests in RDDSuite.  
Method is built off of RDD::takeSample method.

commit 91e83338820158b96cda492668dbed5fff33f19b
Author: Joseph Bradley joseph.kurata.brad...@gmail.com
Date:   2014-05-20T01:36:30Z

added RDD::stratifiedSample method documentation

commit d6f8913b7e370a82138b9c623754b32a59c21cf6
Author: Joseph Bradley joseph.kurata.brad...@gmail.com
Date:   2014-05-21T04:58:12Z

updated stratifiedSample to be more scalable, keeping data in RDDs instead 
of collecting to the driver

commit 21eead6a412508b536358f4e557e2fab23c9c696
Author: Joseph Bradley joseph.kurata.brad...@gmail.com
Date:   2014-05-23T19:51:07Z

updated stratifiedSample to use selection-rejection to select samples on 
each partition in 1 pass, rather than pre-selecting indices

commit 91f4b19702bc58a77d28316674eace881a81165f
Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com
Date:   2014-07-09T21:37:25Z

merging with new spark

commit c0cb5f0d8c6104e3eb6cfa44820ba00b81bc7262
Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com
Date:   2014-07-11T18:43:17Z

merging with updated spark

commit 7d1b812a720cffdefe78ddb6e641930e7ae4975b
Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com
Date:   2014-07-11T18:47:41Z

removed my coding test updates

commit 18e5c8ad740871be92c6d7b73f5d35e25641a734
Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com
Date:   2014-07-12T01:12:44Z

Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle 
case with fewer distinct data points than clusters k.  Added two related unit 
tests to KMeansSuite.

commit e2bf638c6b3e8cc9cec3362caddb2305109d4c0a
Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com
Date:   2014-07-14T17:52:33Z

Merge remote-tracking branch 'upstream/master'






[GitHub] spark pull request: [SQL][CORE] SPARK-2102

2014-07-14 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1377#issuecomment-48934357
  
@pwendell, any thoughts on the additional option for kryo?




[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

2014-07-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1407#issuecomment-48934486
  
Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-2154] Schedule next Driver when one com...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1405#issuecomment-48934747
  
QA results for PR 1405:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16624/consoleFull




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin closed the pull request at:

https://github.com/apache/spark/pull/1390




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1390#issuecomment-48935494
  
@yhuai suggested a much simpler fix -- I benchmarked this and it gave the 
same performance boost. I am closing this and opening a new PR.




[GitHub] spark pull request: [PySpark] hijack hash to make hash of None con...

2014-07-14 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1371#issuecomment-48935684
  
Jenkins, test this please




[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-07-14 Thread freeman-lab
Github user freeman-lab commented on the pull request:

https://github.com/apache/spark/pull/1361#issuecomment-48935929
  
@mengxr I added two tests, they check that parameter estimates are 
accurate, and improve over time. The tests use temporary file writing / file 
streams, which is clunky, but @tdas will help add dependencies on the streaming 
test suite so we can use its utilities instead.




[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1406#issuecomment-48936402
  
QA results for PR 1406:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16625/consoleFull




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
GitHub user concretevitamin opened a pull request:

https://github.com/apache/spark/pull/1408

[SPARK-2443][SQL] Fix slow read from partitioned tables

This fix obtains a performance boost comparable to [PR 
#1390](https://github.com/apache/spark/pull/1390) by moving an array update and 
deserializer initialization out of a potentially very long loop. Suggested by 
@yhuai. The results below are updated for this fix.

## Benchmarks
Generated a local text file with 10M rows of simple key-value pairs. The 
data is loaded as a table through Hive. Results are obtained on my local 
machine using hive/console.

Without the fix:

Type | Non-partitioned | Partitioned (1 part)
---- | --------------- | --------------------
First run | 9.52s end-to-end (1.64s Spark job) | 36.6s (28.3s)
Stabilized runs | 1.21s (1.18s) | 27.6s (27.5s)

With this fix:

Type | Non-partitioned | Partitioned (1 part)
---- | --------------- | --------------------
First run | 9.57s (1.46s) | 11.0s (1.69s)
Stabilized runs | 1.13s (1.10s) | 1.23s (1.19s)



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/concretevitamin/spark slow-read-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1408.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1408


commit d86e437218f99179934ccd9b4d5d89c02b09459d
Author: Zongheng Yang zonghen...@gmail.com
Date:   2014-07-14T18:03:07Z

Move update & initialization out of potentially long loop.
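
A runnable toy of the hoisting pattern this commit applies; the local Parser stands in for the Hive deserializer, and none of these names come from TableReader:

    object HoistDemo {
      final class Parser { def parse(s: String): Int = s.trim.toInt }

      def main(args: Array[String]): Unit = {
        val rows = Seq(" 1", "2 ", " 3 ")
        // Slow shape: invariant setup re-done on every row of a potentially long loop.
        val slow = rows.map { r => val p = new Parser; p.parse(r) }
        // Fast shape: initialize once, then reuse inside the loop.
        val parser = new Parser
        val fast = rows.map(parser.parse)
        assert(slow == fast)
      }
    }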






[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1390#issuecomment-48936743
  
New PR here: #1408 




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1408#issuecomment-48936856
  
Jenkins, test this please.




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1408#issuecomment-48937213
  
QA tests have started for PR 1408. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16631/consoleFull




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1390#discussion_r14894946
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: 
TableDesc, @transient sc: HiveCon
 
   // Create local references so that the outer object isn't serialized.
   val tableDesc = _tableDesc
+  val tableSerDeClass = tableDesc.getDeserializerClass
+
   val broadcastedHiveConf = _broadcastedHiveConf
   val localDeserializer = partDeserializer
 
   val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc)
-  hivePartitionRDD.mapPartitions { iter =>
+  hivePartitionRDD.mapPartitions { case iter =>
--- End diff --

I initially thought that in a function context, `{ case x => ... }` would be 
optimized to `{ x => ... }`. I did a `scalac -print` on a simple program to 
confirm that this is not the case.
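
A self-contained pair to compare under `scalac -print`; the `case` version compiles to an apply method that pattern-matches its argument, so the two are not identical:

    object CaseLambdaDemo {
      val plain: Iterator[Int] => Int = iter => iter.sum            // plain Function1
      val matched: Iterator[Int] => Int = { case iter => iter.sum } // match inside apply

      def main(args: Array[String]): Unit =
        println(plain(Iterator(1, 2)) + matched(Iterator(3, 4)))
    }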




[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-48941276
  
Jenkins, test this please.




[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-48941686
  
QA tests have started for PR 1238. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16632/consoleFull




[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1406#issuecomment-48941916
  
QA results for PR 1406:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16626/consoleFull




[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r14896742
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala 
---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+
+/**
+ * New correlation algorithms should implement this trait
+ */
+trait Correlation {
+
+  /**
+   * Compute correlation for two datasets.
+   */
+  def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double
+
+  /**
+   * Compute the correlation matrix S, for the input matrix, where S(i, j) is the correlation
+   * between column i and j.
+   */
+  def computeCorrelationMatrix(X: RDD[Vector]): Matrix
+
+  /**
+   * Combine the two input RDD[Double]s into an RDD[Vector] and compute the correlation using the
+   * correlation implementation for RDD[Vector]
+   */
+  def computeCorrelationWithMatrixImpl(x: RDD[Double], y: RDD[Double]): Double = {
+    val mat: RDD[Vector] = x.zip(y).mapPartitions({ iter =>
--- End diff --

No can do here since I have a second argument (preservesPartitioning).




[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r14896912
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala 
---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+
+/**
+ * New correlation algorithms should implement this trait
+ */
+trait Correlation {
+
+  /**
+   * Compute correlation for two datasets.
+   */
+  def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double
+
+  /**
+   * Compute the correlation matrix S, for the input matrix, where S(i, j) is the correlation
+   * between column i and j.
+   */
+  def computeCorrelationMatrix(X: RDD[Vector]): Matrix
+
+  /**
+   * Combine the two input RDD[Double]s into an RDD[Vector] and compute the correlation using the
+   * correlation implementation for RDD[Vector]
+   */
+  def computeCorrelationWithMatrixImpl(x: RDD[Double], y: RDD[Double]): Double = {
+    val mat: RDD[Vector] = x.zip(y).mapPartitions({ iter =>
+      iter.map { case (xi, yi) => new DenseVector(Array(xi, yi)) }
+    }, preservesPartitioning = true)
+    computeCorrelationMatrix(mat)(0, 1)
+  }
+
+}
+
+/**
+ * Delegates computation to the specific correlation object based on the input method name
+ *
+ * Currently supported correlations: pearson, spearman.
+ * After new correlation algorithms are added, please update the documentation here and in
+ * Statistics.scala for the correlation APIs.
+ *
+ * Cases are ignored when doing method matching. We also allow initials, e.g. "P" for "pearson",
+ * as long as initials are unique in the supported set of correlation algorithms. In addition, a
--- End diff --

Does that mean no toLowerCase either?
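
For reference, a minimal sketch of the rule the doc comment describes; resolve is a hypothetical helper, not the PR's matching code:

    object MethodNameDemo {
      val supported = Seq("pearson", "spearman")

      // Case-insensitive match; a single-letter initial is accepted when unique.
      def resolve(method: String): Option[String] = {
        val m = method.toLowerCase
        val hits = supported.filter(s => s == m || (m.length == 1 && s.startsWith(m)))
        if (hits.size == 1) Some(hits.head) else None
      }

      def main(args: Array[String]): Unit = {
        assert(resolve("Pearson") == Some("pearson"))
        assert(resolve("P") == Some("pearson"))
      }
    }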




[GitHub] spark pull request: [SPARK-2474][SQL] For a registered table in Ov...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1406#issuecomment-48945794
  
QA results for PR 1406:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16627/consoleFull




[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r14897552
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala
 ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import breeze.linalg.{DenseMatrix = BDM}
+
+import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector}
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.rdd.RDD
+
+/**
+ * Compute Pearson correlation for two RDDs of the type RDD[Double] or the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Pearson correlation can be found at
+ * http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
+ */
+object PearsonCorrelation extends Correlation {
+
+  /**
+   * Compute the Pearson correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = {
+    computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute the Pearson correlation matrix S, for the input matrix, where S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+    val rowMatrix = new RowMatrix(X)
+    val cov = rowMatrix.computeCovariance()
+    computeCorrelationMatrixFromCovariance(cov)
+  }
+
+  /**
+   * Compute the Pearson correlation matrix from the covariance matrix.
+   */
+  def computeCorrelationMatrixFromCovariance(covarianceMatrix: Matrix): Matrix = {
+    val cov = covarianceMatrix.toBreeze.asInstanceOf[BDM[Double]]
+    val n = cov.cols
+
+    // Compute the standard deviation on the diagonals first
+    var i = 0
+    while (i < n) {
+      cov(i, i) = math.sqrt(cov(i, i))
+      i += 1
+    }
+    // or we could put the stddev in its own array to trade space for one less pass over the matrix
+
+    // TODO: use blas.dspr instead to compute the correlation matrix
+    // if the covariance matrix comes in the upper triangular form for free
+
+    // Loop through columns since cov is column major
+    var j = 0
+    var sigma = 0.0
+    while (j < n) {
+      sigma = cov(j, j)
+      i = 0
+      while (i < j) {
+        val covariance = cov(i, j) / (sigma * cov(i, i))
--- End diff --

What do you want returned when cov(i, i) is zero? Double.NaN? 0.0? 




[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

2014-07-14 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1407#issuecomment-48948316
  
Jenkins, add to whitelist and test this please




[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-14 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r14898523
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala
 ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import breeze.linalg.{DenseMatrix = BDM}
+
+import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector}
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.rdd.RDD
+
+/**
+ * Compute Pearson correlation for two RDDs of the type RDD[Double] or the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Pearson correlation can be found at
+ * http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
+ */
+object PearsonCorrelation extends Correlation {
+
+  /**
+   * Compute the Pearson correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = {
+    computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute the Pearson correlation matrix S, for the input matrix, where S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+    val rowMatrix = new RowMatrix(X)
+    val cov = rowMatrix.computeCovariance()
+    computeCorrelationMatrixFromCovariance(cov)
+  }
+
+  /**
+   * Compute the Pearson correlation matrix from the covariance matrix.
+   */
+  def computeCorrelationMatrixFromCovariance(covarianceMatrix: Matrix): Matrix = {
+    val cov = covarianceMatrix.toBreeze.asInstanceOf[BDM[Double]]
+    val n = cov.cols
+
+    // Compute the standard deviation on the diagonals first
+    var i = 0
+    while (i < n) {
+      cov(i, i) = math.sqrt(cov(i, i))
+      i += 1
+    }
+    // or we could put the stddev in its own array to trade space for one less pass over the matrix
+
+    // TODO: use blas.dspr instead to compute the correlation matrix
+    // if the covariance matrix comes in the upper triangular form for free
+
+    // Loop through columns since cov is column major
+    var j = 0
+    var sigma = 0.0
+    while (j < n) {
+      sigma = cov(j, j)
+      i = 0
+      while (i < j) {
+        val covariance = cov(i, j) / (sigma * cov(i, i))
--- End diff --

I think the most honest result is NaN. R will return an error for example. 
You will get that already as the result of 0.0 / 0.0 in the JVM. It's worth 
documenting!
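
Concretely:

    object NaNDemo {
      def main(args: Array[String]): Unit = {
        val r = 0.0 / 0.0        // zero covariance over zero stddev
        println(r.isNaN)         // true: the JVM yields NaN rather than erroring
        println(r == Double.NaN) // false: NaN never compares equal; use isNaN
      }
    }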




[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1407#issuecomment-48949060
  
QA tests have started for PR 1407. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16633/consoleFull




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1390#discussion_r14898638
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: 
TableDesc, @transient sc: HiveCon
 
   // Create local references so that the outer object isn't serialized.
   val tableDesc = _tableDesc
+  val tableSerDeClass = tableDesc.getDeserializerClass
+
   val broadcastedHiveConf = _broadcastedHiveConf
   val localDeserializer = partDeserializer
 
   val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc)
-  hivePartitionRDD.mapPartitions { iter =>
+  hivePartitionRDD.mapPartitions { case iter =>
--- End diff --

oh really?  how does the generated bytecode differ?




[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r14899055
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
 ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = {
+    computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+    val indexed = X.zipWithIndex()
--- End diff --

I need the exact index of each entry in order to compute accurate ranks in 
the case of duplicate values. zipWithUniqueId doesn't seem to allow that, 
especially when duplicate values fall into different partitions. Suggestions on 
making it work with zipWithUniqueId?
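
To spell out the difference, here is a local model of the two id schemes, with a Seq of Seqs standing in for an RDD's partitions:

    object IdSchemesDemo {
      def main(args: Array[String]): Unit = {
        val parts = Seq(Seq("a", "b"), Seq("c", "d", "e"))
        val n = parts.size
        // zipWithIndex-style: consecutive global positions, usable as exact ranks
        // (Spark pays an extra pass to learn partition sizes).
        val withIndex = parts.flatten.zipWithIndex
        // zipWithUniqueId-style: the k-th item of partition p gets k * n + p;
        // unique and cheap, but not consecutive, so not usable as a sorted position.
        val withUniqueId = parts.zipWithIndex.flatMap { case (part, p) =>
          part.zipWithIndex.map { case (x, k) => (x, k.toLong * n + p) }
        }
        println(withIndex)    // (a,0) (b,1) (c,2) (d,3) (e,4)
        println(withUniqueId) // (a,0) (b,2) (c,1) (d,3) (e,5)
      }
    }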




[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-14 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r14899116
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
 ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = {
+    computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+    val indexed = X.zipWithIndex()
+    // Attempt to checkpoint the RDD before splitting it into numCols RDD[Double]s to avoid
+    // computing the lineage prefix multiple times.
+    // If checkpoint directory not set, cache the RDD instead.
+    try {
+      indexed.checkpoint()
+    } catch {
+      case e: Exception => indexed.cache()
+    }
+
+    val numCols = X.first.size
+    val ranks = new Array[RDD[(Long, Double)]](numCols)
+
+    // Note: we use a for loop here instead of a while loop with a single index variable
+    // to avoid race condition caused by closure serialization
+    for (k <- 0 until numCols) {
+      val column = indexed.map { case (vector, index) =>
+        (vector(k), index)
+      }
+      ranks(k) = getRanks(column)
+    }
+
+    val ranksMat: RDD[Vector] = makeRankMatrix(ranks)
+    PearsonCorrelation.computeCorrelationMatrix(ranksMat)
+  }
+
+  /**
+   * Compute the ranks for elements in the input RDD, using the average method for ties.
+   *
+   * With the average method, elements with the same value receive the same rank that's computed
+   * by taking the average of their positions in the sorted list.
+   * e.g. ranks([2, 1, 0, 2]) = [3.5, 2.0, 1.0, 3.5]
+   */
+  private def getRanks(indexed: RDD[(Double, Long)]): RDD[(Long, Double)] = {
+    // Get elements' positions in the sorted list for computing average rank for duplicate values
+    val sorted = indexed.sortByKey().zipWithIndex()
+    val groupedByValue = sorted.groupBy(_._1._1)
--- End diff --

"Breaking ties" is a loose description of what happens. I actually need all 
items of the same value in the same partition in order to take the average of 
their positions in the sorted list. I'm open to suggestions on how to make it 
work with mapPartitions, though.
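
For readers, a local sketch of the average-rank rule from the doc comment above; groupBy plays the role of the RDD groupBy, keeping ties together so their positions can be averaged:

    object AverageRanksDemo {
      def averageRanks(xs: Seq[Double]): Seq[Double] = {
        val positions = xs.zipWithIndex.sortBy(_._1).zipWithIndex // ((value, origIdx), sortedPos)
        val rankByValue = positions.groupBy(_._1._1).map { case (value, group) =>
          value -> group.map(_._2 + 1).sum.toDouble / group.size  // 1-based positions, averaged
        }
        xs.map(rankByValue)
      }

      def main(args: Array[String]): Unit =
        assert(averageRanks(Seq(2.0, 1.0, 0.0, 2.0)) == Seq(3.5, 2.0, 1.0, 3.5))
    }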




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1408#issuecomment-48951901
  
QA results for PR 1408:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16631/consoleFull



